<@U02EBC3N05N> Hi, just a heads up. We’ve just saw...
# ask-community
y
@Jake Kaplan Hi, just a heads up. We’ve just seen an error spike with Prefect Cloud + ECS push pool: https://github.com/PrefectHQ/prefect/issues/14730
If the team needs more context/info just message me.
j
Hey! Unfortunately this is a network issue between your AWS account and DockerHub. https://repost.aws/knowledge-center/ecs-pull-container-error
If you're getting a constant stream of errors and there's no outage on AWS or DockerHub, something in your network setup may have changed?
y
Nothing has changed from our side…
We saw 3 failed runs
j
In that case it sounds like a transient error between AWS and DockerHub. Unfortunately it's not uncommon for DockerHub to have transient network failures. I don't know if there is an ECS setting that controls retries when pulling the image. This could be a good use case for setting up an automation to retry the run if it crashes during setup?
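To make the "retry the run if it crashes during setup" idea concrete, here is a minimal sketch of retry with exponential backoff in plain Python. It is not Prefect's automation API and not an ECS setting: `TransientSetupError`, `run_with_retries`, and `start_run` are hypothetical names used only for illustration, and in practice the retry would be configured as an automation rather than hand-written like this.

```python
import time


class TransientSetupError(Exception):
    """Hypothetical stand-in for a crash during setup, e.g. an image pull timeout."""


def run_with_retries(start_run, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call start_run(), retrying on TransientSetupError with exponential backoff.

    start_run is any zero-argument callable (an assumption for this sketch).
    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return start_run()
        except TransientSetupError:
            if attempt == max_attempts:
                raise  # give up and surface the failure to the caller
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Only the error type treated as transient is retried; any other exception propagates immediately, so real bugs are not masked by the retry loop.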
y
I think it should be implemented as an “automation”, but on the Prefect side, meaning not the automations you define in Prefect’s UI, because those can’t target a step that deep in the system.
Maybe Prefect’s offering regarding the ECS connector as a whole could be more robust, and handle internally such cases and “hide” them from the user.
It’s a point to think about…
j
For sure! This type of error is definitely a little elusive, in that it exists somewhat outside of the immediate control that Prefect has. For example, say a push work pool gets an API error attempting to hit the create ECS task run endpoint. That is very much within the domain of a push work pool, and if there's a transient network error we hide that from the user and try again.
In this case the failure is between ECS and another external system. Prefect is just reporting a failure that it has observed. I totally understand, though, that a user just wants things to work and not fail! While Prefect is taking the first step in reporting that failure to the user, there's no automatic action because ultimately we don't really know what the failure is or how to remedy it.
Thanks again for flagging this though. Your feedback is always really welcome and important! I will for sure give some more thought to how automatic retries might fit in here.
👍 1
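The distinction drawn above (retry silently when the pool's own submit call fails transiently, only report when the failure happens between ECS and another system) can be sketched as a small decision function. This is an illustration of the reasoning in the message, not Prefect's actual implementation; `FailureSource`, `Failure`, and `decide` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class FailureSource(Enum):
    SUBMIT_API = "submit_api"  # the pool's own call to create the ECS task
    DOWNSTREAM = "downstream"  # ECS talking to another system, e.g. pulling an image


@dataclass
class Failure:
    source: FailureSource
    transient: bool


def decide(failure: Failure) -> str:
    """Return "retry" only for transient failures in the pool's own API call.

    Everything else is reported to the user, because the cause and the
    remedy are outside what the pool can know.
    """
    if failure.source is FailureSource.SUBMIT_API and failure.transient:
        return "retry"
    return "report"
```

Under this sketch, a transient image pull error is classified as `DOWNSTREAM` and therefore reported rather than retried, which matches the behaviour described in the thread.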
y
We are seeing more and more of those 504 errors in the last few days
Here is a fresh one for example:
CleanShot 2024-07-25 at 14.37.59@2x.jpg