<Jake Kaplan> Hi, just a heads up. We’ve just seen some error Prefect Community #ask-community

Yaron Levi

07/24/2024, 2:05 PM

@Jake Kaplan Hi, just a heads up. We’ve just seen an error spike with Prefect Cloud + ECS push pool: https://github.com/PrefectHQ/prefect/issues/14730

Yaron Levi

07/24/2024, 2:06 PM

If the team needs more context/info just message me.

Jake Kaplan

07/24/2024, 2:20 PM

Hey! Unfortunately this is a network issue between your AWS account and DockerHub. https://repost.aws/knowledge-center/ecs-pull-container-error

Jake Kaplan

07/24/2024, 2:21 PM

If you're getting a constant stream of errors and there's no outage on AWS or DockerHub, something in your network setup may have changed?

Yaron Levi

07/24/2024, 2:25 PM

Nothing has changed from our side…

Yaron Levi

07/24/2024, 2:26 PM

We saw 3 failed runs

Jake Kaplan

07/24/2024, 2:34 PM

In that case it sounds like a transient error between AWS and DockerHub. It's unfortunately not uncommon for DockerHub to have transient network failures. I don't know if there is an ECS setting that controls retries when pulling the image. This could be a good use case for setting up an automation to retry the run if it crashes during setup?
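
The retry rule suggested above could be sketched roughly as follows. This is only an illustration of the decision logic, not Prefect's automation API: `RunRecord`, `should_retry`, and the `max_attempts` limit are hypothetical names and assumptions introduced for this example.

```python
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical summary of a flow run; not a Prefect class."""
    run_id: str
    state: str            # e.g. "CRASHED", "FAILED", "COMPLETED"
    started: bool         # False when the run crashed during setup, before flow code ran
    attempts: int = 1     # how many times this run has been submitted so far


def should_retry(run: RunRecord, max_attempts: int = 3) -> bool:
    """Retry only setup-time crashes, and only up to max_attempts submissions."""
    return run.state == "CRASHED" and not run.started and run.attempts < max_attempts
```

Restricting the rule to runs that never started keeps it from re-running flow code that may already have had side effects.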

Yaron Levi

07/24/2024, 2:43 PM

I think it should be implemented as an “automation”, but on the Prefect side, meaning not the automations you define in Prefect’s UI, because those can’t target a step that deep in the system.

Yaron Levi

07/24/2024, 2:44 PM

Maybe Prefect’s ECS connector offering as a whole could be more robust, handling such cases internally and “hiding” them from the user.

Yaron Levi

07/24/2024, 2:44 PM

It’s a point to think about…

Jake Kaplan

07/24/2024, 4:34 PM

For sure! This type of error is definitely a little elusive in that it exists somewhat outside of the immediate control that Prefect has. For example, say a push work pool gets an API error attempting to hit the create ECS task run endpoint. That is very much within the domain of a push work pool, and if there's a transient network error we obscure that from the user and try again. In this case the failure is between ECS and another external system. Prefect is just reporting a failure that it has observed. I totally understand that a user just wants things to work and not fail, though! While Prefect is taking the first step in reporting that failure to the user, there's no automatic action because ultimately we don't really know what the failure is or how to remedy it. Thanks again for flagging this though. Your feedback is always really welcome and important! I will for sure give some more thought to how automatic retries might fit into things.

👍 1
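
The "obscure a transient error and try again" behavior described above can be illustrated with a generic retry-with-backoff wrapper. This is a sketch under stated assumptions, not how Prefect's push work pools are implemented: `TransientError`, `call_with_retries`, and the default attempt count and delays are hypothetical.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for errors that are worth retrying."""


def call_with_retries(fn, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff.

    Any other exception, or a TransientError on the final attempt,
    propagates to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

A wrapper like this only helps for calls the caller makes itself. A failure between ECS and an image registry happens outside that call, which is why it surfaces as a reported failure rather than a silent retry.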

Yaron Levi

07/25/2024, 11:37 AM

We have been seeing more and more of those 504 errors over the last few days

Yaron Levi

07/25/2024, 11:37 AM

Here is a fresh one for example:

Yaron Levi

07/25/2024, 11:37 AM

https://app.prefect.cloud/account/8eed9803-456a-4126-a7f7-074aa44aa1b2/workspace/8ff9[…]57758d46d/runs/flow-run/03a420c7-797e-42b9-a509-9019b02b32d7

Yaron Levi

07/25/2024, 11:38 AM

CleanShot 2024-07-25 at 14.37.59@2x.jpg