# ask-community
Nick Torba
Hello prefect team, I am having a lot of trouble with one of our ECS push work pools. I commented on a related issue here: https://github.com/PrefectHQ/prefect/issues/18429#issuecomment-3263294213 Each night, we have a wave of ~300 runs get submitted. They all end up running many hours late because they sit in LATE status. The concurrency limit on the work pool is 20, and the deployments don't have concurrency limits. I don't think it is an issue with the ECS infra, because I can't see any logs; it looks like the runs are never being submitted at all. Eventually, they do run. It seems like maybe something on the Prefect backend is telling Cloud not to submit them? But I am not sure where to look to figure out whether it is something on our ECS side that could be causing this.
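(A rough way to sanity-check whether the pool's concurrency limit is actually saturated is the Prefect Python client; the sketch below assumes a recent Prefect 3.x install, and the pool name `ecs-push-pool` is a placeholder.)

```python
# Sketch: check whether the work pool's concurrency limit is the bottleneck.
# Assumes Prefect 3.x; "ecs-push-pool" is a placeholder pool name.
import asyncio

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
    WorkPoolFilter,
    WorkPoolFilterName,
)

POOL = "ecs-push-pool"


async def main():
    async with get_client() as client:
        pool = await client.read_work_pool(POOL)
        print(f"pool concurrency limit: {pool.concurrency_limit}")

        for state in ("Running", "Pending", "Late"):
            runs = await client.read_flow_runs(
                work_pool_filter=WorkPoolFilter(name=WorkPoolFilterName(any_=[POOL])),
                flow_run_filter=FlowRunFilter(
                    state=FlowRunFilterState(name=FlowRunFilterStateName(any_=[state]))
                ),
                limit=200,  # counts are capped at 200 per state
            )
            print(f"{state}: {len(runs)} runs")


asyncio.run(main())
```

If Running + Pending stay well under the limit while Late keeps growing, the pool limit itself isn't what is holding runs back.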
Nate
hi @Nick Torba - can you share what version of `prefect-aws` you're using? just as an FYI, we have been working on improving the ECS worker, if you're willing/able to check it out
Nick Torba
This is a push work pool, so we aren't running the ECS worker. @Nate
Nate
ah i see. i would open an issue/discussion about this then
might be worth checking for old/zombie runs in LATE/CANCELLING etc (one way to check is sketched below)
> The concurrency limit on the work pool is 20
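(A hedged sketch of that zombie-run check with the Prefect Python client; the state names and the 200-run cap below are arbitrary, and it assumes Prefect 3.x.)

```python
# Sketch: surface old/"zombie" runs stuck in Late or Cancelling,
# sorted oldest-first by expected start time. Assumes Prefect 3.x.
import asyncio

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
from prefect.client.schemas.sorting import FlowRunSort


async def main():
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["Late", "Cancelling"])
                )
            ),
            sort=FlowRunSort.EXPECTED_START_TIME_ASC,
            limit=200,
        )
        for run in runs:
            print(run.expected_start_time, run.state_name, run.name)


asyncio.run(main())
```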
Yaron Levi
👀
Nick Torba
The LATE runs aren't zombies. They are runs that have been submitted to Prefect Cloud but have not yet been submitted to the AWS ECS infra. I don't see any zombie CANCELLING runs. There were a handful of unexpected CANCELLED runs (that I think should have been FAILED), but the concurrency limit was much higher than the number of cancelled runs. I will open a new ticket tonight
Hey @Nate @Yaron Levi, Thanks for the replies over the weekend. I made a ticket here last night: https://github.com/PrefectHQ/prefect/issues/18877 Any idea how long it'll take to have someone dig into this?
Nate
thanks for opening the issue! we'll look into it as soon as we can. feel free to email help@prefect.io if you need something urgently
Yaron Levi
@Nate @Nick Torba In the last couple of weeks we started seeing errors of this type:
```
Flow run could not be submitted to infrastructure: TaskFailedToStart - CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/yaronlevi/prefect/manifests/sha256:df9119a64997599ed66d07a1742ad9a405a73ed5fe006d4848a41d6eaee9e8e4: 504 Gateway Time-out
```
```
Flow run could not be submitted to infrastructure: TaskFailedToStart - CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/yaronlevi/prefect/blobs/sha256:f5cc5422ebcbbf01f9cd227d36de9dd7e133e1fc6d852f3b0c65260ab58f99f3: 504 Gateway Time-out
```
Nate
those appear to be network timeouts with your registry
Yaron Levi
Using ECS push work pools + Prefect Cloud
So this is a problem that docker.io should handle?
Maybe Prefect can make this more robust internally?
Nick Torba
@Yaron Levi I think with the bug I posted, we aren't even getting to this stage, because the task is never getting submitted to ECS
👍 1
Yaron Levi
I think the orchestrator (Prefect) should take e2e responsibility for this. Because the other way is basically saying: "This thing is behind this API, go talk to them"
Nate
> Maybe Prefect can make this more robust internally?
it's possible that we could retry internally in Prefect Cloud. this is not a commonly reported issue; most people pull from their own ECR if they're already in ECS. feel free to open an issue though!
👍 1
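(Since the thread ends on pulling from ECR instead of Docker Hub: a hedged sketch of what pointing a deployment at an ECR-hosted image can look like with `flow.deploy`. The account ID, region, repo, tag, pool name, and flow below are all placeholders, and it assumes the image was built and pushed to ECR separately, under Prefect 3.x.)

```python
# Sketch: reference an image hosted in your own ECR repo so the ECS task
# pulls over the AWS network instead of from registry-1.docker.io.
# Account ID, region, repo, tag, pool name, and flow name are placeholders.
from prefect import flow


@flow(log_prints=True)
def nightly_flow():
    print("hello from ECS")


if __name__ == "__main__":
    nightly_flow.deploy(
        name="nightly",
        work_pool_name="ecs-push-pool",
        # Image was already built and pushed to ECR out of band, so skip both steps here.
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/prefect-flows:2024-06-01",
        build=False,
        push=False,
    )
```

Keeping the image in ECR keeps the pull inside AWS, which sidesteps the registry-1.docker.io 504s shown above.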