# ask-community
Nick Torba
Hello prefect team, I am having a lot of trouble with one of our ECS push work pools. I commented on a related issue here: https://github.com/PrefectHQ/prefect/issues/18429#issuecomment-3263294213 Each night, we have a wave of ~300 runs get submitted. They all end up running many hours late because they sit in LATE status. The concurrency limit on the work pool is 20, and the deployments don't have concurrency limits. I don't think it is an issue with the ECS infra, because I can't see any logs; it looks like the runs are never being submitted at all. Eventually, they do run. It seems like maybe something on the Prefect backend is telling Cloud not to submit them? But I am not sure where to look to figure out whether it is something on our ECS side that could be causing this.
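(A rough way to sanity-check whether the pool's concurrency limit is actually saturated is the Prefect Python client; the sketch below assumes a recent Prefect 3.x install, and the pool name `ecs-push-pool` is a placeholder.)

```python
# Sketch: check whether the work pool's concurrency limit is the bottleneck.
# Assumes Prefect 3.x; "ecs-push-pool" is a placeholder pool name.
import asyncio

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
    WorkPoolFilter,
    WorkPoolFilterName,
)

POOL = "ecs-push-pool"


async def main():
    async with get_client() as client:
        pool = await client.read_work_pool(POOL)
        print(f"pool concurrency limit: {pool.concurrency_limit}")

        for state in ("Running", "Pending", "Late"):
            runs = await client.read_flow_runs(
                work_pool_filter=WorkPoolFilter(name=WorkPoolFilterName(any_=[POOL])),
                flow_run_filter=FlowRunFilter(
                    state=FlowRunFilterState(name=FlowRunFilterStateName(any_=[state]))
                ),
                limit=200,  # counts are capped at 200 per state
            )
            print(f"{state}: {len(runs)} runs")


asyncio.run(main())
```

If Running + Pending stay well under the limit while Late keeps growing, the pool limit itself isn't what is holding runs back.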
Nate
hi @Nick Torba - can you share what version of `prefect-aws` you're using? just as an FYI, we have been working on improving the ECS worker, if you're willing/able to check it out
Nick Torba
This is a push work pool, so we aren't running the ECS worker. @Nate
Nate
ah i see. i would open an issue/discussion about this then
might be worth checking for old/zombie runs in LATE/CANCELLING etc (one way to check is sketched below)
> The concurrency limit on the work pool is 20
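(A hedged sketch of that zombie-run check with the Prefect Python client; the state names and the 200-run cap below are arbitrary, and it assumes Prefect 3.x.)

```python
# Sketch: surface old/"zombie" runs stuck in Late or Cancelling,
# sorted oldest-first by expected start time. Assumes Prefect 3.x.
import asyncio

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
from prefect.client.schemas.sorting import FlowRunSort


async def main():
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["Late", "Cancelling"])
                )
            ),
            sort=FlowRunSort.EXPECTED_START_TIME_ASC,
            limit=200,
        )
        for run in runs:
            print(run.expected_start_time, run.state_name, run.name)


asyncio.run(main())
```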
Yaron Levi
👀
Nick Torba
The LATE runs aren't zombies. They are runs that have been submitted to Prefect Cloud but have not yet been submitted to the AWS ECS infra. I don't see any zombie CANCELLING runs. There were a handful of unexpected CANCELLED runs (that I think should have been FAILED), but the concurrency limit was much higher than the number of cancelled runs. I will open a new ticket tonight
Hey @Nate @Yaron Levi, Thanks for the replies over the weekend. I made a ticket here last night: https://github.com/PrefectHQ/prefect/issues/18877 Any idea how long it'll take to have someone dig into this?
Nate
thanks for opening the issue! we'll look into it as soon as we can. feel free to email help@prefect.io if you need something urgently
Yaron Levi
@Nate @Nick Torba In the last couple of weeks we started seeing errors of this type:
```
Flow run could not be submitted to infrastructure: TaskFailedToStart - CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/yaronlevi/prefect/manifests/sha256:df9119a64997599ed66d07a1742ad9a405a73ed5fe006d4848a41d6eaee9e8e4: 504 Gateway Time-out
```
```
Flow run could not be submitted to infrastructure: TaskFailedToStart - CannotPullContainerError: ref pull has been retried 1 time(s): failed to copy: httpReadSeeker: failed open: unexpected status code https://registry-1.docker.io/v2/yaronlevi/prefect/blobs/sha256:f5cc5422ebcbbf01f9cd227d36de9dd7e133e1fc6d852f3b0c65260ab58f99f3: 504 Gateway Time-out
```
Nate
those appear to be network timeouts with your registry
Yaron Levi
Using ECS push work pools + Prefect Cloud
So this is a problem that docker.io should handle?
Maybe Prefect can make this more robust internally?
Nick Torba
@Yaron Levi I think with the bug I posted, we aren't even getting to this stage, because the task is never getting submitted to ECS
👍 1
Yaron Levi
I think the orchestrator (Prefect) should take e2e responsibility for this. Because the other way is basically saying: "This thing is behind this API, go talk to them"
Nate
> Maybe Prefect can make this more robust internally?
it's possible that we could retry internally in Prefect Cloud. this is not a commonly reported issue; most people pull from their own ECR if they're already in ECS. feel free to open an issue though!
👍 1
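(Since the thread ends on pulling from ECR instead of Docker Hub: a hedged sketch of what pointing a deployment at an ECR-hosted image can look like with `flow.deploy`. The account ID, region, repo, tag, pool name, and flow below are all placeholders, and it assumes the image was built and pushed to ECR separately, under Prefect 3.x.)

```python
# Sketch: reference an image hosted in your own ECR repo so the ECS task
# pulls over the AWS network instead of from registry-1.docker.io.
# Account ID, region, repo, tag, pool name, and flow name are placeholders.
from prefect import flow


@flow(log_prints=True)
def nightly_flow():
    print("hello from ECS")


if __name__ == "__main__":
    nightly_flow.deploy(
        name="nightly",
        work_pool_name="ecs-push-pool",
        # Image was already built and pushed to ECR out of band, so skip both steps here.
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/prefect-flows:2024-06-01",
        build=False,
        push=False,
    )
```

Keeping the image in ECR keeps the pull inside AWS, which sidesteps the registry-1.docker.io 504s shown above.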