# ask-community
y
Hi 👋 We are using Prefect Cloud + ECS Push work pools and we just saw an error we've never seen before.
```
Reached configured timeout of 300s for ECS
```
This happened to many flows (all those that were scheduled to run around that time).
j
Hey, I believe that is the default value of `task_start_timeout_seconds` on your ECS work pool. That controls when to crash the flow run if the ECS task can't start within that duration. If you take a look at that task directly in ECS, you should hopefully see more information about why it couldn't provision.
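(For reference: if the timeout turns out to be genuinely too short, here's a minimal sketch of overriding it per deployment, assuming Prefect's `flow.deploy` API with `job_variables`; the deployment, pool, and image names below are placeholders.)
```python
from prefect import flow


@flow(log_prints=True)
def my_flow():
    print("hello from ECS")


if __name__ == "__main__":
    # Overrides the work pool's default task_start_timeout_seconds (300s)
    # for this deployment only; the pool keeps its default for everything else.
    my_flow.deploy(
        name="ecs-example",                 # placeholder deployment name
        work_pool_name="my-ecs-push-pool",  # placeholder ECS push work pool
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-image:latest",  # placeholder
        job_variables={"task_start_timeout_seconds": 600},
        build=False,  # assume the image is already built and pushed
        push=False,
    )
```
The same `task_start_timeout_seconds` variable can also be raised as a default on the work pool itself via its base job template, if you'd rather not override it per deployment.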
c
I have hit this sporadically, and have never seen any errors suggesting why. Each time the issue occurred, there were no logs of any issue in the target AWS account (I even checked with DevOps). Each time, re-running the affected tasks had no issue. Near as I can figure, a transient network issue got in the way of traffic, leaving my task in a sad state. It's either that or sadness in Prefect infrastructure...
j
@Cormac always happy to take a look if you have a specific flow run id to ensure it's nothing in between Prefect Cloud and your ECS
c
@Jake Kaplan nothing recent, happily....
k
Yeah, this can be hard to track down because there's a wide variety of possible causes. The one I've seen most commonly is that if you're using Fargate, the availability zone you're running your tasks in may have been hit by high compute demand, and you're forced to wait longer than usual to get resources allocated.
It could just as well be a different cause, though.
c
> The one I've seen most commonly is that if you're using Fargate, the availability zone you're running your tasks in may have been hit by high compute demand, and you're forced to wait longer than usual to get resources allocated.
Fair, could be that. However, the last instance of the issue occurred over a weekend, which may suggest otherwise (assuming a lighter load over the weekend). If it does happen again, any wise words on what AWS logging to monitor for hints?
k
So it's waiting for the ECS task to enter a `STARTED` state. You should be able to find the ECS task your flow run was intended to happen in, and check out its state and logs. Sometimes you'll see logs about failed network requests or images that won't pull (ECS doesn't cache images).
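(For the AWS side, here's a minimal sketch of pulling the stop reason for tasks that never got going, using boto3; the cluster name is a placeholder, and note that ECS only keeps stopped tasks visible for a short while, so this is most useful right after a failure.)
```python
import boto3

# Placeholder: substitute the cluster your ECS work pool targets.
CLUSTER = "my-prefect-cluster"

ecs = boto3.client("ecs")

# Stopped tasks carry the reason the container never started,
# e.g. image pull failures or ENI/network provisioning errors.
stopped = ecs.list_tasks(cluster=CLUSTER, desiredStatus="STOPPED")

if stopped["taskArns"]:
    details = ecs.describe_tasks(cluster=CLUSTER, tasks=stopped["taskArns"])
    for task in details["tasks"]:
        print(task["taskArn"])
        print("  stopCode:     ", task.get("stopCode"))
        print("  stoppedReason:", task.get("stoppedReason"))
        for container in task.get("containers", []):
            print("  container:", container["name"], "->", container.get("reason"))
```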