
Mike Grabbe

01/10/2023, 3:33 PM
Hey all, my team has been using the ECSTask infrastructure block for a couple of months, and we've noticed that flows will occasionally fail with an error similar to
Submission failed. RuntimeError: Timed out after 240.94525742530823s while watching task for status {until_status or 'STOPPED'}
When flows fail like this, no automatic retry is triggered, even if the flow has been configured to retry. Any ideas how I can get the retries to kick in so we don't get these occasional errors?

Kalise Richmond

01/10/2023, 3:59 PM
Hey @Mike Grabbe, do you happen to have debug logs from the agent that is running this? Which version of Prefect are you using?

Mike Grabbe

01/10/2023, 4:00 PM
I'm on the latest version, 2.7.7
I'll try to dig up the logs

Zanie

01/10/2023, 4:16 PM
Right now, retries only cover errors in the flow itself; this is an error outside the flow (i.e., in infrastructure management). We have a goal to add separate retry settings for that.
It sounds like you want to increase task_start_timeout_seconds (https://prefecthq.github.io/prefect-aws/ecs/#prefect_aws.ecs.ECSTask.task_start_timeout_seconds)? Although the default is 120s, so perhaps you already have?

Mike Grabbe

01/10/2023, 4:32 PM
Right, I did. My guess is that increasing the timeout any further won't do any good. These are probably ECS task launch failures.
OK, so for now the scope of the retry configuration is limited to the task or the flow, and it's not yet available at the infra level.

Zanie

01/10/2023, 4:40 PM
Yep!

Mike Grabbe

01/10/2023, 4:49 PM
Thanks @Zanie, glad to see this is already on the roadmap.
Following up on this, @Kalise Richmond: I added the stack trace from the agent logs to the GitHub issue linked directly above.
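
Since flow-level retries do not cover failures that happen before the flow starts, one possible stopgap is to retry the submission step yourself. The following is only a rough sketch of that idea, not part of Prefect or prefect-aws: `submit_with_retries` and its parameters are hypothetical names, and `submit` stands in for whatever callable launches the ECS task in your setup. It assumes the launch failure surfaces as a `RuntimeError`, as in the error quoted at the top of the thread.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def submit_with_retries(
    submit: Callable[[], T],
    max_attempts: int = 3,
    delay_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `submit` until it succeeds or `max_attempts` is exhausted.

    Hypothetical helper: `submit` is any zero-argument callable that
    launches the infrastructure and raises RuntimeError on a timeout.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except RuntimeError:
            # Re-raise on the final attempt so the failure stays visible.
            if attempt == max_attempts:
                raise
            # Wait before trying again; a fixed delay keeps the sketch simple.
            sleep(delay_seconds)
    raise AssertionError("unreachable")
```

Only `RuntimeError` is caught here, so other exceptions still fail immediately. Whether a retry actually helps depends on why the ECS task failed to launch, so the agent's debug logs are still worth checking first.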