# prefect-community
**Leonard Marcq**
I have 32 flows, each running on a schedule (3-minute interval) on Prefect Server (not Cloud). I have a Fargate agent running on an EC2 instance. Everything works well, except that ~7% of my flow runs fail to start due to an AWS service limit on Fargate: `An error occurred (ThrottlingException) when calling the RunTask operation (reached max retries: 4): Rate exceeded.` My issue is that those flow runs are marked as Failed and never retried. I originally thought that Lazarus would retry failed flow runs, but it seems I misunderstood. Is there a recommended way of retrying flow runs that failed to even start?
**Jeremiah**
Lazarus restarts distressed flow runs - meaning runs that indicate they should be running but have no running tasks. It doesn’t automatically retry failed runs.
**Leonard Marcq**
Ok, thank you. I guess what got me hoping for that was this part of the docs:
> The Lazarus process is meant to gracefully retry failures caused by factors outside of Prefect's control. The most common situations requiring Lazarus intervention are infrastructure issues, such as Kubernetes pods not spinning up or being deleted before they're able to complete a run.
I also wasn't clear on what "distressed" flow runs were. So I guess I will have to cook up something to retrieve the failed flow runs at some point and `set_flow_run_state` to `Scheduled` to restart them (as in https://docs.prefect.io/orchestration/concepts/flow_runs.html#graphql-2).
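Something like this, roughly (a rough sketch against the Prefect 1.x Python client; the `start_time is null` filter for identifying runs that never launched, and the state message, are my own assumptions):

```python
from prefect import Client
from prefect.engine.state import Scheduled

client = Client()  # talks to the local Prefect Server by default

# Find Failed runs that never recorded a start_time, i.e. never launched
# (the start_time filter is an assumption for "failed to even start")
result = client.graphql(
    """
    {
      flow_run(where: {state: {_eq: "Failed"}, start_time: {_is_null: true}}) {
        id
      }
    }
    """
)

for run in result.data.flow_run:
    # Moving a run back to Scheduled lets the agent pick it up again
    client.set_flow_run_state(
        flow_run_id=run.id,
        state=Scheduled("Re-scheduled after Fargate throttling"),
    )
```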
**Michael Ludwig**
You can also ask AWS to increase those Fargate limits with a support request. I think it is limited to 50 by default, which could be too little.
**Leonard Marcq**
@Michael Ludwig - Thanks, I saw that we could request a limit increase, and I will do that too. The limit seems to be even lower than 50 in my region (maybe 30). Although finding out about the lack of automatic retries on flow runs makes me think I should also refactor my flows into fewer but bigger flows that do more work - otherwise I'll need to request a very high Fargate limit or stagger my schedules so they don't all run at the same time.
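If I go the schedule route, it would be something like this (a rough sketch; the 5-second-per-flow offset is arbitrary):

```python
from datetime import datetime, timedelta
from prefect.schedules import IntervalSchedule

def staggered_schedule(flow_index: int) -> IntervalSchedule:
    # Offset each flow's start_date so the 3-minute clocks don't all
    # fire Fargate's RunTask API at the same instant; 5 seconds apart
    # per flow is an arbitrary choice.
    return IntervalSchedule(
        start_date=datetime(2021, 1, 1) + timedelta(seconds=5 * flow_index),
        interval=timedelta(minutes=3),
    )

# e.g. flow.schedule = staggered_schedule(7) for the 8th flow
```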
**Jeremiah**
@Leonard Marcq we will work to clarify the Lazarus behavior, sorry for the confusion. Two potentially helpful notes: flow-level retries are on our roadmap, but I do not yet have an expected date for you. In addition, Prefect Cloud’s enterprise tiers have support for custom flow-level (or task-level) concurrency limits which would help you avoid massive parallelism gracefully.
**Leonard Marcq**
@Jeremiah - Thank you. Flow-level retries would indeed be nice in general; until that is released I'll just refactor the logic of our flows. I saw the concurrency limit feature that Cloud offers and it would have been very helpful. Unfortunately we are in China, so the Great Firewall is sitting right between our infrastructure and Prefect Cloud (and almost everything else lol)
**Jeremiah**
Understood. That feature will eventually make its way to Server as bandwidth allows.
(engineering bandwidth, not network 🙂)
**Leonard Marcq**
I ended up just refactoring my flows into fewer, bigger flows that run some logic to retrieve the initial data (Aircraft tracking example style) and generate a bunch of mapped tasks (which were originally my separate flows), processed on a Dask cluster on Fargate. It makes a lot more sense than my original setup: much easier to track in the UI, much easier to reprocess, and a lot cheaper (since Fargate bills a minimum of 1 minute and I had very short flows).
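Roughly this shape, for anyone curious (a sketch; `fetch_items` / `process_item` are stand-ins for the real logic, and the Dask scheduler address is assumed):

```python
from prefect import task, Flow
from prefect.executors import DaskExecutor

@task
def fetch_items():
    # Stand-in for the initial data retrieval step
    return list(range(32))

@task
def process_item(item):
    # Stand-in for the work each of the original 32 flows used to do
    return item * 2

with Flow("consolidated-flow") as flow:
    items = fetch_items()
    # Fan out over the items; each mapped child replaces one old flow
    process_item.map(items)

# Run the mapped children on an existing Dask cluster (address assumed)
flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")
```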
**Jeremiah**
Glad to hear it, thanks for updating us!