
Leonard Marcq

09/13/2020, 10:02 AM
I have 32 flows, each running on a 3-minute schedule on Prefect Server (not Cloud). I have a Fargate agent running on an EC2 instance. Everything works well, except that ~7% of my flow runs fail to start due to an AWS service limit on Fargate ("An error occurred (ThrottlingException) when calling the RunTask operation (reached max retries: 4): Rate exceeded."). My issue is that those flow runs are marked as Failed and never retried. I originally thought that Lazarus would retry failed flow runs, but it seems I misunderstood. Is there a recommended way of retrying flow runs that failed to even start?

Jeremiah

09/13/2020, 12:16 PM
Lazarus restarts distressed flow runs - meaning runs that indicate they should be running but have no running tasks. It doesn’t automatically retry failed runs.

Leonard Marcq

09/13/2020, 1:40 PM
Ok, thank you. I guess what got me hoping for that was this part of the docs:
"The Lazarus process is meant to gracefully retry failures caused by factors outside of Prefect's control. The most common situations requiring Lazarus intervention are infrastructure issues, such as Kubernetes pods not spinning up or being deleted before they're able to complete a run."
I also wasn't clear on what "distressed" flow runs were. So I guess I will have to cook up something that retrieves the failed flow runs at some point and calls set_flow_run_state to set them back to Scheduled to restart them (as in https://docs.prefect.io/orchestration/concepts/flow_runs.html#graphql-2).
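Roughly something like this, I guess (untested sketch against the Prefect 0.x Python Client; the GraphQL filter and fields are assumptions to double-check against that docs page):
```python
from prefect import Client
from prefect.engine.state import Scheduled

client = Client()  # assumes the endpoint is configured to point at the local Prefect Server

# find flow runs that ended up Failed (filter is an assumption; it should really
# be narrowed to runs that never started, e.g. by also checking task run states)
result = client.graphql(
    """
    query {
      flow_run(where: {state: {_eq: "Failed"}}) {
        id
        version
      }
    }
    """
)

for run in result.data.flow_run:
    # put the run back into a Scheduled state so the agent submits it again
    client.set_flow_run_state(
        flow_run_id=run.id,
        version=run.version,
        state=Scheduled(),
    )
```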

Michael Ludwig

09/14/2020, 6:00 AM
You can also have those Fargate limits increased by AWS with a support request. I think the default limit is 50, which could be too little.
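To see what the current limits actually are, something like this works (sketch only; boto3 is assumed here, and some Fargate limits may still require a support case rather than a Service Quotas request):
```python
import boto3

# list the Fargate-related quotas for this account/region to find the
# relevant concurrent-task limit and its quota code
quotas = boto3.client("service-quotas")
for quota in quotas.list_service_quotas(ServiceCode="fargate")["Quotas"]:
    print(quota["QuotaName"], quota["QuotaCode"], quota["Value"], quota["Adjustable"])

# if the quota is adjustable, an increase can be requested programmatically;
# otherwise it goes through an AWS support case
# quotas.request_service_quota_increase(
#     ServiceCode="fargate", QuotaCode="L-...", DesiredValue=100
# )
```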

Leonard Marcq

09/14/2020, 6:25 AM
@Michael Ludwig - Thanks, I saw that we could request a limit increase; I will do that as well. The limit seems to be even lower than 50 in my region (maybe 30). Finding out about the lack of automatic retries on flow runs also makes me think I should refactor my flows into fewer but bigger flows that do more work - otherwise I'll need to request a very high Fargate limit or stagger my schedules so they don't all run at the same time.
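For the staggering idea, something like this could work (rough sketch; the offsets and the flow count are made-up numbers, and each flow would get its own slot):
```python
from datetime import timedelta

import pendulum
from prefect.schedules import IntervalSchedule


def staggered_schedule(slot: int, total: int = 32) -> IntervalSchedule:
    # spread the flows' start times across the 3-minute (180 s) window
    offset = timedelta(seconds=slot * (180 // total))
    return IntervalSchedule(
        start_date=pendulum.datetime(2020, 9, 14, tz="UTC") + offset,
        interval=timedelta(minutes=3),
    )


# e.g. flow number 5 of 32:
# flow.schedule = staggered_schedule(5)
```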

Jeremiah

09/14/2020, 3:43 PM
@Leonard Marcq we will work to clarify the Lazarus behavior, sorry for the confusion. Two potentially helpful notes: flow-level retries are on our roadmap, but I do not yet have an expected date for you. In addition, Prefect Cloud’s enterprise tiers have support for custom flow-level (or task-level) concurrency limits, which would help you avoid massive parallelism gracefully.

Leonard Marcq

09/14/2020, 6:25 PM
@Jeremiah - Thank you. It would indeed be nice to have flow-level retries in general; until that is released I’ll just refactor the logic of our flows. I saw the concurrency limit feature that Cloud offers and it would have been very helpful. Unfortunately we are in China, so the Great Firewall sits right between our infrastructure and Prefect Cloud (and almost everything else lol)

Jeremiah

09/14/2020, 7:25 PM
Understood. That feature will eventually make its way to Server as bandwidth allows.
(engineering bandwidth, not network 🙂)

Leonard Marcq

09/16/2020, 8:17 PM
I ended up refactoring my flows into fewer, bigger flows that do some logic to retrieve the initial data (Aircraft tracking example style) and generate a bunch of mapped tasks (which originally were separate flows), processed on a Dask cluster on Fargate. It makes a lot more sense than my original setup - much easier to track in the UI, much easier to reprocess, and a lot cheaper (since Fargate bills a minimum of one minute and I had very short flows).
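The rough shape, for reference (simplified sketch; the task bodies and the Dask scheduler address are placeholders):
```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor


@task
def get_records():
    # placeholder for the "retrieve initial data" step
    return ["record-1", "record-2", "record-3"]


@task
def process(record):
    # placeholder for what used to be a standalone 3-minute flow
    return record.upper()


with Flow("combined-flow") as flow:
    records = get_records()
    process.map(records)

# run the mapped tasks on an existing Dask cluster instead of one Fargate task per item
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```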

Jeremiah

09/16/2020, 8:29 PM
Glad to hear it, thanks for updating us!