jack

12/03/2021, 4:31 PM
How to track reruns? We've been calling `client.create_flow_run()` to create several flow runs (using ECSRun), and then polling each with `client.get_flow_run_state` to know when all the flows have completed. When one of the flows fails (and Prefect starts a new flow run to take its place), how can we check when the rerun is complete (and whether it succeeded)?
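For reference, the polling described above might look roughly like the following. This is a minimal sketch, not the poster's actual code: the helper name `wait_for_flow_runs`, the `poll_seconds` default, and the injectable `sleep` argument are all hypothetical. It assumes a Prefect 1.x `Client`, whose `get_flow_run_state` returns a state object with an `is_finished()` method.

```python
import time
from typing import Dict, Iterable


def wait_for_flow_runs(
    client,
    flow_run_ids: Iterable[str],
    poll_seconds: float = 30.0,  # hypothetical default, tune as needed
    sleep=time.sleep,
) -> Dict[str, object]:
    """Poll until every flow run reports a finished state.

    Returns a mapping of flow run id -> the finished state that was observed.
    """
    pending = set(flow_run_ids)
    observed = {}
    while pending:
        for run_id in list(pending):
            # Assumes a Prefect 1.x Client; the returned state exposes is_finished().
            state = client.get_flow_run_state(run_id)
            if state.is_finished():
                observed[run_id] = state
                pending.discard(run_id)
        if pending:
            sleep(poll_seconds)
    return observed
```

Note that a loop like this stops watching a run as soon as it sees a finished state, so it will not notice a run that is marked failed and later rescheduled, which is the situation the question is about.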
Kevin Kho

12/03/2021, 4:34 PM
I'm a bit confused about why a new flow run would take its place. But I think using the `StartFlowRun` task will give you what you want, because it `raise`s the end state of the flow if you set `wait=True`. You can then call it like `StartFlowRun(…).run(...)` outside of a flow.
jack

12/03/2021, 4:47 PM
Updated question to clarify the number of flows being run. Guessing we don't want to use wait=True since it's multiple flow runs.
Kevin Kho

12/03/2021, 4:52 PM
OK, that makes sense. I am still a bit confused about what is starting a new flow run. Do you check for failure and then restart it? Is Prefect automatically restarting it (I don’t think we do)?
jack

12/03/2021, 4:54 PM
We check for failure but we do not manually restart. Let me verify what we're seeing here.
Here are the highlights of the Prefect logs:
12:09pm (Some normal log output from the flow)
12:12pm No heartbeat detected from the remote task; marking the run as failed.
12:28pm Rescheduled by a Lazarus process. This is attempt 1.
12:28pm Submitted for execution: Task arn:aws:ecs:
12:29pm Beginning Flow run for xxx
12:29pm Flow run FAILED: some reference tasks failed.
Here is the GUI timeline
Kevin Kho

12/03/2021, 5:46 PM
Got it. That really helps. So Lazarus kicks in if the Flow can’t find the underlying compute to execute (Kubernetes or in this case ECS). Lazarus will re-submit the flow. Now to your question on how to get state. Basically you will need to use the GraphQL API I think. And then you can search by flow_id or by name and project, get the latest flow run, and then check the state.
You can query with something like this (though this is for starting flow runs). The point is to use the `client.graphql()` method with your query to pull the info.
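As an illustration of the approach described here (look up by `flow_id`, take the latest flow run, check its state), a query through `client.graphql()` could be sketched as below. The helper name `latest_flow_run` is hypothetical, and the field names (`flow_run`, `flow_id`, `scheduled_start_time`, `state`) and the `uuid` variable type are assumptions about the Prefect 1.x GraphQL schema; it is worth confirming them in the Interactive API before relying on this.

```python
from typing import Mapping, Optional

# Assumed schema: a flow_run table filterable by flow_id and sortable by
# scheduled_start_time. Verify these names against your Prefect backend.
LATEST_RUN_QUERY = """
query LatestFlowRun($flow_id: uuid!) {
  flow_run(
    where: {flow_id: {_eq: $flow_id}}
    order_by: {scheduled_start_time: desc}
    limit: 1
  ) {
    id
    name
    state
  }
}
"""


def latest_flow_run(client, flow_id: str) -> Optional[Mapping]:
    """Return the most recently scheduled flow run for a flow, or None if there is none."""
    result = client.graphql(LATEST_RUN_QUERY, variables={"flow_id": flow_id})
    runs = result["data"]["flow_run"]
    return runs[0] if runs else None
```

With a real client this might be called as `latest_flow_run(prefect.Client(), flow_id)`, after which `run["state"]` holds the state name as a string.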
jack

12/03/2021, 5:52 PM
Reading the Lazarus docs, it says Lazarus runs once every 10 minutes. Would it be easier for us to disable Lazarus for these flow runs, so that we could create a new flow run as soon as one is noticed to have failed?
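A rough sketch of that idea, with Lazarus disabled and the caller creating a replacement run as soon as a failure is seen, might look like this. The function name `run_with_resubmit`, the `max_attempts` cap, and the `poll_seconds` default are hypothetical; the code assumes a Prefect 1.x `Client` where `create_flow_run(flow_id=...)` returns the new flow run id and the state object has `is_finished()` and `is_successful()`.

```python
import time
from typing import Tuple


def run_with_resubmit(
    client,
    flow_id: str,
    max_attempts: int = 3,       # hypothetical cap so a broken flow cannot loop forever
    poll_seconds: float = 30.0,  # hypothetical polling interval
    sleep=time.sleep,
) -> Tuple[str, object]:
    """Run a flow, creating a fresh flow run after each unsuccessful finish.

    Returns the id and final state of the last attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for _ in range(max_attempts):
        run_id = client.create_flow_run(flow_id=flow_id)
        # Poll this attempt until it reaches a finished state.
        while True:
            state = client.get_flow_run_state(run_id)
            if state.is_finished():
                break
            sleep(poll_seconds)
        if state.is_successful():
            break
    return run_id, state
```

This handles one flow at a time for clarity; tracking several flows would mean keeping a set of pending run ids and replacing an id whenever its run finishes unsuccessfully.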
Kevin Kho

12/03/2021, 5:54 PM
If that works for you, yep you can do that. Do you know how to disable Lazarus?
jack

12/03/2021, 5:56 PM
I don't know how to disable Lazarus
Kevin Kho

12/03/2021, 5:58 PM
You go to the Flow settings and then there is a toggle to turn it off
jack

12/03/2021, 5:59 PM
Can it be disabled from the ECSRun parameters?
Found the settings
Kevin Kho

12/03/2021, 6:02 PM
I think the UI is easier. Yeah you can disable it there