John-Craig Borman

08/14/2023, 5:02 PM
Hi all, I am running into an issue with "zombie flow runs" in a k8s environment. We have a scheduler flow that kicks off a large number of subflows in parallel. Occasionally a pod running one of the subflows is evicted to avoid hitting resource limits, and the pod is rescheduled until the backoff limit is reached. After that, the cluster no longer reschedules the subflow job on another pod. The subflow run should then be marked as failed, but instead it is stuck in a RUNNING state within the API.
Ideally I would like to figure out how to correct the state from RUNNING to CRASHED or FAILED. It's important for the parent scheduler flow to know when a subflow has finished so that new subflows can be started.
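To make the mismatch concrete, here is a minimal sketch of the reconciliation check described above: find flow runs the API still reports as RUNNING whose Kubernetes job has already failed. Everything in it is an assumption for illustration. `find_zombie_runs` and both input mappings are hypothetical names, and fetching the states from the Prefect API, reading job conditions from the cluster, and actually setting a run to CRASHED are left out.

```python
from typing import Dict, List


def find_zombie_runs(
    flow_run_states: Dict[str, str],
    job_conditions: Dict[str, str],
) -> List[str]:
    """Return IDs of flow runs that look like zombies.

    flow_run_states: hypothetical mapping of flow run ID -> state name
        as reported by the API (e.g. "RUNNING", "COMPLETED").
    job_conditions: hypothetical mapping of flow run ID -> condition of
        the Kubernetes job backing that run ("Active", "Complete", "Failed").
    """
    zombies = []
    for run_id, state in flow_run_states.items():
        if state != "RUNNING":
            continue  # only runs the API still considers active are candidates
        condition = job_conditions.get(run_id)
        if condition is None:
            continue  # no job information: leave for manual inspection
        if condition == "Failed":
            # The job gave up (e.g. backoff limit reached), but the API was
            # never told, so the run would stay RUNNING indefinitely.
            zombies.append(run_id)
    return sorted(zombies)


if __name__ == "__main__":
    states = {"run-a": "RUNNING", "run-b": "RUNNING", "run-c": "COMPLETED"}
    jobs = {"run-a": "Failed", "run-b": "Active", "run-c": "Complete"}
    for run_id in find_zombie_runs(states, jobs):
        print(f"{run_id} should be moved from RUNNING to CRASHED")
```

The IDs returned are the candidates whose state would need correcting so the parent scheduler flow can stop waiting on them.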

Will Raphaelson

08/14/2023, 7:42 PM
Hi @John-Craig Borman - this should indeed be marking the flow run as crashed, specifically, since it's an infra issue. Are you using an agent or a worker?

John-Craig Borman

08/14/2023, 7:46 PM
I believe we are using agents

Will Raphaelson

08/14/2023, 7:47 PM
I believe this will work correctly if you use the K8s work pool and worker - https://docs.prefect.io/2.11.3/concepts/work-pools/. But I can dig in a bit more on the agent side.

John-Craig Borman

08/14/2023, 7:50 PM
Can confirm, we're currently using agents and work queues
Is there a migration guide for agents/work-queues to workers/work-pools?

Will Raphaelson

08/14/2023, 7:59 PM
It's actually in review right now, it seems. You can read the diff on GitHub or view the draft docs deployed on Netlify: https://github.com/PrefectHQ/prefect/pull/10365 https://deploy-preview-10365--prefect-docs-preview.netlify.app

John-Craig Borman

08/15/2023, 1:40 PM
This is helpful, thanks @Will Raphaelson - we'll look into upgrading
@Will Raphaelson can you confirm whether workers/work-pools are still in beta in 2.11, or are they considered stable?

Will Raphaelson

08/16/2023, 1:58 PM
Workers and Work Pools are not in beta, they are GA. Push work pools ARE in beta, but I don't think that's what you're looking at anyhow.