# ask-community
John-Craig Borman
Hi all, I am running into an issue with "zombie flow runs" in a k8s environment. We have a scheduler flow that kicks off a large number of subflows in parallel. Occasionally a pod running one of the subflows will be evicted to avoid hitting resource limits, and the pod will be rescheduled until a backoff limit is reached, after which the cluster will not reschedule the subflow job onto another pod. At that point the subflow run should be marked as failed, but instead the flow run is stuck in a RUNNING state within the API.
Ideally I would like to figure out how to correct the state from RUNNING to CRASHED or FAILED. It's important for the parent scheduler flow to know when a subflow has completed so that new subflows may be started
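The best workaround I can think of so far is forcing the state through the client API. A rough sketch, assuming the Prefect 2.x Python client (untested, and the flow run ID below is a placeholder):

```python
import asyncio

from prefect import get_client
from prefect.states import Crashed

async def crash_stuck_run(flow_run_id: str) -> None:
    """Force a zombie flow run out of RUNNING and into CRASHED."""
    async with get_client() as client:
        # force=True asks the API to skip the orchestration rules that
        # might otherwise reject a RUNNING -> CRASHED transition
        await client.set_flow_run_state(
            flow_run_id=flow_run_id,
            state=Crashed(message="Pod evicted; k8s job hit its backoff limit"),
            force=True,
        )

# Placeholder ID -- copy the real one from the flow run page in the UI
asyncio.run(crash_stuck_run("00000000-0000-0000-0000-000000000000"))
```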
Will Raphaelson
Hi @John-Craig Borman - this should indeed be marking the flow run as crashed, specifically since it's an infra issue. Are you using an agent or worker?
John-Craig Borman
I believe we are using agents
Will Raphaelson
So I believe that this will function correctly if you use the K8s work pool and worker - https://docs.prefect.io/2.11.3/concepts/work-pools/. But I can dig in a bit more on the agent side.
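For reference, a rough sketch of the pool setup via the Python client - the pool name is made up, and I'd double-check the exact schema against the docs:

```python
import asyncio

from prefect import get_client
from prefect.client.schemas.actions import WorkPoolCreate

async def create_k8s_pool() -> None:
    # "k8s-subflows" is just an example name
    async with get_client() as client:
        await client.create_work_pool(
            work_pool=WorkPoolCreate(name="k8s-subflows", type="kubernetes")
        )

asyncio.run(create_k8s_pool())
```

Then start a worker against that pool from inside the cluster with `prefect worker start --pool k8s-subflows` (the kubernetes worker type ships in the prefect-kubernetes package, if I remember right).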
John-Craig Borman
Can confirm, we're currently using agents and work queues
Is there a migration guide from agents/work-queues to workers/work-pools?
Will Raphaelson
It's actually in review right now, it seems. You can read the diff on GitHub or view the draft docs deployed on Netlify: https://github.com/PrefectHQ/prefect/pull/10365 https://deploy-preview-10365--prefect-docs-preview.netlify.app
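The core change on your side is that deployments target a work pool instead of being picked up by an agent from a work queue. Roughly like this - names here are illustrative, and the guide covers the details:

```python
from prefect import flow
from prefect.deployments import Deployment

@flow
def my_subflow():
    ...

# Point the deployment at the new k8s work pool instead of relying on
# an agent polling a work queue. Flow and pool names are examples only.
deployment = Deployment.build_from_flow(
    flow=my_subflow,
    name="subflow-on-workers",
    work_pool_name="k8s-subflows",
)
deployment.apply()
```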
John-Craig Borman
This is helpful, thanks @Will Raphaelson - we'll look into upgrading
@Will Raphaelson can you confirm if workers/work-pools are still in beta in 2.11 or are they considered stable?
Will Raphaelson
Workers and work pools are not in beta, they are GA. Push work pools ARE in beta, but I don't think that's what you're looking at anyhow.