# prefect-community
Ton Steijvers:
Hi, in Prefect 1 we often see flows stuck in a Running state because the underlying Kubernetes pod has been terminated. Prefect is for some reason unable to detect this, and the flow stays in Running. In one case the last log message is
`Flow run RUNNING: terminal tasks are incomplete.`
and the flow then stays in a Running state forever. Is this a known issue?
It seems another user has reported a similar issue. In our case we need automatic recovery from this failure. Our job runs every 15 minutes and we can't have it blocked for more than about half an hour, because our customers depend on the data that the flow generates. Manually cancelling the job is not an option...
Mason Menges:
Hey @Ton Steijvers. The short version is that this is a limitation of Prefect 1. It's one of the specific use cases we're hoping to solve in Prefect 2, or at least handle in a way that is functionally more robust when these kinds of issues occur; cancellation in Prefect 2 is already better suited to this. This Discourse article goes into detail. It references failing flows specifically, but the difficulties are virtually the same: we don't really have any insight into what's happening on the Kubernetes side, and being overly opinionated here risks interrupting work that is actually functioning correctly. This section in particular addresses some of the challenges: "Tracking flow heartbeats in a hybrid execution model is challenging for several reasons:" That said, for this specific issue you can set up flow-level SLAs to cancel a job if it hasn't finished within a certain amount of time. This is a paid feature of Cloud 1, so it isn't available on the free tier.
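To illustrate the cancellation angle mentioned above (this is not the Cloud 1 SLA feature itself): a minimal sketch of a scheduled cleanup script that uses the Prefect 2 Python client to cancel runs that have been stuck in RUNNING past a chosen threshold. The 45-minute threshold and the force-cancel choice are assumptions, and the import paths assume a recent 2.x release (older releases exposed the same filter classes under `prefect.orion.schemas.filters`).

```python
# Hedged sketch: cancel Prefect 2 flow runs stuck in RUNNING longer than a threshold.
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

STUCK_AFTER = timedelta(minutes=45)  # hypothetical threshold; tune to your own SLA


async def cancel_stuck_runs() -> None:
    async with get_client() as client:
        # Ask the API for every flow run currently in a RUNNING state.
        running = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            )
        )
        now = datetime.now(timezone.utc)
        for run in running:
            # start_time is set when the run actually entered RUNNING; a run
            # whose pod was killed will keep accumulating age here.
            if run.start_time and now - run.start_time > STUCK_AFTER:
                await client.set_flow_run_state(
                    flow_run_id=run.id, state=Cancelled(), force=True
                )


if __name__ == "__main__":
    asyncio.run(cancel_stuck_runs())
```

Running something like this on a schedule is one way to keep a stale run from blocking the next one indefinitely, at the cost of occasionally cancelling a genuinely slow run.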
Ton Steijvers:
Hi @Mason Menges, thanks for the info. In the meantime I ported my flow to Prefect 2, but I'm still seeing instances where all the tasks of my flow are marked as Completed while the flow itself remains in a Running state. I have specified `timeout_seconds` on my flow, but that timeout has long since passed. The Kubernetes pod that was running the flow has already terminated. My flow is only allowed exactly one flow run at a time, so this long-running flow now blocks subsequent flow runs. I'm using `read_flow_runs` to check whether any other flow runs are still active.
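A minimal sketch of the setup described here, assuming a recent Prefect 2.x release: a flow with `timeout_seconds` set, plus a helper that uses `read_flow_runs` to check whether a run of that flow is still active. The flow name and the 30-minute timeout are made up for illustration; older 2.x releases exposed the filter classes under `prefect.orion.schemas.filters`.

```python
# Hedged sketch: a flow with a timeout, and a check for still-active runs of it.
import asyncio

from prefect import flow
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowFilter,
    FlowFilterName,
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType

FLOW_NAME = "generate-customer-data"  # hypothetical flow name


@flow(name=FLOW_NAME, timeout_seconds=1800)  # mark the run failed after 30 minutes
def generate_customer_data():
    ...


async def has_active_run() -> bool:
    # True if any run of this flow is still RUNNING, e.g. a stale run left
    # behind after its Kubernetes pod was killed.
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_filter=FlowFilter(name=FlowFilterName(any_=[FLOW_NAME])),
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            ),
            limit=1,
        )
    return len(runs) > 0


if __name__ == "__main__":
    print(asyncio.run(has_active_run()))
```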