# ask-community
s
Hi @Marvin I'm using background tasks to trigger flows (with sub-flows and tasks / sub-tasks) that might take some time to finish. I have multiple replicas of the backend calling the background tasks, as well as multiple replicas of the workers. I must make the system resilient to restarts (no-downtime deployments, potential crashes of workers, etc.). The flow is idempotent, so it can be restarted by another worker after the original one has been stopped. => So I need a way to detect flows that are no longer attached to a worker and "reschedule" them accordingly. What would be the appropriate way to achieve that? I saw this page talking about zombie flow detection, but my flows are still stuck in "Running" after I restart my worker container. I asked Marvin and it told me the flow would move to the "Crashed" state, while for me it stays "Running" forever. (Using Prefect 3.4.17) Thanks! 🙂
m
✅ thought for 804.6 seconds
Thanks for the detailed context — this absolutely should be possible with Prefect 3.x, but a couple of components need to be in place for zombie/orphaned runs to be detected and resubmitted reliably.

What should happen
- Flow runs emit heartbeats from the flow engine process. If those heartbeats stop (e.g., the worker pod dies and kills the engine process), the API’s zombie detection should transition the run from RUNNING to CRASHED within a short window.
- From there, you can use an Automation to automatically resubmit the run (e.g., “On flow run crashed → Run deployment”).
- Docs: Detect zombie flows

Why your runs might remain RUNNING
If a run stays RUNNING after a worker restart, it usually means one of:
- Heartbeats were never being sent from the engine process, or
- Your API server’s “repossessor” service isn’t running or isn’t configured to act on missed heartbeats, or
- The timeout window is much larger than you expect.

A few checks to run

1) Confirm heartbeats are enabled and visible
- The flow engine emits heartbeats on a cadence controlled by the runner settings; ensure you haven’t disabled or overridden them.
- Turn on DEBUG logs for the engine process and check for heartbeat logs; in the UI, open the flow run details and check “Last heartbeat”. If it keeps updating after a worker restart, the engine may not have been killed as expected.
- Docs: Detect zombie flows

2) Confirm your API’s zombie detector is running
- If you’re on Prefect Cloud, this service is managed for you.
- If you’re running Prefect Server yourself, ensure the server services are running (specifically the “Repossessor” service) — that’s what marks stale RUNNING runs as CRASHED when heartbeats stop.
- Docs: Detect zombie flows

3) Sanity-check CLI and UI views
- List runs in RUNNING state:
```
prefect flow-run ls --state RUNNING
```
- If you need to manually unstick one while you diagnose, you can cancel it:
```
prefect flow-run cancel <FLOW_RUN_ID>
```
Then re-run its deployment.

How to auto-resubmit reliably
- Create an Automation that triggers on flow runs transitioning to CRASHED and runs the corresponding deployment again. This is the cleanest “self-healing” pattern once the zombie detector is working.
- Docs (Automations): docs.prefect.io (Automations section)

Extra resilience for background tasks
Since you’re launching work via background `.delay()` tasks and awaiting via `.result_async()`:
- Always await with a timeout; if it exceeds “heartbeat timeout + buffer”, treat it as a failure and initiate a retry/resubmission path from your app (e.g., cancel the stuck run and submit a new one). See the sketch after this list.
- If you dynamically name tasks via `.with_options(name=...)`, that’s fine — names don’t affect heartbeats; just make sure your task worker pods also emit heartbeats so their task runs are detected as zombies if they die.
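For the timeout pattern, here is a minimal sketch of what the await side could look like. The `process_order` task name and the 300-second budget are illustrative assumptions, not anything from your code:

```python
import asyncio

from prefect import task


@task
def process_order(order_id: str) -> str:
    # hypothetical task standing in for your real background work
    ...


async def submit_with_timeout(order_id: str, timeout_s: float = 300.0):
    # kick off the background task run on a task worker
    future = process_order.delay(order_id)
    try:
        # never wait forever: bound the wait by "heartbeat timeout + buffer"
        return await asyncio.wait_for(future.result_async(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # treat this as a lost worker / stuck run and hand off to your
        # resubmission path (cancel the stuck run, re-submit, alert, ...)
        raise
```

If the timeout fires, the run may still be alive server-side, so pair this with the cancel/resubmit step above rather than silently retrying.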
If you need a belt-and-suspenders watchdog
If you want immediate remediation even if the server’s zombie service were misconfigured, you can run a small periodic job that:
- Finds flow runs in RUNNING with “last heartbeat” older than N minutes,
- Transitions them to CRASHED (or cancels them), then
- Resubmits the same deployment with the same parameters.
(A rough sketch of this follows at the end of this message.)

I’m happy to share a concrete snippet once we confirm whether you’re on Cloud or OSS, since APIs differ slightly in how you’ll fetch “last heartbeat” and resubmit.

Could you share a bit more so I can pinpoint the fix?
- Are you on Prefect Cloud or running Prefect Server yourself?
- How are you starting your API/server (e.g., `prefect server start`, a Helm chart, custom compose)?
- Infra type for your worker (Process worker in k8s? Docker?) and how you’re launching flows (via deployments, or invoking flows from a background task)?
- One example Flow Run ID that remained RUNNING after a worker restart, and what the UI shows for its “Last heartbeat”.

Once I know those, I can give you the exact settings to verify (and the Automation or watchdog code to drop in).
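In the meantime, a very rough shape such a watchdog could take on OSS Server with the Python client is sketched below. It is a simplification: it treats any run that has sat in RUNNING longer than a cutoff as stale (using the state timestamp rather than reading the heartbeat event stream), and it assumes the runs come from deployments so they can be resubmitted from `deployment_id` and `parameters`:

```python
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Crashed


async def reap_stale_runs(max_running_minutes: int = 30):
    cutoff = datetime.now(timezone.utc) - timedelta(minutes=max_running_minutes)
    async with get_client() as client:
        running = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            ),
            limit=200,
        )
        for fr in running:
            # crude staleness check: how long ago the run entered RUNNING
            if fr.state is None or fr.state.timestamp > cutoff:
                continue
            # mark the stale run as crashed...
            await client.set_flow_run_state(
                fr.id, state=Crashed(message="Reaped by watchdog"), force=True
            )
            # ...and resubmit the same deployment with the same parameters
            if fr.deployment_id:
                await client.create_flow_run_from_deployment(
                    fr.deployment_id, parameters=fr.parameters
                )


if __name__ == "__main__":
    asyncio.run(reap_stale_runs())
```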
s
@Marvin I'm running the Prefect server myself. I don't see any heartbeat in the details of the flow run, only the "Last Updated" field, which has the same timestamp as the "Created" entry. How can I check whether the heartbeat service is running?
m
✅ thought for 315.5 seconds
Great question — a couple of important points to clarify first:
- Heartbeats are emitted by the flow engine (the process actually running your flow), not by the server. If a worker dies and kills that engine process, heartbeats should stop and the server should mark the run CRASHED.
- The server needs its loop services running to detect and act on zombies. If you started the server with `--no-services` or services aren’t running, zombies won’t be marked.

How to check your server services
- List services and whether they’re enabled:
```
prefect server services ls
```
You should see rows of services and the env var that toggles each one. Make sure you are not starting the server with `--no-services`.
- If services are not running, start them:
```
prefect server services start
```
Or run your server with services enabled:
```
prefect server start
```
Note: `prefect server start --no-services` runs only the webserver, which will prevent background detectors from working.

How to confirm heartbeats for a specific run
- Inspect a run to see the heartbeat field and timestamps:
```
prefect flow-run inspect <FLOW_RUN_ID>
```
If “last heartbeat” is empty and “Last Updated” equals “Created”, it usually means the engine process never started (or died immediately) — the server should still mark it CRASHED once the zombie detector runs, assuming services are up.

Make sure the engine is actually emitting heartbeats
- The engine’s heartbeat cadence is controlled by runner settings; set logging to DEBUG in your worker/runner container and look for heartbeat logs.
- Ensure you have a reasonable heartbeat frequency (typically 30s+). If you’re setting it via env var, it’s commonly `PREFECT_RUNNER_HEARTBEAT_FREQUENCY=30`.
- If heartbeats are disabled or the engine never starts, the server won’t see heartbeats.

Recommended self-healing flow
- Once zombies are marked CRASHED, add an Automation to resubmit the deployment on CRASHED. That’s the cleanest no-downtime pattern for idempotent flows:
  - Trigger: Flow run state = CRASHED
  - Action: Run Deployment (same deployment)

Next steps to diagnose your setup
- Can you share:
  - The output of:
```
prefect server services ls
```
  - How you start the server (e.g., the exact `prefect server start` command/args).
  - The output of:
```
prefect flow-run inspect <FLOW_RUN_ID>
```
    for a run that stayed RUNNING.
  - Whether your worker/engine logs show heartbeat messages (set `PREFECT_LOGGING_LEVEL=DEBUG` temporarily).

With that, I can pinpoint whether this is services not running, the engine not heartbeating, or something else — and suggest the exact fix. In parallel, if you need a safety net while diagnosing, you can also schedule a tiny watchdog job that cancels/resubmits RUNNING runs whose “last heartbeat” is older than N minutes. I can share a snippet once we confirm whether you’re on OSS Server or Cloud (you mentioned OSS Server) and which API pattern you prefer (Python client vs CLI).
s
@Marvin Now I see the heartbeat. It doesn't come right after the start, and instead of ending up "CRASHED", the flow and tasks got "CANCELLED". How can I deal with a worker crashing before the flow even emits its first heartbeat?
m
✅ thought for 310.0 seconds
s
@Marvin ok understood. The graceful shutdown process of the worker gets the flow cancelled before the process dies. On the contrary, if I just kill the process, then it eventually switches to a crashed state. However, the tasks inside the flow still remain "running", which is a problem
m
✅ thought for 206.5 seconds
Good find — this is a known corner case. If the worker crashes before the flow engine starts (i.e., before the first heartbeat is emitted), the server cannot infer “lost heartbeats” and will often end up marking the run as CANCELLED instead of CRASHED. That’s because no engine ever started to send heartbeats; from the server’s perspective the reservation was abandoned.

You’ve got a few solid options to make this resilient:

1) Prefer isolated infrastructure for long-running flows
- If you’re using a Process work pool, the engine runs inside the worker process. A worker crash can kill the engine before it starts, leading to CANCELLED without heartbeats.
- Using Docker or Kubernetes work pools isolates the flow-run engine in its own container/job. The worker can die/restart without killing running flows, and you’ll get proper heartbeats or a later zombie-detection transition to CRASHED if the infra dies.
- This is the most robust fix for “no downtime” patterns.

2) Add an Automation to re-run on both CRASHED and CANCELLED
- Since “failed to start” will materialize as CANCELLED, create an Automation that triggers on either:
  - Flow run state = CRASHED
  - Flow run state = CANCELLED
- To avoid re-running truly intentional cancellations, scope the automation using a tag you apply to runs started by your backend (e.g., tag runs with `auto-resubmit`) and have the automation only act on runs with that tag. (A sketch of this automation follows at the end of this message.)
- Docs: Detect zombie flows (the same resubmission pattern applies).

3) Add a “first-heartbeat watchdog” in your backend
- When you create a flow run, watch it briefly for a first heartbeat or a transition to RUNNING. If you don’t see a heartbeat within, say, 60–120s (and the run is CANCELLED or still pending without movement), treat it as a failed start and resubmit. A sketch of this check follows after the CLI helpers below.
- This covers cases where infra/worker disappears before the engine initializes.
- CLI helpers you can use while debugging:
```
prefect flow-run inspect <FLOW_RUN_ID>
prefect flow-run ls --state RUNNING
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME> --watch
```
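Here is one rough way such a first-start check could look, assuming the run is created from a deployment via `run_deployment`; the 120-second budget and 5-second polling interval are illustrative choices, not Prefect defaults:

```python
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import StateType
from prefect.deployments import run_deployment


async def ensure_started(deployment_name: str, parameters: dict, budget_s: float = 120.0):
    # submit the run without waiting for it to finish (timeout=0 returns immediately)
    flow_run = await run_deployment(deployment_name, parameters=parameters, timeout=0)
    loop = asyncio.get_running_loop()
    deadline = loop.time() + budget_s
    async with get_client() as client:
        while loop.time() < deadline:
            fr = await client.read_flow_run(flow_run.id)
            if fr.state and fr.state.type == StateType.RUNNING:
                return fr  # the engine came up; hand off to normal monitoring
            if fr.state and fr.state.type in (StateType.CRASHED, StateType.CANCELLED):
                break  # died before (or while) starting
            await asyncio.sleep(5)
    # no sign of life within the budget: treat it as a failed start and resubmit
    return await run_deployment(deployment_name, parameters=parameters, timeout=0)
```

In a real backend you would also want to cancel the abandoned run (if it is still pending) before resubmitting, so a slow worker picking it up later doesn't produce a duplicate.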
4) Double-check server services are running
- If services aren’t running, the server won’t detect zombies later either. Verify:
```
prefect server services ls
```
If needed:
```
prefect server services start
```
- Make sure you aren’t starting with `--no-services`.

Recommended path for you
- If you can, move these flows to a Docker or Kubernetes work pool. That will decouple flow execution from the worker’s lifecycle and eliminate most of these “pre-heartbeat” cancellations.
- Add a small automation to resubmit on CANCELLED and CRASHED for runs tagged by your backend. That gives you coverage even when the engine never starts.

If you share which work pool type you’re using today (Process/Docker/Kubernetes) and whether you want a sample Automation config or a small watchdog snippet, I can tailor the exact steps.
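For reference, a rough sketch of that tag-scoped automation created with the Python client is below. The deployment is inferred from the triggering run, the `auto-resubmit` tag is the example tag from above, and the exact trigger/action fields are worth double-checking against your Prefect version:

```python
from prefect.automations import Automation
from prefect.events.actions import RunDeployment
from prefect.events.schemas.automations import EventTrigger, Posture

automation = Automation(
    name="resubmit-crashed-or-cancelled",
    trigger=EventTrigger(
        # fire on flow-run state-change events for crashed or cancelled runs...
        expect={"prefect.flow-run.Crashed", "prefect.flow-run.Cancelled"},
        # ...but only for runs carrying the opt-in tag applied by your backend
        match_related={
            "prefect.resource.role": "tag",
            "prefect.resource.id": "prefect.tag.auto-resubmit",
        },
        posture=Posture.Reactive,
        threshold=1,
    ),
    actions=[
        # re-run the deployment the triggering flow run came from; note this uses
        # the deployment's default parameters unless you template them explicitly
        RunDeployment(source="inferred")
    ],
)
automation.create()
```

Creating the same automation in the UI (trigger: flow run enters Crashed or Cancelled; action: Run deployment) is equivalent if you prefer not to manage it in code.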
Got it — thanks for testing both shutdown modes. What you’re seeing is consistent with how things work:
- Hard kill: the engine stops heartbeating → the server marks the flow CRASHED.
- Graceful shutdown: the engine cancels the flow → the server marks the flow CANCELLED.
- Your issue: some child task runs remain RUNNING after the parent flow reaches a terminal state (CRASHED or CANCELLED).

Those child tasks should be cleaned up by the server’s loop services. Two services are particularly relevant:
- CancellationCleanup: sweeps and cleans up non-terminal children (e.g., RUNNING task runs) for terminal parent runs.
- TaskRunRecorder: tracks task-run events and helps reconcile “lost” tasks.

First, confirm both services are enabled and running
- List services:
```
prefect server services ls
```
Make sure “CancellationCleanup” and “TaskRunRecorder” show as enabled.
- If you started the server with `--no-services`, start them:
```
prefect server services start
```
- Then reproduce the crash and watch the server logs for “CancellationCleanup” activity; it should report cleaning up task runs under terminal flow runs.

Quick triage/cleanup if you still have stuck task runs
- You can list RUNNING task runs:
```
prefect task-run ls --state RUNNING
```
- And as a temporary remediation, run a small script to mark child task runs CRASHED when their parent is CRASHED/CANCELLED:
```python
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterId,
    TaskRunFilter,
    TaskRunFilterState,
    TaskRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Crashed


async def crash_child_tasks(flow_run_id: str):
    async with get_client() as client:
        # only task runs that are still in a RUNNING state...
        tr_filter = TaskRunFilter(
            state=TaskRunFilterState(
                type=TaskRunFilterStateType(any_=[StateType.RUNNING])
            )
        )
        # ...and only those belonging to the given (terminal) parent flow run
        fr_filter = FlowRunFilter(id=FlowRunFilterId(any_=[flow_run_id]))
        task_runs = await client.read_task_runs(
            flow_run_filter=fr_filter,
            task_run_filter=tr_filter,
            limit=5000,
        )
        for tr in task_runs:
            await client.set_task_run_state(
                tr.id,
                state=Crashed(message="Parent flow terminated"),
                force=True,
            )


if __name__ == "__main__":
    asyncio.run(crash_child_tasks("<FLOW_RUN_ID>"))
```
Best-practice pattern to be resilient
- Ensure server services are on (CancellationCleanup, TaskRunRecorder).
- Add an Automation: “On flow run CRASHED → Run deployment” to resubmit idempotent flows.
- Optionally add a small periodic watchdog flow that:
  - Finds flow runs in CRASHED/CANCELLED and marks any RUNNING child task runs as CRASHED
  - Resubmits the parent run if needed
- In your worker shutdown, prefer graceful SIGTERM so Prefect can cancel child tasks cleanly; for true crash handling, rely on the services above.

If CancellationCleanup is enabled and you still see RUNNING task runs under a CRASHED flow (after a few minutes), that may be a bug. If you can share:
- One affected Flow Run ID
- The output of `prefect server services ls`
- A snippet of your server logs around the time of the crash

I can help dig in or open an issue with a minimal repro.

Docs for reference:
- Zombie detection: Detect zombie flows