Syméon del Marmol
09/10/2025, 11:42 AM
Marvin
09/10/2025, 12:48 PM
Marvin
09/10/2025, 1:01 PM
- To see which runs are stuck, list flow runs in RUNNING:
prefect flow-run ls --state RUNNING
- If you need to manually unstick one while you diagnose, you can cancel it:
prefect flow-run cancel <FLOW_RUN_ID>
Then re-run its deployment.
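If you'd rather do that re-run from Python, here is a minimal sketch using run_deployment; the deployment name and parameters are placeholders, not from your setup:
from prefect.deployments import run_deployment

# Hypothetical deployment name and parameters; substitute your own.
flow_run = run_deployment(
    name="my-flow/my-deployment",
    parameters={"item_id": 123},
    timeout=0,  # return immediately instead of waiting for the run to finish
)
print(flow_run.id)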
How to auto-resubmit reliably
- Create an Automation that triggers on flow runs transitioning to CRASHED and runs the corresponding deployment again. This is the cleanest “self-healing” pattern once the zombie detector is working.
- Docs (Automations): docs.prefect.io (Automations section)
Extra resilience for background tasks
Since you’re launching work via background .delay()
tasks and awaiting via `.result_async()`:
- Always await with a timeout; if it exceeds “heartbeat timeout + buffer”, treat it as a failure and initiate a retry/resubmission path from your app (e.g., cancel the stuck run and submit a new one).
- If you dynamically name tasks via .with_options(name=...), that's fine; names don't affect heartbeats. Just make sure your task worker pods also emit heartbeats so their task runs are detected as zombies if they die.
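For example, a minimal sketch of that timeout pattern; the task module, payload, and timeout values are placeholders you'd adapt to your own heartbeat settings:
import asyncio

from my_app.tasks import process_item  # hypothetical: your background-capable @task

HEARTBEAT_TIMEOUT_S = 90  # assumed server-side zombie threshold
BUFFER_S = 30

async def submit_and_wait(payload: dict):
    future = process_item.delay(payload)  # submit as a background task
    try:
        # Bound the wait; beyond "heartbeat timeout + buffer" treat the run as a zombie.
        return await asyncio.wait_for(
            future.result_async(),
            timeout=HEARTBEAT_TIMEOUT_S + BUFFER_S,
        )
    except asyncio.TimeoutError:
        # Hand off to your remediation path here: cancel the stuck run and resubmit.
        raise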
If you need a belt-and-suspenders watchdog
If you want immediate remediation even if the server’s zombie service were misconfigured, you can run a small periodic job that:
- Finds flow runs in RUNNING with “last heartbeat” older than N minutes,
- Transitions them to CRASHED (or cancels them), then
- Resubmits the same deployment with the same parameters.
I’m happy to share a concrete snippet once we confirm whether you’re on Cloud or OSS, since APIs differ slightly in how you’ll fetch “last heartbeat” and resubmit.
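In the meantime, here is a rough sketch of the shape such a watchdog could take with the OSS Python client. It sidesteps the "last heartbeat" lookup (which varies by version) and simply treats any run that has sat in RUNNING longer than a threshold as stuck, then crashes and resubmits it; the threshold, limit, and message are placeholders:
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Crashed

STUCK_AFTER = timedelta(minutes=15)  # assumed threshold; tune to your heartbeat settings

async def sweep_stuck_runs():
    async with get_client() as client:
        running = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            ),
            limit=200,
        )
        now = datetime.now(timezone.utc)
        for run in running:
            # Age of the current RUNNING state; a crude stand-in for "last heartbeat".
            if now - run.state.timestamp < STUCK_AFTER:
                continue
            await client.set_flow_run_state(
                run.id,
                state=Crashed(message="Marked by watchdog: no progress"),
                force=True,
            )
            if run.deployment_id:
                # Resubmit the same deployment with the same parameters.
                await client.create_flow_run_from_deployment(
                    run.deployment_id, parameters=run.parameters
                )

if __name__ == "__main__":
    asyncio.run(sweep_stuck_runs())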
Could you share a bit more so I can pinpoint the fix?
- Are you on Prefect Cloud or running Prefect Server yourself?
- How are you starting your API/server (e.g., prefect server start, a Helm chart, custom compose)?
- Infra type for your worker (Process worker in k8s? Docker?) and how you're launching flows (via deployments, or invoking flows from a background task)?
Marvin
09/10/2025, 1:01 PM
Syméon del Marmol
09/10/2025, 1:09 PM
Marvin
09/10/2025, 1:09 PM
Marvin
09/10/2025, 1:14 PM
If the server is started with --no-services, or the background services aren't running, zombies won't be marked.
How to check your server services
- List services and whether they’re enabled:
prefect server services ls
You should see rows of services and the env var that toggles each one. Make sure you are not starting the server with --no-services.
- If services are not running, start them:
prefect server services start
Or run your server with services enabled:
prefect server start
Note: prefect server start --no-services runs only the webserver, which will prevent the background detectors from working.
How to confirm heartbeats for a specific run
- Inspect a run to see the heartbeat field and timestamps:
prefect flow-run inspect <FLOW_RUN_ID>
If "last heartbeat" is empty and "Last Updated" equals "Created", it usually means the engine process never started (or died immediately); the server should still mark it CRASHED once the zombie detector runs, assuming services are up.
Make sure the engine is actually emitting heartbeats
- The engine’s heartbeat cadence is controlled by runner settings; set logging to DEBUG in your worker/runner container and look for heartbeat logs.
- Ensure you have a reasonable heartbeat frequency (typically 30s+). If you’re setting it via env var, it’s commonly:
- PREFECT_RUNNER_HEARTBEAT_FREQUENCY=30
- If heartbeats are disabled or the engine never starts, the server won't see any heartbeats to act on.
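To double-check the value the engine will actually use, here is a quick Python check (assuming your Prefect version exposes the setting under this name; on recent releases an unset/None value means flow-run heartbeats are not emitted):
from prefect.settings import PREFECT_RUNNER_HEARTBEAT_FREQUENCY

# Effective heartbeat frequency in seconds; None typically means heartbeats are disabled.
print(PREFECT_RUNNER_HEARTBEAT_FREQUENCY.value())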
Recommended self-healing flow
- Once zombies are marked CRASHED, add an Automation to resubmit the deployment on CRASHED. That’s the cleanest no-downtime pattern for idempotent flows:
- Trigger: Flow run state = CRASHED
- Action: Run Deployment (same deployment)
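If you'd rather manage that automation as code instead of in the UI, here is a rough sketch with the Python automations API; class and field names follow recent 3.x releases, so verify against your version:
from datetime import timedelta

from prefect.automations import Automation
from prefect.events.actions import RunDeployment
from prefect.events.schemas.automations import EventTrigger, Posture

# Assumption: react to any flow run entering Crashed and re-run the deployment
# the triggering run came from (source="inferred"). Scope it further (e.g. to a
# specific tag) with the trigger's match/match_related filters if needed.
Automation(
    name="resubmit-crashed-runs",
    trigger=EventTrigger(
        expect={"prefect.flow-run.Crashed"},
        posture=Posture.Reactive,
        threshold=1,
        within=timedelta(0),
    ),
    actions=[RunDeployment(source="inferred")],
).create()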
Next steps to diagnose your setup
- Can you share:
- The output of:
prefect server services ls
- How you start the server (e.g., the exact prefect server start command/args).
- The output of:
prefect flow-run inspect <FLOW_RUN_ID>
for a run that stayed RUNNING.
- Whether your worker/engine logs show heartbeat messages (set PREFECT_LOGGING_LEVEL=DEBUG temporarily).
With that, I can pinpoint whether this is services not running, the engine not heartbeating, or something else, and suggest the exact fix. In parallel, if you need a safety net while diagnosing, you can also schedule a tiny watchdog job that cancels/resubmits RUNNING runs whose "last heartbeat" is older than N minutes. I can share a snippet once we confirm whether you're on OSS Server or Cloud (you mentioned OSS Server) and which API pattern you prefer (Python client vs CLI).
Syméon del Marmol
09/10/2025, 1:22 PM
Marvin
09/10/2025, 1:22 PM
Syméon del Marmol
09/10/2025, 1:27 PM
Marvin
09/10/2025, 1:27 PM
Marvin
09/10/2025, 1:27 PM
2) Add an Automation to resubmit on CANCELLED as well as CRASHED; tag the runs your backend submits (e.g., auto-resubmit) and have the automation only act on runs with that tag.
- Docs: Detect zombie flows (the same resubmission pattern applies).
3) Add a “first-heartbeat watchdog” in your backend
- When you create a flow run, watch it briefly for a first heartbeat or a transition to RUNNING. If you don't see one within, say, 60–120s (or the run is already CANCELLED, or still pending without movement), treat it as a failed start and resubmit (see the sketch after this list).
- This covers cases where infra/worker disappears before the engine initializes.
- CLI helpers you can use while debugging:
prefect flow-run inspect <FLOW_RUN_ID>
prefect flow-run ls --state RUNNING
prefect deployment run <FLOW_NAME>/<DEPLOYMENT_NAME> --watch
4) Double-check server services are running
- If services aren’t running, the server won’t detect zombies later either. Verify:
prefect server services ls
If needed:
prefect server services start
- Make sure you aren't starting with --no-services.
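A minimal sketch of the startup watchdog from point 3; the deadline, poll interval, and resubmit-by-deployment behavior are assumptions to adapt:
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import StateType

START_DEADLINE_S = 120  # assumed: how long to wait for signs of life
POLL_INTERVAL_S = 5

async def ensure_started_or_resubmit(flow_run_id):
    """Poll a freshly created run; if it never reaches RUNNING, resubmit its deployment."""
    async with get_client() as client:
        waited = 0
        run = await client.read_flow_run(flow_run_id)
        while waited < START_DEADLINE_S:
            if run.state and run.state.type in (
                StateType.RUNNING,
                StateType.COMPLETED,
                StateType.FAILED,
            ):
                return run  # the engine got going; normal monitoring takes over
            if run.state and run.state.type in (StateType.CANCELLED, StateType.CRASHED):
                break  # killed before the engine initialized
            await asyncio.sleep(POLL_INTERVAL_S)
            waited += POLL_INTERVAL_S
            run = await client.read_flow_run(flow_run_id)
        # Never started (or died pre-heartbeat): resubmit the same deployment.
        if run.deployment_id:
            return await client.create_flow_run_from_deployment(
                run.deployment_id, parameters=run.parameters
            )
        return None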
Recommended path for you
- If you can, move these flows to a Docker or Kubernetes work pool. That will decouple flow execution from the worker’s lifecycle and eliminate most of these “pre-heartbeat” cancellations.
- Add a small automation to resubmit on CANCELLED and CRASHED for runs tagged by your backend. That gives you coverage even when the engine never starts.
If you share which work pool type you're using today (Process/Docker/Kubernetes) and whether you want a sample Automation config or a small watchdog snippet, I can tailor the exact steps.
Marvin
09/10/2025, 1:30 PM
- Check which services are enabled:
prefect server services ls
Make sure “CancellationCleanup” and “TaskRunRecorder” show as enabled.
- If you started the server with --no-services, start the services:
prefect server services start
- Then reproduce the crash and watch the server logs for “CancellationCleanup” activity; it should report cleaning up task runs under terminal flow runs.
Quick triage/cleanup if you still have stuck task runs
- You can list RUNNING task runs:
prefect task-run ls --state RUNNING
- And as a temporary remediation, run a small script to mark child task runs CRASHED when their parent is CRASHED/CANCELLED:
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterId,
    TaskRunFilter,
    TaskRunFilterState,
    TaskRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Crashed


async def crash_child_tasks(flow_run_id: str):
    """Mark still-RUNNING task runs under the given flow run as CRASHED."""
    async with get_client() as client:
        # Only look at task runs that are still in a RUNNING state...
        tr_filter = TaskRunFilter(
            state=TaskRunFilterState(
                type=TaskRunFilterStateType(any_=[StateType.RUNNING])
            )
        )
        # ...and only within the affected (crashed/cancelled) flow run.
        fr_filter = FlowRunFilter(id=FlowRunFilterId(any_=[flow_run_id]))
        task_runs = await client.read_task_runs(
            flow_run_filter=fr_filter,
            task_run_filter=tr_filter,
            limit=5000,
        )
        for tr in task_runs:
            # force=True bypasses orchestration rules so the transition is applied.
            await client.set_task_run_state(
                tr.id,
                state=Crashed(message="Parent flow terminated"),
                force=True,
            )


if __name__ == "__main__":
    asyncio.run(crash_child_tasks("<FLOW_RUN_ID>"))
Best-practice pattern to be resilient
- Ensure server services are on (CancellationCleanup, TaskRunRecorder).
- Add an Automation: “On flow run CRASHED → Run deployment” to resubmit idempotent flows.
- Optionally add a small periodic watchdog flow that:
- Finds flow runs in CRASHED/CANCELLED and marks any RUNNING child task runs as CRASHED
- Resubmits the parent run if needed
- In your worker shutdown, prefer graceful SIGTERM so Prefect can cancel child tasks cleanly; for true crash handling, rely on the services above.
If CancellationCleanup is enabled and you still see RUNNING task runs under a CRASHED flow (after a few minutes), that may be a bug. If you can share:
- One affected Flow Run ID
- The output of prefect server services ls
- A snippet of your server logs around the time of the crash
I can help dig in or open an issue with a minimal repro.
Docs for reference:
- Zombie detection: Detect zombie flows