ESmith
09/16/2025, 1:57 PM

Marvin
09/16/2025, 1:58 PM

ESmith
09/16/2025, 1:59 PM
prefect.server.events.ordering.MaxDepthExceeded: occurred=DateTime(2025, 9, 16, 13, 47, 13, 77967, tzinfo=Timezone('UTC')) event='prefect.flow-run.Cancelling' resource=Resource(root={'prefect.state-name': 'Cancelling', 'prefect.state-type': 'CANCELLING', 'prefect.resource.id': 'prefect.flow-run.5a4dcd3b-d771-4291-b571-73887d6ce94c', 'prefect.resource.name': 'glossy-coucal', 'prefect.state-message': 'The parent flow run was cancelled.', 'prefect.state-timestamp'

Marvin
09/16/2025, 1:59 PM

Marvin
09/16/2025, 2:02 PM
prefect.server.events.ordering.MaxDepthExceeded comes from Prefect’s event “causal ordering” component, which tries to keep related events (like flow-run state changes) in logical order even if they arrive slightly out of order.
- “MaxDepthExceeded” means the server saw an event (e.g., prefect.flow-run.Cancelling) that appears to depend on a long chain of earlier events, and the internal buffer for tracking those predecessors hit its limit. In short: the ordering buffer was too deep for that event chain.
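As a rough mental model only (a toy sketch, not Prefect’s actual code; the class, method, and constant names here are invented for illustration), the ordering component parks events that arrive before their predecessors and gives up once a released chain exceeds a fixed depth:
```python
class MaxDepthExceeded(Exception):
    """Raised when a chain of parked follower events gets too deep."""

MAX_DEPTH = 10

class CausalOrdering:
    def __init__(self) -> None:
        self.seen: set[str] = set()             # event ids already processed
        self.parked: dict[str, list[str]] = {}  # predecessor id -> waiting followers

    def process(self, event_id: str, follows: str | None = None, depth: int = 0) -> None:
        if depth > MAX_DEPTH:
            # The predecessor chain is deeper than the buffer allows
            raise MaxDepthExceeded(event_id)
        if follows is not None and follows not in self.seen:
            # Predecessor hasn't arrived yet: park this event behind it
            self.parked.setdefault(follows, []).append(event_id)
            return
        self.seen.add(event_id)
        # Release followers that were waiting on this event, one level deeper
        for follower in self.parked.pop(event_id, []):
            self.process(follower, None, depth + 1)

ordering = CausalOrdering()
ordering.process("b", follows="a")  # arrives early, gets parked
ordering.process("a")               # releases "b" in causal order
```
A long enough cascade of parked followers (e.g., cancellations rippling through many child runs) is what trips the depth limit.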
Is it harmful?
- Generally no for core orchestration. Your flow runs and state transitions are still processed.
- The main effect is that some downstream consumers of events (e.g., automations/triggers or event-driven features) may handle a particular event without strict causal ordering in rare cases. If you are not seeing misfiring automations or missing actions, this is typically just noisy logging.
Why it happens
- Bursts of out-of-order events (e.g., many rapid state changes, cancellations cascaded from parent/child runs)
- Backpressure or temporary delays in event processing
- Occasional clock skew between components (server/workers) can make events “look” out of order
Mitigations and next steps
- Upgrade to the latest Prefect 3.x release (event handling and logging around ordering have improved over time).
- Ensure time sync (NTP) on all machines running the server and workers to minimize perceived out-of-order delivery.
- If you’re seeing real symptoms (automations firing out of order, missed actions), please share:
- Your Prefect version
- Self-hosted Server vs. Cloud
- Database backend (SQLite/Postgres)
- Frequency and a longer slice of the log (a few lines before/after the error)
- Whether automations or notifications are behaving incorrectly
- If it’s just log noise and you’re not seeing user-facing issues, it’s safe to ignore.
References
- GitHub code/search for this exception and the causal ordering module: Search “MaxDepthExceeded” in Prefect repo
If you can share the details above, I can dig deeper and suggest more targeted steps.

ESmith
09/16/2025, 2:04 PM

Marvin
09/16/2025, 2:04 PM

Marvin
09/16/2025, 2:07 PM
- SQL to find runs stuck in Cancelling:
SELECT id, name, state_type, state_name, state_id, start_time, end_time, parent_task_run_id
FROM flow_runs
WHERE state_name = 'Cancelling'
ORDER BY start_time DESC
LIMIT 100;
- Inspect the state history for a specific run:
SELECT id, flow_run_id, "type", "name", timestamp, message
FROM flow_run_states
WHERE flow_run_id = '<FLOW_RUN_ID>'
ORDER BY timestamp DESC
LIMIT 50;
- Find subflow runs for a given parent flow run:
SELECT sf.id, sf.name, sf.state_type, sf.state_name, sf.parent_task_run_id
FROM flow_runs sf
WHERE sf.parent_task_run_id IN (
SELECT tr.id FROM task_runs tr WHERE tr.flow_run_id = '<PARENT_FLOW_RUN_ID>'
);
Why these can get stuck
- When you cancel a parent, the child/subflow typically gets moved to Cancelling with a message like “The parent flow run was cancelled.” Finalizing to Cancelled requires either:
- the worker/infrastructure to acknowledge and stop the run; or
- the server’s CancellationCleanup service to eventually finalize the state if the infra has stopped but the final Cancelled event wasn’t observed.
- The MaxDepthExceeded error you saw means the event ordering buffer overflowed during a burst of related events. In some cases, that can prevent the “final” Cancelled transition/event from being processed cleanly, leaving a run stuck in Cancelling.
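Related: if you want to enumerate the runs currently stuck in Cancelling without touching the DB, the Python client can do it too. A minimal sketch, assuming the filter classes in prefect.client.schemas.filters:
```python
import asyncio
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)

async def list_stuck_cancelling():
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["Cancelling"])
                )
            ),
            limit=100,
        )
        for run in runs:
            # state_name is the denormalized current state on each run
            print(run.id, run.name, run.state_name)

asyncio.run(list_stuck_cancelling())
```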
Things to check right now
- Verify the Server’s cancellation cleanup service is running and healthy. In self-hosted deployment logs, look for entries from the CancellationCleanup service:
- Code for reference: CancellationCleanup service
- Confirm your workers actually stop the underlying processes/pods when a cancel is issued (e.g., Kubernetes job deletion, Docker container stop, subprocess kill). If the process is still running, the DB will reflect Cancelling or Running even if you want it gone.
- Ensure your Server and Workers are on the latest 3.x versions and clocks are NTP-synced; the event ordering component is sensitive to clock skew and bursty, out-of-order events.
How to unstick affected runs
- Preferred: use the API to push a final Cancelled state. This will not kill the underlying process — make sure the workload is actually stopped at the infra level.
import asyncio
from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def force_cancel(flow_run_id: str):
    async with get_client() as client:
        result = await client.set_flow_run_state(
            flow_run_id, state=Cancelled(message="Force cancelling stuck run")
        )
        print(result)

asyncio.run(force_cancel("<FLOW_RUN_ID>"))
- If you need to clean up multiple runs, you can list them (via the SQL above or the API) and loop the same client call. Avoid writing directly to the DB; use the API so orchestration rules remain consistent.
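For example, a bulk version of the same idea (a sketch combining the Cancelling-state filter with the same set_flow_run_state call; again, make sure the underlying infrastructure is actually stopped first):
```python
import asyncio
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
from prefect.states import Cancelled

async def force_cancel_stuck_runs():
    async with get_client() as client:
        # List everything currently stuck in Cancelling
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["Cancelling"])
                )
            ),
            limit=200,
        )
        for run in runs:
            # Same orchestration-rule-respecting call as the single-run example
            await client.set_flow_run_state(
                run.id, state=Cancelled(message="Force cancelling stuck run")
            )
        print(f"Finalized {len(runs)} runs")

asyncio.run(force_cancel_stuck_runs())
```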
Next details that will help me pinpoint the root cause
- Prefect version (Server and Workers)
- Self-hosted Server or Prefect Cloud
- DB backend (Postgres or SQLite)
- Worker type/infra (Kubernetes, Docker, Process) and how you’re cancelling (UI vs. API)
- A short log excerpt around the time you cancel (including any MaxDepthExceeded lines and nearby cancellation messages)

Marvin
09/16/2025, 2:07 PM

ESmith
09/16/2025, 2:13 PM

Marvin
09/16/2025, 2:13 PM

Marvin
09/16/2025, 2:13 PM

ESmith
09/16/2025, 2:15 PM

Marvin
09/16/2025, 2:15 PM

Marvin
09/16/2025, 2:16 PM

ESmith
09/16/2025, 2:19 PM

Marvin
09/16/2025, 2:20 PM

Marvin
09/16/2025, 2:22 PM
import asyncio
from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def force_cancel(flow_run_id: str):
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id, state=Cancelled(message="Force finalizing stuck run")
        )

# one run
asyncio.run(force_cancel("<FLOW_RUN_ID>"))
- SQL to find the stuck runs:
SELECT id, name, state_type, state_name, state_id, start_time, end_time, parent_task_run_id
FROM flow_runs
WHERE state_name = 'Cancelling'
ORDER BY start_time DESC
LIMIT 100;
And to view history for a specific run:
SELECT id, flow_run_id, "type", "name", timestamp, message
FROM flow_run_states
WHERE flow_run_id = '<FLOW_RUN_ID>'
ORDER BY timestamp DESC
LIMIT 50;
Prevention/next steps
- If you’re self-hosting, confirm services are enabled in your server process; if not, we can adjust your startup to include them.
- Ensure time sync (NTP) across server/workers; causal ordering is sensitive to clock skew.
- Consider upgrading to the latest 3.4.x — there have been incremental stability improvements in services/event handling over time. Latest notes: 3.4.18 changelog.
If you can share:
- How you start the server (command/container entrypoint)
- Your work pool/worker type (Kubernetes, Docker, Process)
- A short server log excerpt around cancellation time (including any “CancellationCleanup” and “MaxDepthExceeded” lines)
…I can zero in on whether it’s a services startup issue, a worker stop/kill configuration issue, or an events ordering edge case and give you exact steps to fix.

Marvin
09/16/2025, 2:43 PM
- CLI: prefect flow-run delete <flow_run_id>
- UI: open the flow run → Actions → Delete
- Bulk delete with a small script (recommended)
- You can filter runs (by age, state, deployment, tags, etc.) then delete them via the Prefect client.
- Example: delete runs older than 30 days (adjust as needed):
import asyncio
from datetime import datetime, timedelta, timezone
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
)

async def main():
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    async with get_client() as client:
        # Pull a batch; re-run this script until cleared
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                start_time=FlowRunFilterStartTime(before_=cutoff)
            ),
            limit=1000,
        )
        for r in runs:
            await client.delete_flow_run(r.id)
        print(f"Deleted {len(runs)} runs")

asyncio.run(main())
- Example: cancel stale RUNNING runs older than 2 hours (if that’s what “Unknown” actually is in your case):
import asyncio
from datetime import datetime, timedelta, timezone
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
    FlowRunFilterStartTime,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

async def main():
    cutoff = datetime.now(timezone.utc) - timedelta(hours=2)
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            ),
            limit=1000,
        )
        for r in runs:
            # Push a terminal Cancelled state (rather than deleting) so history is kept
            await client.set_flow_run_state(
                r.id, state=Cancelled(message="Cancelling stale run")
            )
        print(f"Cancelled {len(runs)} stale running runs")

asyncio.run(main())
- Reduce DB growth
- Logs and events are usually the biggest tables (a quick size check is sketched after this list). If you don’t need events persisted, you can disable the Event Persister service:
- Set PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false and restart the server.
- After deletes, Postgres won’t shrink on disk immediately. Autovacuum will reclaim space internally, but to shrink the file you may need a manual VACUUM FULL or maintenance window—coordinate with your DBA.
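To see which tables dominate before deciding what to prune, a quick size report helps. A sketch, assuming a Postgres backend, the psycopg2 driver, and a placeholder connection string:
```python
import psycopg2  # assumes psycopg2 (or psycopg2-binary) is installed

# Largest tables by total size (heap + indexes + toast) in the Prefect database
QUERY = """
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
"""

with psycopg2.connect("postgresql://user:pass@host:5432/prefect") as conn:
    with conn.cursor() as cur:
        cur.execute(QUERY)
        for table_name, total_size in cur.fetchall():
            print(f"{table_name}: {total_size}")
```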
“Can I purge a specific table?”
- Not recommended. Prefect’s tables have relationships (flow_runs, task_runs, states, logs, events, etc.). Truncating one in isolation can leave dangling references and break the API/UI/scheduler.
- If you truly want a clean slate (dev/non-prod), use:
prefect server database reset
This drops and recreates all Prefect tables for the current database—nuclear option.
A couple of sanity checks
- Make sure server services are running so stale runs get reconciled over time:
Marvin
09/16/2025, 2:43 PM
Look for MarkLateRuns, CancellationCleanup, etc. If needed, restart:
```
prefect server services stop
prefect server services start
```
If you can share:
- Prefect version (2.x vs 3.x)
- A screenshot or the details pane of an “Unknown” run (state name/type, timestamps)
- Which tables are largest in RDS
I can tailor a precise bulk-cleanup script for your case (e.g., filter by deployment, tags, time ranges) and advise on safe retention settings.

ESmith
09/17/2025, 8:38 AM

Marvin
09/17/2025, 8:38 AM

Marvin
09/17/2025, 8:42 AM
- Flow runs: flow_runs
- Flow run states: flow_run_states
- Task runs: task_runs
- Task run states: task_run_states
What the UI calls “Unknown” generally means one of these is true:
- flow_runs.state_id is NULL
- The state_id points to a missing row (rare)
- The denormalized columns `flow_runs.state_type`/`state_name` are NULL (because there’s no current state)
- The run has no rows in flow_run_states at all
Some practical queries you can use (Postgres/SQLite-style SQL):
Find flow runs with no current state pointer
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON fr.state_id = frs.id
WHERE fr.state_id IS NULL OR frs.id IS NULL;
Find flow runs that have zero state history
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON frs.flow_run_id = fr.id
GROUP BY fr.id, fr.name, fr.created
HAVING COUNT(frs.id) = 0;
If your schema includes denormalized columns (common in Prefect 3.x), this is a quick check:
SELECT id, name, created
FROM flow_runs
WHERE state_type IS NULL;
Notes:
- The authoritative state history is always in `flow_run_states`/`task_run_states`. The current state is referenced by flow_runs.state_id (and task_runs.state_id) and often denormalized into `state_type`/`state_name` for fast filtering.
- There is no Unknown value in the state_type enum; it’s not persisted as such.
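If you’d rather not query the DB directly, the same information is reachable through the Python client. A sketch, assuming the client’s read_flow_run and read_flow_run_states methods (present in recent releases):
```python
import asyncio
from prefect.client.orchestration import get_client

async def show_history(flow_run_id: str):
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        # Runs the UI shows as "Unknown" typically come back with state=None
        print("current state:", run.state)
        # Full state history, equivalent to querying flow_run_states
        states = await client.read_flow_run_states(flow_run_id)
        for s in states:
            print(s.timestamp, s.type.value, s.name, s.message)

asyncio.run(show_history("<FLOW_RUN_ID>"))
```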
Docs for reference:
- States concepts
- Server schemas: states
- REST API: flow run states
If you can share which DB backend you’re using (Postgres vs. SQLite) and whether you’re seeing “Unknown” for flows, tasks, or both, I can tailor the exact queries.

ESmith
09/17/2025, 8:47 AMSELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON fr.state_id = frs.id
WHERE fr.state_id IS NULL OR frs.id IS NULL;
Returns 601 rows. Do I delete said rows? Will it purge the "Unknown" flows from the UI or?

Marvin
09/17/2025, 8:47 AM

Marvin
09/17/2025, 8:51 AM
- flow_runs.state_id IS NULL (or points to a missing row) => UI shows “Unknown”.
- There’s no persisted “UNKNOWN” state; it’s a fallback when the run has no current state.
Options to clean up
1) Remove them entirely (they’ll disappear from the UI)
- CLI, one-off:
prefect flow-run delete <FLOW_RUN_ID>
- CLI, bulk from a list:
# write your 601 ids to a file first (ids.txt)
cat ids.txt | xargs -n1 prefect flow-run delete
- Python (async):
import asyncio
from prefect.client.orchestration import get_client

ids = [...]  # list of flow run UUIDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.delete_flow_run(rid)

asyncio.run(main())
2) “Repair” them by setting a terminal state instead of deleting
- This keeps the run for history but gets it out of “Unknown.”
- REST: Set flow run state
- Python (async):
import asyncio
from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import State, StateType

ids = [...]  # list of flow run UUIDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.set_flow_run_state(
                rid,
                state=State(type=StateType.FAILED, name="Failed"),
                force=True,  # safe here since there's no current state
            )

asyncio.run(main())
Safety checks before bulk actions
- Prefer deleting or setting state via API/CLI instead of SQL DELETE.
- If you only want to target “true orphans,” narrow to runs with no history:
SELECT fr.id
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON frs.flow_run_id = fr.id
GROUP BY fr.id
HAVING COUNT(frs.id) = 0;
References
- States overview: States concepts
- REST: Set flow run state
If you’d like, share whether you prefer to delete or repair. I can help you turn your SQL result into a quick script to perform the action in bulk.

ESmith
09/17/2025, 8:53 AM

Marvin
09/17/2025, 8:53 AM

Marvin
09/17/2025, 8:54 AM
- Related tables: a flow run has dependent rows in flow_run_states, task_runs, task_run_states, logs, artifacts, flow_run_input, etc. Depending on your DB and schema version, not all FKs are guaranteed to cascade. A raw DELETE from flow_runs can orphan children or fail due to FK constraints.
- Denormalized fields: Prefect 3.x keeps convenience columns like `flow_runs.state_type`/`state_name`. API operations maintain these correctly; raw SQL may not.
- Orchestration side-effects: Deleting/closing a run through the API can release resources, emit events, and keep UI counters/queries aligned. Raw SQL bypasses all of that.
- Upgrades and support: The DB schema is an internal implementation detail and can change. Manual edits may work today and break on upgrade—or make support/debugging harder.
What to do instead
- If you want them gone: delete via CLI/API (safe and supported).
prefect flow-run delete <FLOW_RUN_ID>
# bulk
cat ids.txt | xargs -n1 prefect flow-run delete
- If you want them recorded as “not successful” rather than removed: set a terminal state (e.g., Failed) via the client/REST so they stop showing as Unknown without losing history.
import asyncio
from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import State, StateType

ids = [...]  # your 601 IDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.set_flow_run_state(
                rid,
                state=State(type=StateType.FAILED, name="Failed"),
                force=True,  # no current state, so force is safe
            )

asyncio.run(main())
REST ref: Set flow run state
If you still prefer SQL
- Back up your DB first.
- Only delete runs you’re sure have no children/history, or explicitly delete children first.
- Test on a staging copy. Verify no orphaned logs, artifacts, task_runs, or *_states remain (a spot check is sketched below).
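For that verification, a spot check like this counts one class of orphans (a sketch for a Postgres backend via psycopg2; the DSN is a placeholder, and you’d adapt the join for logs, artifacts, and the *_states tables):
```python
import psycopg2  # assumes psycopg2 is installed

# Count task_runs whose parent flow run no longer exists (one orphan symptom)
ORPHAN_CHECK = """
SELECT COUNT(*)
FROM task_runs tr
LEFT JOIN flow_runs fr ON fr.id = tr.flow_run_id
WHERE fr.id IS NULL;
"""

with psycopg2.connect("postgresql://user:pass@host:5432/prefect") as conn:
    with conn.cursor() as cur:
        cur.execute(ORPHAN_CHECK)
        print("orphaned task_runs:", cur.fetchone()[0])
```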
If you paste a few sample run IDs or confirm whether you’d rather delete vs. mark as Failed, I can provide a ready-to-run script to process all 601 safely.