# ask-marvin
@Marvin in Prefect OSS. I have a lot of flows in an "Unknown" state which are seemingly trying to run and thus filling up the underlying RDS DB. Is there a certain table that I can purge that would get rid of these?
thought for 2738.8 seconds
In the Prefect server logs I have errors that look like the following @Marvin:
```
prefect.server.events.ordering.MaxDepthExceeded: occurred=DateTime(2025, 9, 16, 13, 47, 13, 77967, tzinfo=Timezone('UTC')) event='prefect.flow-run.Cancelling' resource=Resource(root={'prefect.state-name': 'Cancelling', 'prefect.state-type': 'CANCELLING', 'prefect.resource.id': 'prefect.flow-run.5a4dcd3b-d771-4291-b571-73887d6ce94c', 'prefect.resource.name': 'glossy-coucal', 'prefect.state-message': 'The parent flow run was cancelled.', 'prefect.state-timestamp'
```
thought for 177.5 seconds
Thanks for sharing that snippet. Here’s what that message means and what to do next.

What it is
- The exception `prefect.server.events.ordering.MaxDepthExceeded` comes from Prefect’s event “causal ordering” component. It tries to keep related events (like flow-run state changes) in logical order even if they arrive slightly out of order.
- “MaxDepthExceeded” means the server saw an event (e.g., `prefect.flow-run.Cancelling`) that appears to depend on a long chain of earlier events, and the internal buffer for tracking those predecessors hit its limit. In short: the ordering buffer was too deep for that event chain.

Is it harmful?
- Generally no for core orchestration. Your flow runs and state transitions are still processed.
- The main effect is that some downstream consumers of events (e.g., automations/triggers or event-driven features) may handle a particular event without strict causal ordering in rare cases. If you are not seeing misfiring automations or missing actions, this is typically just noisy logging.

Why it happens
- Bursts of out-of-order events (e.g., many rapid state changes, cancellations cascaded from parent/child runs)
- Backpressure or temporary delays in event processing
- Occasional clock skew between components (server/workers) can make events “look” out of order

Mitigations and next steps
- Upgrade to the latest Prefect 3.x release (event handling and logging around ordering are improved over time).
- Ensure time sync (NTP) on all machines running the server and workers to minimize perceived out-of-order delivery.
- If you’re seeing real symptoms (automations firing out of order, missed actions), please share:
  - Your Prefect version
  - Self-hosted Server vs. Cloud
  - Database backend (SQLite/Postgres)
  - Frequency and a longer slice of the log (a few lines before/after the error)
  - Whether automations or notifications are behaving incorrectly
- If it’s just log noise and you’re not seeing user-facing issues, it’s safe to ignore.

References
- GitHub code/search for this exception and the causal ordering module: search “MaxDepthExceeded” in the Prefect repo

If you can share the details above, I can dig deeper and suggest more targeted steps.
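If it helps to gather those details, a minimal sketch along these lines will print your installed Prefect version and confirm that a run's state is still being tracked despite the log noise (the flow run ID is a placeholder):
```
# Minimal check: print the installed Prefect version and the current state of
# a flow run, to confirm orchestration is still tracking it despite the error.
# The flow run ID below is a placeholder.
import asyncio

import prefect
from prefect.client.orchestration import get_client


async def check_run(flow_run_id: str):
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        print(f"Prefect version: {prefect.__version__}")
        print(f"{run.name}: state={run.state.name if run.state else None}")


asyncio.run(check_run("<FLOW_RUN_ID>"))
```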
@Marvin this has caused an issue. In my setup I have a parent flow that then triggers a subflow. When I cancel flows in said parent, they end up being stuck in a cancelling state. In the DB, is there a table that stores the state of these flows? I.e., "Unknown", because these flows seem to be running in the background somehow
thought for 183.6 seconds
Thanks for the extra context; getting stuck in Cancelling is definitely not just “log noise.”

Direct answers to your questions
- Yes, Prefect stores current and historical states in your Server database:
  - `flow_runs`: the current (cached) state for each flow run
  - `flow_run_states`: the full state history for each flow run
  - `task_runs` / `task_run_states`: the task analogs
- There is no persisted “Unknown” state in the DB. “Unknown” is a UI/computed status; in the DB you’ll see the last known state (e.g., Cancelling/Running/etc.), not “Unknown”.

Quick SQL you can run to inspect stuck runs
- List flow runs stuck in Cancelling:
```
SELECT id, name, state_type, state_name, state_id, start_time, end_time, parent_task_run_id
FROM flow_runs
WHERE state_name = 'Cancelling'
ORDER BY start_time DESC
LIMIT 100;
```
- Inspect the state history for a specific run:
```
SELECT id, flow_run_id, "type", "name", timestamp, message
FROM flow_run_states
WHERE flow_run_id = '<FLOW_RUN_ID>'
ORDER BY timestamp DESC
LIMIT 50;
```
- Find subflow runs for a given parent flow run:
```
SELECT sf.id, sf.name, sf.state_type, sf.state_name, sf.parent_task_run_id
FROM flow_runs sf
WHERE sf.parent_task_run_id IN (
  SELECT tr.id FROM task_runs tr WHERE tr.flow_run_id = '<PARENT_FLOW_RUN_ID>'
);
```
Why these can get stuck
- When you cancel a parent, the child/subflow typically gets moved to Cancelling with a message like “The parent flow run was cancelled.” Finalizing to Cancelled requires either:
  - the worker/infrastructure to acknowledge and stop the run; or
  - the server’s CancellationCleanup service to eventually finalize the state if the infra has stopped but the final Cancelled event wasn’t observed.
- The MaxDepthExceeded error you saw means the event ordering buffer overflowed during a burst of related events. In some cases, that can prevent the “final” Cancelled transition/event from being processed cleanly, leaving a run stuck in Cancelling.

Things to check right now
- Verify the Server’s cancellation cleanup service is running and healthy. In self-hosted deployment logs, look for entries from the CancellationCleanup service (code for reference: CancellationCleanup service).
- Confirm your workers actually stop the underlying processes/pods when a cancel is issued (e.g., Kubernetes job deletion, Docker container stop, subprocess kill). If the process is still running, the DB will reflect Cancelling or Running even if you want it gone.
- Ensure your Server and Workers are on the latest 3.x versions and clocks are NTP-synced; the event ordering component is sensitive to clock skew and bursty, out-of-order events.

How to unstick affected runs
- Preferred: use the API to push a final Cancelled state. This will not kill the underlying process, so make sure the workload is actually stopped at the infra level.
```
import asyncio
from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def force_cancel(flow_run_id: str):
    async with get_client() as client:
        result = await client.set_flow_run_state(flow_run_id, state=Cancelled(message="Force cancelling stuck run"))
        print(result)

asyncio.run(force_cancel("<FLOW_RUN_ID>"))
```
- If you need to clean up multiple runs, you can list them (via the SQL above or the API) and loop the same client call (see the sketch at the end of this message). Avoid writing directly to the DB; use the API so orchestration rules remain consistent.

Next details that will help me pinpoint the root cause
- Prefect version (Server and Workers)
- Self-hosted Server or Prefect Cloud
- DB backend (Postgres or SQLite)
- Worker type/infra (Kubernetes, Docker, Process) and how you’re cancelling (UI vs. API)
- A short log excerpt around the time you cancel (including any MaxDepthExceeded lines and nearby cancellation messages)

With that, I can recommend concrete fixes (e.g., version-specific patches, worker configuration, or server tuning). For completeness, the ORM classes/tables referenced above are here: Server ORM models (FlowRun, FlowRunState, TaskRun, TaskRunState).
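As an example of looping that same client call, here is a sketch only: it assumes you have exported the flow run IDs from the SQL above to an `ids.txt` file (one UUID per line) and that the underlying workloads are already stopped at the infrastructure level.
```
# Sketch: bulk-finalize stuck runs by looping set_flow_run_state over a list
# of IDs. Assumes ids.txt holds one flow run UUID per line and that the
# underlying infrastructure has already stopped.
import asyncio
from pathlib import Path
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.states import Cancelled


async def force_cancel_many(ids_file: str = "ids.txt"):
    flow_run_ids = [
        UUID(line.strip())
        for line in Path(ids_file).read_text().splitlines()
        if line.strip()
    ]
    async with get_client() as client:
        for flow_run_id in flow_run_ids:
            await client.set_flow_run_state(
                flow_run_id, state=Cancelled(message="Force cancelling stuck run")
            )
    print(f"Finalized {len(flow_run_ids)} runs")


asyncio.run(force_cancel_many())
```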
@Marvin what is that cancellation service you linked me to? What is its purpose?
thought for 21.1 seconds
The CancellationCleanup service is a background server service that periodically finalizes runs that were cancelled but never reached a terminal Cancelled state.

Why it exists
- When you cancel a flow, the state typically goes Running -> Cancelling -> Cancelled.
- If the underlying workload stops but the “final” Cancelled transition/event is missed or delayed (network hiccup, worker crash, event ordering overflow, etc.), the run can linger in Cancelling.
- CancellationCleanup scans for those stuck runs and advances them to a terminal Cancelled state so your UI/queries reflect reality.

How it works at a high level
- Runs on the server alongside other services (Scheduler, LateRuns, etc.)
- On a loop, finds flow/task runs in non-terminal states that look abandoned or in Cancelling for “too long”
- Applies orchestration rules to move them forward to a terminal state (typically Cancelled), ensuring state transitions are valid

Where it lives
- Code: prefect.server.services.cancellation_cleanup.CancellationCleanup
- It’s started automatically with Prefect Server (both in Cloud and self-hosted). You’ll see logs from this service in the server output.

Why this matters for your issue
- Your parent/subflow cancellations ending up stuck in Cancelling is exactly what this service is meant to resolve. If they aren’t being finalized:
  - The service may not be running properly
  - There may be version-specific bugs fixed in a newer release
  - The event ordering “MaxDepthExceeded” may have prevented the final transition; the service should still sweep them eventually

If you can share your Prefect version and server logs around cancellation time, I can help confirm whether CancellationCleanup is active and whether you’re hitting a known edge case.
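To make that concrete, here is a rough illustration of the kind of sweep the service performs, written against the public client API rather than the server internals. The real service runs inside the server with its own database session and orchestration rules, so treat this as a conceptual sketch only; the 30-minute threshold is an arbitrary example, not the service's actual setting.
```
# Conceptual sketch only: approximate the CancellationCleanup idea with the
# public client API. The real service works inside the server and differs in
# detail; the age threshold here is an arbitrary example.
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled


async def sweep(older_than: timedelta = timedelta(minutes=30)):
    cutoff = datetime.now(timezone.utc) - older_than
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.CANCELLING])
                )
            ),
            limit=200,
        )
        for run in runs:
            # Only finalize runs that have sat in Cancelling past the cutoff
            if run.state and run.state.timestamp < cutoff:
                await client.set_flow_run_state(
                    run.id, state=Cancelled(message="Finalized by cleanup sketch")
                )


asyncio.run(sweep())
```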
What version was this added in @Marvin?
thought for 81.5 seconds
Short answer: it’s not new in Prefect 3.x. The CancellationCleanup service has been part of the Prefect Server services since Prefect 2.x and was carried forward into 3.x. I don’t have the exact first tag/PR from here. If you need the precise version it landed in, I can dig up the commit history and send you the PR and release tag.

Meanwhile, here are references:
- Code (current main): CancellationCleanup service
- Repo search: search for “cancellation_cleanup”

Also, which Prefect version are you running (Server and Workers)? That will help me check for any version-specific fixes related to stuck Cancelling runs.
prefect 3.4.4 @Marvin
thought for 124.7 seconds
Thanks, 3.4.4 already includes the CancellationCleanup service, so your server should be sweeping Cancelling runs to Cancelled automatically.

What CancellationCleanup does in your version
- It runs in the background with the server and periodically finds runs stuck in non-terminal states (like Cancelling) after cancellation, then finalizes them to Cancelled.
- Code for reference: CancellationCleanup service

Given you’re still seeing runs stuck in Cancelling, it’s likely one of:
1) The service isn’t actually running in your deployment
2) The underlying workload never stopped (so Prefect keeps the run non-terminal)
3) An event ordering hiccup (like the MaxDepthExceeded you saw) prevented the final Cancelled transition and the sweep didn’t catch it yet

Quick checks
- How are you starting the server? If you’re using a custom ASGI entrypoint (e.g., mounting only the API app with uvicorn/gunicorn), the background services may not be started. The “standard” server start path launches these services. If you can share how you start the server, I’ll confirm whether services are enabled.
- Look for CancellationCleanup messages in the server logs. You should see periodic log lines from that service.
- Make sure the worker actually terminates the workload on cancel (e.g., delete K8s jobs/pods, stop Docker containers, kill local processes).

Ways to unstick affected runs now
- Programmatically finalize them once you’ve confirmed the underlying process is stopped:
```
import asyncio
from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def force_cancel(flow_run_id: str):
    async with get_client() as client:
        await client.set_flow_run_state(flow_run_id, state=Cancelled(message="Force finalizing stuck run"))

# one run
asyncio.run(force_cancel("<FLOW_RUN_ID>"))
```
- SQL to find the stuck runs:
```
SELECT id, name, state_type, state_name, state_id, start_time, end_time, parent_task_run_id
FROM flow_runs
WHERE state_name = 'Cancelling'
ORDER BY start_time DESC
LIMIT 100;
```
And to view history for a specific run:
```
SELECT id, flow_run_id, "type", "name", timestamp, message
FROM flow_run_states
WHERE flow_run_id = '<FLOW_RUN_ID>'
ORDER BY timestamp DESC
LIMIT 50;
```
Prevention/next steps
- If you’re self-hosting, confirm services are enabled in your server process; if not, we can adjust your startup to include them.
- Ensure time sync (NTP) across server/workers; causal ordering is sensitive to clock skew.
- Consider upgrading to the latest 3.4.x; there have been incremental stability improvements in services/event handling over time. Latest notes: 3.4.18 changelog.

If you can share:
- How you start the server (command/container entrypoint)
- Your work pool/worker type (Kubernetes, Docker, Process)
- A short server log excerpt around cancellation time (including any “CancellationCleanup” and “MaxDepthExceeded” lines)

…I can zero in on whether it’s a services startup issue, a worker stop/kill configuration issue, or an events ordering edge case and give you exact steps to fix.
Short answer: don’t purge Prefect’s tables directly. You’ll almost certainly break referential integrity and the scheduler. Use the API/CLI to delete runs, and if you just need to reclaim space, purge logs/events first or reset the DB entirely. A few points and options:

What “Unknown” typically means
- In Prefect OSS (3.x), “Unknown” in the UI is a fallback when the run’s current state can’t be resolved to a known state name/type. It’s often a stale or partially-updated run, not an actively-executing process.
- These rows won’t consume compute, but they will occupy DB space and the scheduler/services may continue to touch them.

Safe ways to clean up
- Delete individual runs
  - CLI: you can delete by ID:
```
prefect flow-run delete <flow_run_id>
```
  - UI: open the flow run → Actions → Delete
- Bulk delete with a small script (recommended)
  - You can filter runs (by age, state, deployment, tags, etc.) then delete them via the Prefect client.
  - Example: delete runs older than 30 days (adjust as needed):
```
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
)

async def main():
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    async with get_client() as client:
        # Pull a batch; re-run this script until cleared
        # Note: Prefect filter fields use trailing underscores (e.g., before_)
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                start_time=FlowRunFilterStartTime(before_=cutoff)
            ),
            limit=1000,
        )
        for r in runs:
            await client.delete_flow_run(r.id)
        print(f"Deleted {len(runs)} runs")

asyncio.run(main())
```
- Example: cancel stale RUNNING runs older than 2 hours (if that’s what “Unknown” actually is in your case):
```
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
    FlowRunFilterStartTime,
)
from prefect.client.schemas.objects import StateType

async def main():
    cutoff = datetime.now(timezone.utc) - timedelta(hours=2)
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            ),
            limit=1000,
        )
        for r in runs:
            # This deletes the runs outright; to cancel instead, set a
            # Cancelled state via client.set_flow_run_state as shown earlier.
            await client.delete_flow_run(r.id)
        print(f"Deleted {len(runs)} stale running runs")

asyncio.run(main())
```
- Reduce DB growth
  - Logs and events are usually the biggest tables. If you don’t need events persisted, you can disable the Event Persister service: set `PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false` and restart the server.
  - After deletes, Postgres won’t shrink on disk immediately. Autovacuum will reclaim space internally, but to shrink the file you may need a manual VACUUM FULL or maintenance window; coordinate with your DBA.

“Can I purge a specific table?”
- Not recommended. Prefect’s tables have relationships (flow_runs, task_runs, states, logs, events, etc.). Truncating one in isolation can leave dangling references and break the API/UI/scheduler.
- If you truly want a clean slate (dev/non-prod), use:
```
prefect server database reset
```
This drops and recreates all Prefect tables for the current database (nuclear option).

A couple of sanity checks
- Make sure server services are running so stale runs get reconciled over time:
```
prefect server services list-services
```
  Look for MarkLateRuns, CancellationCleanup, etc. If needed, restart:
```
prefect server services stop
prefect server services start
```

If you can share:
- Prefect version (2.x vs 3.x)
- A screenshot or the details pane of an “Unknown” run (state name/type, timestamps)
- Which tables are largest in RDS

I can tailor a precise bulk-cleanup script for your case (e.g., filter by deployment, tags, time ranges) and advise on safe retention settings.
@Marvin I have flows that have an 'unknown' state, where can I find these in the database? I.e., which tables, as Prefect doesn't appear to store these
thought for 247.5 seconds
Short answer: you won’t find a state named “Unknown” in the database. “Unknown” is a UI fallback shown when a run has no current state recorded. The real data lives in the run/state tables:
- Flow runs: `flow_runs`
- Flow run states: `flow_run_states`
- Task runs: `task_runs`
- Task run states: `task_run_states`

What the UI calls “Unknown” generally means one of these is true:
- `flow_runs.state_id` is NULL
- The `state_id` points to a missing row (rare)
- The denormalized columns `flow_runs.state_type`/`state_name` are NULL (because there’s no current state)
- The run has no rows in `flow_run_states` at all

Some practical queries you can use (Postgres/SQLite-style SQL):

Find flow runs with no current state pointer
```
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON fr.state_id = frs.id
WHERE fr.state_id IS NULL OR frs.id IS NULL;
```
Find flow runs that have zero state history
```
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON frs.flow_run_id = fr.id
GROUP BY fr.id, fr.name, fr.created
HAVING COUNT(frs.id) = 0;
```
If your schema includes denormalized columns (common in Prefect 3.x), this is a quick check:
```
SELECT id, name, created
FROM flow_runs
WHERE state_type IS NULL;
```
Notes:
- The authoritative state history is always in `flow_run_states`/`task_run_states`. The current state is referenced by `flow_runs.state_id` (and `task_runs.state_id`) and often denormalized into `state_type`/`state_name` for fast filtering.
- There is no `Unknown` value in the `state_type` enum; it’s not persisted as such.

Docs for reference:
- States concepts
- Server schemas: states
- REST API: flow run states

If you can share which DB backend you’re using (Postgres vs. SQLite) and whether you’re seeing “Unknown” for flows, tasks, or both, I can tailor the exact queries.
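As a cross-check against the SQL above, reading one of these runs through the client should tell the same story: a run the UI labels “Unknown” typically comes back with no current state. A minimal sketch (the flow run ID is a placeholder):
```
# Sketch: confirm via the API that a run the UI shows as "Unknown" has no
# current state. The flow run ID below is a placeholder.
import asyncio

from prefect.client.orchestration import get_client


async def inspect(flow_run_id: str):
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        if run.state is None:
            print(f"{run.name}: no current state recorded (UI shows 'Unknown')")
        else:
            print(f"{run.name}: {run.state.type} / {run.state.name}")


asyncio.run(inspect("<FLOW_RUN_ID>"))
```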
@Marvin so these
```
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON fr.state_id = frs.id
WHERE fr.state_id IS NULL OR frs.id IS NULL;
```
return 601 rows. Do I delete said rows? Will that purge the "Unknown" flows from the UI?
thought for 220.9 seconds
Short answer: yes, those rows are exactly what the UI renders as “Unknown” (no current state). Do not delete directly from the database. Use Prefect’s API/CLI so referential integrity and denormalized columns stay consistent.

What your query found
- `flow_runs.state_id IS NULL` (or points to a missing row) => UI shows “Unknown”.
- There’s no persisted “UNKNOWN” state; it’s a fallback when the run has no current state.

Options to clean up

1) Remove them entirely (they’ll disappear from the UI)
- CLI, one-off:
```
prefect flow-run delete <FLOW_RUN_ID>
```
- CLI, bulk from a list:
```
# write your 601 ids to a file first (ids.txt)
cat ids.txt | xargs -n1 prefect flow-run delete
```
- Python (async):
```
import asyncio

from prefect.client.orchestration import get_client

ids = [...]  # list of flow run UUIDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.delete_flow_run(rid)

asyncio.run(main())
```
2) “Repair” them by setting a terminal state instead of deleting
- This keeps the run for history but gets it out of “Unknown.”
- REST: Set flow run state
- Python (async):
```
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import State, StateType

ids = [...]  # list of flow run UUIDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.set_flow_run_state(
                rid,
                state=State(type=StateType.FAILED, name="Failed"),
                force=True,  # safe here since there's no current state
            )

asyncio.run(main())
```
Safety checks before bulk actions
- Prefer deleting or setting state via API/CLI instead of SQL DELETE.
- If you only want to target “true orphans,” narrow to runs with no history:
```
SELECT fr.id
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON frs.flow_run_id = fr.id
GROUP BY fr.id
HAVING COUNT(frs.id) = 0;
```
References
- States overview: States concepts
- REST: Set flow run state

If you’d like, share whether you prefer to delete or repair. I can help you turn your SQL result into a quick script to perform the action in bulk.
@Marvin why not SQL delete?
thought for 34.9 seconds
Great question. Short answer: direct SQL DELETE can leave your Prefect database in an inconsistent state. The API/CLI paths do more than remove a row; they keep related tables and denormalized fields consistent and may emit events used elsewhere.

Why not SQL delete?
- Referential integrity and children: Flow runs have related rows in `flow_run_states`, `task_runs`, `task_run_states`, `logs`, `artifacts`, `flow_run_input`, etc. Depending on your DB and schema version, not all FKs are guaranteed to cascade. A raw DELETE from `flow_runs` can orphan children or fail due to FK constraints.
- Denormalized fields: Prefect 3.x keeps convenience columns like `flow_runs.state_type`/`state_name`. API operations maintain these correctly; raw SQL may not.
- Orchestration side-effects: Deleting/closing a run through the API can release resources, emit events, and keep UI counters/queries aligned. Raw SQL bypasses all of that.
- Upgrades and support: The DB schema is an internal implementation detail and can change. Manual edits may work today and break on upgrade, or make support/debugging harder.

What to do instead
- If you want them gone: delete via CLI/API (safe and supported).
```
prefect flow-run delete <FLOW_RUN_ID>
# bulk
cat ids.txt | xargs -n1 prefect flow-run delete
```
- If you want them recorded as “not successful” rather than removed: set a terminal state (e.g., Failed) via the client/REST so they stop showing as Unknown without losing history.
```
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import State, StateType

ids = [...]  # your 601 IDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.set_flow_run_state(
                rid,
                state=State(type=StateType.FAILED, name="Failed"),
                force=True,
            )

asyncio.run(main())
```
REST ref: Set flow run state

If you still prefer SQL
- Back up your DB first.
- Only delete runs you’re sure have no children/history, or explicitly delete children first.
- Test on a staging copy. Verify no orphaned `logs`, `artifacts`, `task_runs`, or `*_states` remain.

If you paste a few sample run IDs or confirm whether you’d rather delete vs. mark as Failed, I can provide a ready-to-run script to process all 601 safely.