# ask-marvin
@Marvin in Prefect OSS. I have a lot of flows in an "Unknown" state which are seemingly trying to run and thus filling up the underlying RDS DB. Is there a certain table that I can purge that would get rid of these?
thought for 2738.8 seconds
In the Prefect server logs I have errors that look like the following @Marvin:
```
prefect.server.events.ordering.MaxDepthExceeded: occurred=DateTime(2025, 9, 16, 13, 47, 13, 77967, tzinfo=Timezone('UTC')) event='prefect.flow-run.Cancelling' resource=Resource(root={'prefect.state-name': 'Cancelling', 'prefect.state-type': 'CANCELLING', 'prefect.resource.id': 'prefect.flow-run.5a4dcd3b-d771-4291-b571-73887d6ce94c', 'prefect.resource.name': 'glossy-coucal', 'prefect.state-message': 'The parent flow run was cancelled.', 'prefect.state-timestamp'
```
thought for 177.5 seconds
Thanks for sharing that snippet. Here’s what that message means and what to do next.

What it is
- The exception `prefect.server.events.ordering.MaxDepthExceeded` comes from Prefect’s event “causal ordering” component. It tries to keep related events (like flow-run state changes) in logical order even if they arrive slightly out of order.
- “MaxDepthExceeded” means the server saw an event (e.g., `prefect.flow-run.Cancelling`) that appears to depend on a long chain of earlier events, and the internal buffer for tracking those predecessors hit its limit. In short: the ordering buffer was too deep for that event chain.

Is it harmful?
- Generally no for core orchestration. Your flow runs and state transitions are still processed.
- The main effect is that some downstream consumers of events (e.g., automations/triggers or event-driven features) may handle a particular event without strict causal ordering in rare cases. If you are not seeing misfiring automations or missing actions, this is typically just noisy logging.

Why it happens
- Bursts of out-of-order events (e.g., many rapid state changes, cancellations cascaded from parent/child runs)
- Backpressure or temporary delays in event processing
- Occasional clock skew between components (server/workers) can make events “look” out of order

Mitigations and next steps
- Upgrade to the latest Prefect 3.x release (event handling and logging around ordering are improved over time).
- Ensure time sync (NTP) on all machines running the server and workers to minimize perceived out-of-order delivery.
- If you’re seeing real symptoms (automations firing out of order, missed actions), please share:
  - Your Prefect version
  - Self-hosted Server vs. Cloud
  - Database backend (SQLite/Postgres)
  - Frequency and a longer slice of the log (a few lines before/after the error)
  - Whether automations or notifications are behaving incorrectly
- If it’s just log noise and you’re not seeing user-facing issues, it’s safe to ignore.

References
- GitHub code/search for this exception and the causal ordering module: search “MaxDepthExceeded” in the Prefect repo

If you can share the details above, I can dig deeper and suggest more targeted steps.
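If it helps to gather those details, a minimal sketch along these lines will print your installed Prefect version and confirm that a run's state is still being tracked despite the log noise (the flow run ID is a placeholder):
```
# Minimal check: print the installed Prefect version and the current state of
# a flow run, to confirm orchestration is still tracking it despite the error.
# The flow run ID below is a placeholder.
import asyncio

import prefect
from prefect.client.orchestration import get_client


async def check_run(flow_run_id: str):
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        print(f"Prefect version: {prefect.__version__}")
        print(f"{run.name}: state={run.state.name if run.state else None}")


asyncio.run(check_run("<FLOW_RUN_ID>"))
```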
@Marvin this has caused an issue. In my setup I have a parent flow that then triggers a subflow. When I cancel flows in said parent, they end up being stuck in a cancelling state. In the DB, is there a table that stores the state of these flows? I.e., "Unknown", because these flows seem to be running in the background somehow
thought for 183.6 seconds
Thanks for the extra context; getting stuck in Cancelling is definitely not just “log noise.”

Direct answers to your questions
- Yes, Prefect stores current and historical states in your Server database:
  - `flow_runs`: the current (cached) state for each flow run
  - `flow_run_states`: the full state history for each flow run
  - `task_runs` / `task_run_states`: the task analogs
- There is no persisted “Unknown” state in the DB. “Unknown” is a UI/computed status; in the DB you’ll see the last known state (e.g., Cancelling/Running/etc.), not “Unknown”.

Quick SQL you can run to inspect stuck runs
- List flow runs stuck in Cancelling:
```
SELECT id, name, state_type, state_name, state_id, start_time, end_time, parent_task_run_id
FROM flow_runs
WHERE state_name = 'Cancelling'
ORDER BY start_time DESC
LIMIT 100;
```
- Inspect the state history for a specific run:
```
SELECT id, flow_run_id, "type", "name", timestamp, message
FROM flow_run_states
WHERE flow_run_id = '<FLOW_RUN_ID>'
ORDER BY timestamp DESC
LIMIT 50;
```
- Find subflow runs for a given parent flow run:
```
SELECT sf.id, sf.name, sf.state_type, sf.state_name, sf.parent_task_run_id
FROM flow_runs sf
WHERE sf.parent_task_run_id IN (
  SELECT tr.id FROM task_runs tr WHERE tr.flow_run_id = '<PARENT_FLOW_RUN_ID>'
);
```
Why these can get stuck
- When you cancel a parent, the child/subflow typically gets moved to Cancelling with a message like “The parent flow run was cancelled.” Finalizing to Cancelled requires either:
  - the worker/infrastructure to acknowledge and stop the run; or
  - the server’s CancellationCleanup service to eventually finalize the state if the infra has stopped but the final Cancelled event wasn’t observed.
- The MaxDepthExceeded error you saw means the event ordering buffer overflowed during a burst of related events. In some cases, that can prevent the “final” Cancelled transition/event from being processed cleanly, leaving a run stuck in Cancelling.

Things to check right now
- Verify the Server’s cancellation cleanup service is running and healthy. In self-hosted deployment logs, look for entries from the CancellationCleanup service (code for reference: CancellationCleanup service).
- Confirm your workers actually stop the underlying processes/pods when a cancel is issued (e.g., Kubernetes job deletion, Docker container stop, subprocess kill). If the process is still running, the DB will reflect Cancelling or Running even if you want it gone.
- Ensure your Server and Workers are on the latest 3.x versions and clocks are NTP-synced; the event ordering component is sensitive to clock skew and bursty, out-of-order events.

How to unstick affected runs
- Preferred: use the API to push a final Cancelled state. This will not kill the underlying process, so make sure the workload is actually stopped at the infra level.
```
import asyncio
from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def force_cancel(flow_run_id: str):
    async with get_client() as client:
        result = await client.set_flow_run_state(flow_run_id, state=Cancelled(message="Force cancelling stuck run"))
        print(result)

asyncio.run(force_cancel("<FLOW_RUN_ID>"))
```
- If you need to clean up multiple runs, you can list them (via the SQL above or the API) and loop the same client call (see the sketch at the end of this message). Avoid writing directly to the DB; use the API so orchestration rules remain consistent.

Next details that will help me pinpoint the root cause
- Prefect version (Server and Workers)
- Self-hosted Server or Prefect Cloud
- DB backend (Postgres or SQLite)
- Worker type/infra (Kubernetes, Docker, Process) and how you’re cancelling (UI vs. API)
- A short log excerpt around the time you cancel (including any MaxDepthExceeded lines and nearby cancellation messages)

With that, I can recommend concrete fixes (e.g., version-specific patches, worker configuration, or server tuning). For completeness, the ORM classes/tables referenced above are here: Server ORM models (FlowRun, FlowRunState, TaskRun, TaskRunState).
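As an example of looping that same client call, here is a sketch only: it assumes you have exported the flow run IDs from the SQL above to an `ids.txt` file (one UUID per line) and that the underlying workloads are already stopped at the infrastructure level.
```
# Sketch: bulk-finalize stuck runs by looping set_flow_run_state over a list
# of IDs. Assumes ids.txt holds one flow run UUID per line and that the
# underlying infrastructure has already stopped.
import asyncio
from pathlib import Path
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.states import Cancelled


async def force_cancel_many(ids_file: str = "ids.txt"):
    flow_run_ids = [
        UUID(line.strip())
        for line in Path(ids_file).read_text().splitlines()
        if line.strip()
    ]
    async with get_client() as client:
        for flow_run_id in flow_run_ids:
            await client.set_flow_run_state(
                flow_run_id, state=Cancelled(message="Force cancelling stuck run")
            )
    print(f"Finalized {len(flow_run_ids)} runs")


asyncio.run(force_cancel_many())
```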
@Marvin what is that cancellation service you linked me to? What is its purpose?
thought for 21.1 seconds
The CancellationCleanup service is a background server service that periodically finalizes runs that were cancelled but never reached a terminal Cancelled state.

Why it exists
- When you cancel a flow, the state typically goes Running -> Cancelling -> Cancelled.
- If the underlying workload stops but the “final” Cancelled transition/event is missed or delayed (network hiccup, worker crash, event ordering overflow, etc.), the run can linger in Cancelling.
- CancellationCleanup scans for those stuck runs and advances them to a terminal Cancelled state so your UI/queries reflect reality.

How it works at a high level
- Runs on the server alongside other services (Scheduler, LateRuns, etc.)
- On a loop, finds flow/task runs in non-terminal states that look abandoned or in Cancelling for “too long”
- Applies orchestration rules to move them forward to a terminal state (typically Cancelled), ensuring state transitions are valid

Where it lives
- Code: prefect.server.services.cancellation_cleanup.CancellationCleanup
- It’s started automatically with Prefect Server (both in Cloud and self-hosted). You’ll see logs from this service in the server output.

Why this matters for your issue
- Your parent/subflow cancellations ending up stuck in Cancelling is exactly what this service is meant to resolve. If they aren’t being finalized:
  - The service may not be running properly
  - There may be version-specific bugs fixed in a newer release
  - The event ordering “MaxDepthExceeded” may have prevented the final transition; the service should still sweep them eventually

If you can share your Prefect version and server logs around cancellation time, I can help confirm whether CancellationCleanup is active and whether you’re hitting a known edge case.
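To make that concrete, here is a rough illustration of the kind of sweep the service performs, written against the public client API rather than the server internals. The real service runs inside the server with its own database session and orchestration rules, so treat this as a conceptual sketch only; the 30-minute threshold is an arbitrary example, not the service's actual setting.
```
# Conceptual sketch only: approximate the CancellationCleanup idea with the
# public client API. The real service works inside the server and differs in
# detail; the age threshold here is an arbitrary example.
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled


async def sweep(older_than: timedelta = timedelta(minutes=30)):
    cutoff = datetime.now(timezone.utc) - older_than
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.CANCELLING])
                )
            ),
            limit=200,
        )
        for run in runs:
            # Only finalize runs that have sat in Cancelling past the cutoff
            if run.state and run.state.timestamp < cutoff:
                await client.set_flow_run_state(
                    run.id, state=Cancelled(message="Finalized by cleanup sketch")
                )


asyncio.run(sweep())
```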
What version was this added in @Marvin?
thought for 81.5 seconds
Short answer: it’s not new in Prefect 3.x. The CancellationCleanup service has been part of the Prefect Server services since Prefect 2.x and was carried forward into 3.x. I don’t have the exact first tag/PR from here. If you need the precise version it landed in, I can dig up the commit history and send you the PR and release tag.

Meanwhile, here are references:
- Code (current main): CancellationCleanup service
- Repo search: search for “cancellation_cleanup”

Also, which Prefect version are you running (Server and Workers)? That will help me check for any version-specific fixes related to stuck Cancelling runs.
prefect 3.4.4 @Marvin
thought for 124.7 seconds
Thanks, 3.4.4 already includes the CancellationCleanup service, so your server should be sweeping Cancelling runs to Cancelled automatically.

What CancellationCleanup does in your version
- It runs in the background with the server and periodically finds runs stuck in non-terminal states (like Cancelling) after cancellation, then finalizes them to Cancelled.
- Code for reference: CancellationCleanup service

Given you’re still seeing runs stuck in Cancelling, it’s likely one of:
1) The service isn’t actually running in your deployment
2) The underlying workload never stopped (so Prefect keeps the run non-terminal)
3) An event ordering hiccup (like the MaxDepthExceeded you saw) prevented the final Cancelled transition and the sweep didn’t catch it yet

Quick checks
- How are you starting the server? If you’re using a custom ASGI entrypoint (e.g., mounting only the API app with uvicorn/gunicorn), the background services may not be started. The “standard” server start path launches these services. If you can share how you start the server, I’ll confirm whether services are enabled.
- Look for CancellationCleanup messages in the server logs. You should see periodic log lines from that service.
- Make sure the worker actually terminates the workload on cancel (e.g., delete K8s jobs/pods, stop Docker containers, kill local processes).

Ways to unstick affected runs now
- Programmatically finalize them once you’ve confirmed the underlying process is stopped:
```
import asyncio
from prefect.client.orchestration import get_client
from prefect.states import Cancelled

async def force_cancel(flow_run_id: str):
    async with get_client() as client:
        await client.set_flow_run_state(flow_run_id, state=Cancelled(message="Force finalizing stuck run"))

# one run
asyncio.run(force_cancel("<FLOW_RUN_ID>"))
```
- SQL to find the stuck runs:
```
SELECT id, name, state_type, state_name, state_id, start_time, end_time, parent_task_run_id
FROM flow_runs
WHERE state_name = 'Cancelling'
ORDER BY start_time DESC
LIMIT 100;
```
And to view history for a specific run:
```
SELECT id, flow_run_id, "type", "name", timestamp, message
FROM flow_run_states
WHERE flow_run_id = '<FLOW_RUN_ID>'
ORDER BY timestamp DESC
LIMIT 50;
```
Prevention/next steps
- If you’re self-hosting, confirm services are enabled in your server process; if not, we can adjust your startup to include them.
- Ensure time sync (NTP) across server/workers; causal ordering is sensitive to clock skew.
- Consider upgrading to the latest 3.4.x; there have been incremental stability improvements in services/event handling over time. Latest notes: 3.4.18 changelog.

If you can share:
- How you start the server (command/container entrypoint)
- Your work pool/worker type (Kubernetes, Docker, Process)
- A short server log excerpt around cancellation time (including any “CancellationCleanup” and “MaxDepthExceeded” lines)

…I can zero in on whether it’s a services startup issue, a worker stop/kill configuration issue, or an events ordering edge case and give you exact steps to fix.
Short answer: don’t purge Prefect’s tables directly. You’ll almost certainly break referential integrity and the scheduler. Use the API/CLI to delete runs, and if you just need to reclaim space, purge logs/events first or reset the DB entirely. A few points and options:

What “Unknown” typically means
- In Prefect OSS (3.x), “Unknown” in the UI is a fallback when the run’s current state can’t be resolved to a known state name/type. It’s often a stale or partially-updated run, not an actively-executing process.
- These rows won’t consume compute, but they will occupy DB space and the scheduler/services may continue to touch them.

Safe ways to clean up
- Delete individual runs
  - CLI: you can delete by ID:
```
prefect flow-run delete <flow_run_id>
```
  - UI: open the flow run → Actions → Delete
- Bulk delete with a small script (recommended)
  - You can filter runs (by age, state, deployment, tags, etc.) then delete them via the Prefect client.
  - Example: delete runs older than 30 days (adjust as needed):
```
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterStartTime,
)

async def main():
    cutoff = datetime.now(timezone.utc) - timedelta(days=30)
    async with get_client() as client:
        # Pull a batch; re-run this script until cleared
        # Note: Prefect filter fields use trailing underscores (e.g., before_)
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                start_time=FlowRunFilterStartTime(before_=cutoff)
            ),
            limit=1000,
        )
        for r in runs:
            await client.delete_flow_run(r.id)
        print(f"Deleted {len(runs)} runs")

asyncio.run(main())
```
- Example: cancel stale RUNNING runs older than 2 hours (if that’s what “Unknown” actually is in your case):
```
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
    FlowRunFilterStartTime,
)
from prefect.client.schemas.objects import StateType

async def main():
    cutoff = datetime.now(timezone.utc) - timedelta(hours=2)
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                ),
                start_time=FlowRunFilterStartTime(before_=cutoff),
            ),
            limit=1000,
        )
        for r in runs:
            # This deletes the runs outright; to cancel instead, set a
            # Cancelled state via client.set_flow_run_state as shown earlier.
            await client.delete_flow_run(r.id)
        print(f"Deleted {len(runs)} stale running runs")

asyncio.run(main())
```
- Reduce DB growth
  - Logs and events are usually the biggest tables. If you don’t need events persisted, you can disable the Event Persister service: set `PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false` and restart the server.
  - After deletes, Postgres won’t shrink on disk immediately. Autovacuum will reclaim space internally, but to shrink the file you may need a manual VACUUM FULL or maintenance window; coordinate with your DBA.

“Can I purge a specific table?”
- Not recommended. Prefect’s tables have relationships (flow_runs, task_runs, states, logs, events, etc.). Truncating one in isolation can leave dangling references and break the API/UI/scheduler.
- If you truly want a clean slate (dev/non-prod), use:
```
prefect server database reset
```
This drops and recreates all Prefect tables for the current database (nuclear option).

A couple of sanity checks
- Make sure server services are running so stale runs get reconciled over time:
```
prefect server services list-services
```
  Look for MarkLateRuns, CancellationCleanup, etc. If needed, restart:
```
prefect server services stop
prefect server services start
```

If you can share:
- Prefect version (2.x vs 3.x)
- A screenshot or the details pane of an “Unknown” run (state name/type, timestamps)
- Which tables are largest in RDS

I can tailor a precise bulk-cleanup script for your case (e.g., filter by deployment, tags, time ranges) and advise on safe retention settings.
@Marvin I have flows that have an 'unknown' state, where can I find these in the database? I.e., which tables, as Prefect doesn't appear to store these
thought for 247.5 seconds
Short answer: you won’t find a state named “Unknown” in the database. “Unknown” is a UI fallback shown when a run has no current state recorded. The real data lives in the run/state tables:
- Flow runs: `flow_runs`
- Flow run states: `flow_run_states`
- Task runs: `task_runs`
- Task run states: `task_run_states`

What the UI calls “Unknown” generally means one of these is true:
- `flow_runs.state_id` is NULL
- The `state_id` points to a missing row (rare)
- The denormalized columns `flow_runs.state_type`/`state_name` are NULL (because there’s no current state)
- The run has no rows in `flow_run_states` at all

Some practical queries you can use (Postgres/SQLite-style SQL):

Find flow runs with no current state pointer
```
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON fr.state_id = frs.id
WHERE fr.state_id IS NULL OR frs.id IS NULL;
```
Find flow runs that have zero state history
```
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON frs.flow_run_id = fr.id
GROUP BY fr.id, fr.name, fr.created
HAVING COUNT(frs.id) = 0;
```
If your schema includes denormalized columns (common in Prefect 3.x), this is a quick check:
```
SELECT id, name, created
FROM flow_runs
WHERE state_type IS NULL;
```
Notes:
- The authoritative state history is always in `flow_run_states`/`task_run_states`. The current state is referenced by `flow_runs.state_id` (and `task_runs.state_id`) and often denormalized into `state_type`/`state_name` for fast filtering.
- There is no `Unknown` value in the `state_type` enum; it’s not persisted as such.

Docs for reference:
- States concepts
- Server schemas: states
- REST API: flow run states

If you can share which DB backend you’re using (Postgres vs. SQLite) and whether you’re seeing “Unknown” for flows, tasks, or both, I can tailor the exact queries.
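As a cross-check against the SQL above, reading one of these runs through the client should tell the same story: a run the UI labels “Unknown” typically comes back with no current state. A minimal sketch (the flow run ID is a placeholder):
```
# Sketch: confirm via the API that a run the UI shows as "Unknown" has no
# current state. The flow run ID below is a placeholder.
import asyncio

from prefect.client.orchestration import get_client


async def inspect(flow_run_id: str):
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        if run.state is None:
            print(f"{run.name}: no current state recorded (UI shows 'Unknown')")
        else:
            print(f"{run.name}: {run.state.type} / {run.state.name}")


asyncio.run(inspect("<FLOW_RUN_ID>"))
```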
@Marvin so these
```
SELECT fr.id, fr.name, fr.created
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON fr.state_id = frs.id
WHERE fr.state_id IS NULL OR frs.id IS NULL;
```
return 601 rows. Do I delete said rows? Will that purge the "Unknown" flows from the UI?
thought for 220.9 seconds
Short answer: yes, those rows are exactly what the UI renders as “Unknown” (no current state). Do not delete directly from the database. Use Prefect’s API/CLI so referential integrity and denormalized columns stay consistent.

What your query found
- `flow_runs.state_id IS NULL` (or points to a missing row) => UI shows “Unknown”.
- There’s no persisted “UNKNOWN” state; it’s a fallback when the run has no current state.

Options to clean up

1) Remove them entirely (they’ll disappear from the UI)
- CLI, one-off:
```
prefect flow-run delete <FLOW_RUN_ID>
```
- CLI, bulk from a list:
```
# write your 601 ids to a file first (ids.txt)
cat ids.txt | xargs -n1 prefect flow-run delete
```
- Python (async):
```
import asyncio

from prefect.client.orchestration import get_client

ids = [...]  # list of flow run UUIDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.delete_flow_run(rid)

asyncio.run(main())
```
2) “Repair” them by setting a terminal state instead of deleting
- This keeps the run for history but gets it out of “Unknown.”
- REST: Set flow run state
- Python (async):
```
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import State, StateType

ids = [...]  # list of flow run UUIDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.set_flow_run_state(
                rid,
                state=State(type=StateType.FAILED, name="Failed"),
                force=True,  # safe here since there's no current state
            )

asyncio.run(main())
```
Safety checks before bulk actions
- Prefer deleting or setting state via API/CLI instead of SQL DELETE.
- If you only want to target “true orphans,” narrow to runs with no history:
```
SELECT fr.id
FROM flow_runs fr
LEFT JOIN flow_run_states frs ON frs.flow_run_id = fr.id
GROUP BY fr.id
HAVING COUNT(frs.id) = 0;
```
References
- States overview: States concepts
- REST: Set flow run state

If you’d like, share whether you prefer to delete or repair. I can help you turn your SQL result into a quick script to perform the action in bulk.
@Marvin why not SQL delete?
thought for 34.9 seconds
Great question. Short answer: direct SQL DELETE can leave your Prefect database in an inconsistent state. The API/CLI paths do more than remove a row; they keep related tables and denormalized fields consistent and may emit events used elsewhere.

Why not SQL delete?
- Referential integrity and children: Flow runs have related rows in `flow_run_states`, `task_runs`, `task_run_states`, `logs`, `artifacts`, `flow_run_input`, etc. Depending on your DB and schema version, not all FKs are guaranteed to cascade. A raw DELETE from `flow_runs` can orphan children or fail due to FK constraints.
- Denormalized fields: Prefect 3.x keeps convenience columns like `flow_runs.state_type`/`state_name`. API operations maintain these correctly; raw SQL may not.
- Orchestration side-effects: Deleting/closing a run through the API can release resources, emit events, and keep UI counters/queries aligned. Raw SQL bypasses all of that.
- Upgrades and support: The DB schema is an internal implementation detail and can change. Manual edits may work today and break on upgrade, or make support/debugging harder.

What to do instead
- If you want them gone: delete via CLI/API (safe and supported).
```
prefect flow-run delete <FLOW_RUN_ID>
# bulk
cat ids.txt | xargs -n1 prefect flow-run delete
```
- If you want them recorded as “not successful” rather than removed: set a terminal state (e.g., Failed) via the client/REST so they stop showing as Unknown without losing history.
```
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import State, StateType

ids = [...]  # your 601 IDs

async def main():
    async with get_client() as client:
        for rid in ids:
            await client.set_flow_run_state(
                rid,
                state=State(type=StateType.FAILED, name="Failed"),
                force=True,
            )

asyncio.run(main())
```
REST ref: Set flow run state

If you still prefer SQL
- Back up your DB first.
- Only delete runs you’re sure have no children/history, or explicitly delete children first.
- Test on a staging copy. Verify no orphaned `logs`, `artifacts`, `task_runs`, or `*_states` remain.

If you paste a few sample run IDs or confirm whether you’d rather delete vs. mark as Failed, I can provide a ready-to-run script to process all 601 safely.