# ask-marvin
j
@Marvin I'm seeing this error in the Prefect server
```
File "/usr/local/lib/python3.12/site-packages/prefect/server/events/ordering.py", line 180, in preceding_event_confirmed
    raise MaxDepthExceeded(event)
prefect.server.events.ordering.MaxDepthExceeded
```
when we run a high number of Prefect jobs (1000) with a concurrency limit of 50, what does it mean?
m
thought for 114.5 seconds
Short answer: it's coming from Prefect's events subsystem when it tries to confirm the causal order of an event and finds an unusually long "follows" chain. There's a safety cap (depth 20). Under heavy load (e.g., lots of runs with frequent state changes), a single resource (flow run or task run) can accumulate a deep chain of related events; once the chain exceeds the cap, Prefect raises `MaxDepthExceeded`.

What it means and the impact
- Prefect Server uses event ordering to evaluate automations/triggers in the correct sequence. When a chain is too deep, the server skips evaluating that event for triggers and logs this error.
- Your runs still orchestrate, but any automations depending on precise event ordering for that specific event may not fire.
- This often correlates with "event thrash" (rapid or oscillating state transitions), e.g., repeated cancelling/cancelled flips, aggressive retries, or a long backlog of dependent events due to contention.

Why you might see it with 1000 runs and a concurrency limit of 50
- High contention can create many quick state transitions across the same resources as runs compete for capacity, which lengthens the event "follows" chains.
- If any run gets into a loop (e.g., repeated transitions), that single run can generate thousands of events and trigger this error repeatedly.

References and context
- Code path where the exception is raised: ordering.py
- Related investigation with stack traces and root-cause analysis of long event chains: GitHub issue #15399

What you can do next
- Check for noisy runs:
  - In the UI, look for runs with a large number of state transitions or that seem "stuck" oscillating (e.g., Cancelling <-> Cancelled, or constant retries).
  - If you're self-hosting and comfortable with DB checks, look for a run/resource generating an unusually high count of events and resolve/stop it.
- Reduce event thrash:
  - Avoid rapid cancel/retry loops; add backoff on retries (a sketch follows below).
  - Revisit any automations that might be flipping states frequently.
  - Consider whether your concurrency configuration is causing heavy contention; sometimes increasing concurrency slightly or smoothing submission bursts reduces churn.
- Upgrade to the latest Prefect 3.x:
  - Several improvements have landed around event processing and state transitions. If you can share your exact Prefect version, I can confirm whether relevant fixes are included.
- If you're not using Automations/Triggers:
  - The error is mostly log noise; orchestration still works. But I'd still recommend addressing any runaway runs to prevent table growth and noisy logs.

If you can share:
- Your Prefect version (server and agents)
- Whether you're using Automations/Triggers
- Whether you see repeated state flips (e.g., Cancelling/Cancelled) or unusually long-running "stuck" runs
…I can suggest more targeted steps.
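As an illustration of the "add backoff on retries" point, a minimal sketch of bounded, spaced-out retries; the flow and task names are placeholders, while `retries` and `retry_delay_seconds` are standard `@flow`/`@task` arguments:
```
from prefect import flow, task


# Placeholder task; the point is the bounded retry count and the growing delays,
# which avoid rapid state flips that lengthen a run's event chain.
@task(retries=3, retry_delay_seconds=[10, 60, 300])
def call_external_service() -> str:
    return "ok"


# Bounded flow-level retries with a cooldown instead of immediate re-runs.
@flow(retries=2, retry_delay_seconds=120)
def my_pipeline():
    call_external_service()


if __name__ == "__main__":
    my_pipeline()
```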
j
@Marvin I'm using an automation that checks if a flow run enters the Crashed state and, if it does, changes the state to Scheduled
m
thought for 231.9 seconds
Got it — that automation (on flow-run enters Crashed → set state to Scheduled) is very likely what's creating the long "follows" chain that triggers MaxDepthExceeded. Why: every time the same flow run crashes and your automation reschedules it, you append more state-change events to the same resource. If a run keeps crashing (or is repeatedly rescheduled under contention), that single run's event chain can grow deep enough to hit the server's protective cap (depth 20), which raises `MaxDepthExceeded` and can cause automations on that event to be skipped. See the code path here: ordering.py and a related investigation: GitHub issue #15399.

Recommended adjustments
- Prefer "run a new deployment" over "change state to Scheduled":
  - On Crashed, trigger an action that starts a fresh flow run (Run deployment) and optionally cancels the crashed run. This breaks the long chain on a single resource because each retry is a new flow run.
- Add guardrails to avoid loops:
  - Add a cooldown/delay before the restart so you don't immediately re-crash.
  - Cap restarts (e.g., only restart up to N times per window). If your automation can't express that directly, consider moving to "Run deployment" with fixed parameters and enforce max-attempts in your flow code or via deployment parameters (a sketch follows below).
- Reduce event thrash:
  - If many runs are competing for capacity (1000 runs, limit 50), consider smoothing submissions or slightly adjusting concurrency so runs don't repeatedly bounce.
- Fix the root crash:
  - Since you're on Vertex AI work pools, crashes are often infra-related (preemptions, OOM, image mismatches). Given you've already hit architecture mismatch issues in Docker, also check:
    - Memory/CPU on the Vertex AI worker spec
    - Node preemption settings
    - Image consistency (Python/Prefect versions match worker/server)

Notes
- The max depth is a server safety limit; the practical way to avoid it is to prevent very long causal chains on one run.
- Orchestration continues, but automations tied to those deep-chain events can be skipped when the cap is hit.

If you can share:
- Your Prefect version (server and workers)
- A snapshot of how the automation is configured
- Whether the same runs are repeatedly crashing (vs. lots of unique runs crashing once)
…I can suggest an exact automation configuration (e.g., switch to "Run deployment" + optional delay + cancel) that will minimize event-chain depth and avoid this error.
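For the max-attempts guardrail mentioned above, a minimal sketch of enforcing the cap in flow code via a deployment parameter; the `attempt` parameter, the cap value, and the flow name are hypothetical, and whatever restarts the run would pass `attempt + 1`:
```
from prefect import flow, get_run_logger

MAX_ATTEMPTS = 3  # hypothetical cap on automation-driven restarts


@flow
def resilient_pipeline(attempt: int = 1):
    logger = get_run_logger()
    if attempt > MAX_ATTEMPTS:
        # Break the loop: do no work and exit cleanly instead of crashing again.
        logger.error("Attempt %s exceeds the cap of %s; giving up.", attempt, MAX_ATTEMPTS)
        return
    logger.info("Running attempt %s of %s", attempt, MAX_ATTEMPTS)
    # ... real work here ...
```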
j
This is the trigger JSON
```
{
  "type": "event",
  "match": {
    "prefect.run-count": [
      "0",
      "1"
    ],
    "prefect.resource.id": "prefect.flow-run.*"
  },
  "match_related": {
    "prefect.resource.id": "prefect.deployment.*",
    "prefect.resource.name": "my-deployment"
  },
  "after": [],
  "expect": [
    "prefect.flow-run.Crashed"
  ],
  "for_each": [
    "prefect.resource.id"
  ],
  "posture": "Reactive",
  "threshold": 1,
  "within": 30
}
```
Ideally, if I can use Run deployment instead of Change flow run's state, that would be better.
@Marvin, some more context above
m
thought for 105.4 seconds
Thanks — that helps. An automation that flips a Crashed flow run back to Scheduled is exactly the kind of pattern that can create very long "follows" chains on a single flow run, which leads to the MaxDepthExceeded error. Each crash → reschedule → crash cycle keeps appending events to the same resource, and under load those chains can exceed the server's safety cap.

Safer patterns to avoid deep event chains
- Prefer creating a new run instead of rescheduling the same one:
  - On Crashed, use a "Run deployment" action (with source = inferred) to start a fresh flow run of the same deployment, and do not change the crashed run back to Scheduled.
  - Optionally add a "Cancel flow run" action before running a new deployment; if the run is already terminal Crashed, cancel is a no-op, but it avoids any lingering non-terminal states.
  - Actions reference (code): server/events/actions.py. You'll see `RunDeployment`, `ChangeFlowRunState`, and `CancelFlowRun` there.
- Add guardrails to prevent loops:
  - Put a reasonable upper bound on restarts (e.g., move the retry policy into the flow via `retries` and backoff). Bounded retries keep the per-run event chain short.
  - If you must restart via automation, limit the scope (specific deployment, tag, or flow) and consider a human notification after N restarts.
- Fix the root crash cause:
  - Since you're on Vertex AI work pools, double-check worker image consistency, resource limits (CPU/memory), and preemption. Your past architecture mismatch issues are a common crash source.

Concrete automation change
- Current: Trigger = flow-run enters Crashed; Action = Change state to Scheduled (on the same run) → causes long chains.
- Proposed: Trigger = flow-run enters Crashed; Actions:
  1) Cancel flow run
  2) Run deployment with source = inferred (i.e., "run the same deployment again")

Example shape (conceptual):
```
{
  "trigger": {
    "event": "prefect.flow-run.Crashed"
  },
  "filters": {
    "resource": {
      "kind": "prefect.flow-run"
    }
  },
  "actions": [
    {
      "type": "cancel-flow-run"
    },
    {
      "type": "run-deployment",
      "source": "inferred",
      "parameters": {
        "restart": true
      }
    }
  ]
}
```
Useful references
- Where the error is raised: events/ordering.py
- Related investigation that ties long event chains to loops: GitHub issue #15399
- Automation actions API overview: Automations actions (docs)
- Automations model endpoints: Automations models (docs)

If you share:
- Your Prefect version (server and workers)
- A quick screenshot or JSON export of the automation
- Whether the same runs repeatedly crash vs. many different runs
…I can suggest an exact configuration (including scoping/filters and an example "Run deployment (inferred)" setup) that will minimize event-chain depth and stop the MaxDepthExceeded spam.
j
@Marvin And if I want to re-use the same parameters that the crashed flow had, how can I do that?
m
🔄 Researching your answer... this may take a while. 🔧 Using tool: research_prefect_topic. 📊 Tools used: • research_prefect_topic (1x)
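One way to do this, as a minimal sketch: read the crashed run with the Python client and hand its parameters to `run_deployment`, which starts a fresh run of the same deployment. `read_flow_run` and `run_deployment` are standard Prefect client helpers; the deployment name and flow run ID below are placeholders.
```
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.deployments import run_deployment


async def restart_with_same_parameters(crashed_flow_run_id: UUID) -> None:
    # Look up the crashed run to recover the parameters it was started with.
    async with get_client() as client:
        crashed = await client.read_flow_run(crashed_flow_run_id)

    # Start a brand-new flow run of the same deployment with those parameters,
    # rather than flipping the crashed run back to Scheduled.
    await run_deployment(
        name="my-flow/my-deployment",   # placeholder deployment name
        parameters=crashed.parameters,  # reuse the crashed run's parameters
    )


if __name__ == "__main__":
    asyncio.run(restart_with_same_parameters(UUID("00000000-0000-0000-0000-000000000000")))
```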
j
@Marvin what are the active slots in the concurrency limits?
m
thought for 184.9 seconds
Short answer: "Active slots" are the task runs currently holding a concurrency slot for a given limit.

Details:
- In Prefect 3.x, tag-based concurrency limits are enforced at the task-run level. When a tagged task run starts, it acquires a slot; when it finishes (Completed/Failed/Cancelled/Crashed), the slot is released (a tagged-task sketch follows at the end of this reply).
- If no slot is available, the task run is put into an "AwaitingConcurrencySlot" state and does not count toward active slots until a slot becomes available.
- The CLI shows the exact TaskRun IDs holding slots. For example:
```
prefect concurrency-limit inspect <TAG>
```
  The `active_slots` field in this output is a list of TaskRun IDs currently using the limit.
- Pending/queued/waiting runs do not consume a slot. Only runs that have successfully acquired a slot do. When they exit to a terminal state, the slot is returned.
- If a slot appears stuck (e.g., a run died unexpectedly), you can force-clear with:
```
prefect concurrency-limit reset <TAG>
```
- Work pools (including Vertex AI) don't change how active slots work; they're enforced by the server regardless of where runs execute.

Docs:
- CLI reference (note the `active_slots` description): prefect concurrency-limit CLI
- Guide: Global concurrency limits

Would you like help inspecting a specific tag or clearing stuck slots?
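To make the tag-based model concrete, a minimal sketch; the `database` tag, the limit of 50, and the task/flow names are placeholders, and the limit itself would be created once with `prefect concurrency-limit create database 50`:
```
from prefect import flow, task


# Every task run carrying this tag competes for the "database" limit's slots.
@task(tags=["database"])
def query_warehouse(batch: int) -> int:
    # ... real query here ...
    return batch


@flow
def nightly_load():
    # Only as many of these task runs as the tag's limit allows can hold a slot
    # at once; the rest wait for a slot before starting.
    for batch in range(200):
        query_warehouse.submit(batch)
```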
j
@Marvin I have a concurrency limit of 50 on a deployment, but only 20 flows are running and it doesn't go beyond that. How can I debug what's going on? It should be running 50 at a time.
m
thought for 304.3 seconds
Got it — when you expect 50 concurrent flow runs but only see ~20, it's almost always another limit upstream of the deployment that's the bottleneck. In Prefect 3.x, the smallest of these wins: deployment limit, work pool limit, work queue limit, worker/infrastructure capacity, or external quotas (Vertex AI). Here's a quick debug checklist you can run:

1) Verify the deployment and see its pool/queue
- Confirm the deployment name and the pool/queue it targets:
```
prefect deployment inspect "<flow_name>/<deployment_name>"
```

2) Check work pool concurrency and worker capacity
- Make sure the pool doesn't have a lower concurrency cap (e.g., 20):
```
prefect work-pool inspect "<pool_name>"
```
- Look for `concurrency_limit` and `online_workers`. If the pool is capped at 20, that would explain it (a quick programmatic check is sketched after this checklist).

3) Check work queue concurrency
- Queues can also have their own cap:
```
prefect work-queue inspect "<queue_name>" -p "<pool_name>"
```
- If `concurrency_limit` is set to 20 here, you'll be limited to 20.

4) Inspect the waiting runs' states
- See what the "other 30" runs are doing:
```
prefect flow-run ls --deployment "<flow_name>/<deployment_name>" --state-name Pending,AwaitingWorker,AwaitingConcurrencySlot,Scheduled --limit 200
```
- If you see `AwaitingWorker`: workers/pool can't pick up more work (pool or queue concurrency, or not enough worker capacity).
- If you see `Scheduled` and they aren't moving: worker polling or pool/queue limits again.
- If you see `AwaitingConcurrencySlot`: that's a global tag-based concurrency limit (task-level usually), not deployment-level.

5) Check for global concurrency limits (less likely for flow count, but worth confirming)
- If you've configured global limits by tag:
```
prefect concurrency-limit ls
```
  and for a specific tag:
```
prefect concurrency-limit inspect <TAG>
```
- Docs: Global concurrency limits and prefect concurrency-limit CLI

6) Vertex AI-specific checks (since you're on a Vertex work pool)
- Verify your Vertex AI project/region quotas for concurrent custom jobs; a common default is around 20. If Vertex caps you at 20, Prefect will only have ~20 jobs running while additional runs sit in `AwaitingWorker` or keep retrying submission.
- Check your worker logs (the process running `prefect worker start -p <pool>`) for submission errors like quota/resource-exhausted messages.

If you share:
- The output of `prefect work-pool inspect "<pool_name>"`,
- The output of `prefect work-queue inspect "<queue_name>" -p "<pool_name>"`,
- A snippet of `prefect flow-run ls ...` showing the waiting states,
…I can pinpoint exactly which limit is capping you at 20 and recommend the fix (raising pool/queue concurrency, adding worker capacity, or adjusting Vertex quotas).
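Along the same lines, a minimal Python sketch of the "pool cap vs. actually running" check using the client; `read_work_pool` and `read_flow_runs` are standard client calls, and the pool name is a placeholder:
```
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)


async def check_bottleneck(pool_name: str = "<pool_name>") -> None:
    async with get_client() as client:
        # Pool-level cap (None means no limit set at the pool level).
        pool = await client.read_work_pool(work_pool_name=pool_name)
        print(f"work pool concurrency_limit: {pool.concurrency_limit}")

        # How many flow runs are actually in a Running state right now.
        running = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Running"]))
            ),
            limit=200,
        )
        print(f"currently running flow runs: {len(running)}")


if __name__ == "__main__":
    asyncio.run(check_bottleneck())
```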