José Agretti
09/28/2025, 2:54 PM
File "/usr/local/lib/python3.12/site-packages/prefect/server/events/ordering.py", line 180, in preceding_event_confirmed
    raise MaxDepthExceeded(event)
prefect.server.events.ordering.MaxDepthExceeded
When we run a high number of Prefect jobs (1000) with a concurrency limit of 50, what does it mean?
Marvin
09/28/2025, 2:54 PM
Marvin
09/28/2025, 2:56 PM
That traceback comes from Prefect Server's event-ordering logic: when the chain of "follows" relationships behind an event gets too deep, the server raises MaxDepthExceeded.
What it means and the impact
- Prefect Server uses event ordering to evaluate automations/triggers in the correct sequence. When a chain is too deep, the server skips evaluating that event for triggers and logs this error.
- Your runs still orchestrate, but any automations depending on precise event ordering for that specific event may not fire.
- This often correlates with “event thrash” (rapid or oscillating state transitions), e.g., repeated cancelling/cancelled flips, aggressive retries, or a long backlog of dependent events due to contention.
Why you might see it with 1000 runs and a concurrency limit of 50
- High contention can create many quick state transitions across the same resources as runs compete for capacity, which lengthens the event “follows” chains.
- If any run gets into a loop (e.g., repeated transitions), that single run can generate thousands of events and trigger this error repeatedly.
References and context
- Code path where the exception is raised: ordering.py
- Related investigation with stack traces and root-cause analysis of long event chains: GitHub issue #15399
What you can do next
- Check for noisy runs:
- In the UI, look for runs with a large number of state transitions or that seem “stuck” oscillating (e.g., Cancelling <-> Cancelled, or constant retries).
- If you’re self-hosting and comfortable with DB checks, look for a run/resource generating an unusually high count of events and resolve/stop it.
- Reduce event thrash:
- Avoid rapid cancel/retry loops; add backoff on retries (a retry/backoff sketch follows this list).
- Revisit any automations that might be flipping states frequently.
- Consider whether your concurrency configuration is causing heavy contention; sometimes increasing concurrency slightly or smoothing submission bursts reduces churn.
- Upgrade to the latest Prefect 3.x:
- Several improvements have landed around event processing and state transitions. If you can share your exact Prefect version, I can confirm whether relevant fixes are included.
- If you’re not using Automations/Triggers:
- The error is mostly log noise; orchestration still works. But I’d still recommend addressing any runaway runs to prevent table growth and noisy logs.
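For the backoff point above, here is a minimal sketch of bounded retries with exponential backoff (standard Prefect task/flow decorator options; the task name, flow name, and delay values are placeholders):

from prefect import flow, task
from prefect.tasks import exponential_backoff

@task(
    retries=3,
    retry_delay_seconds=exponential_backoff(backoff_factor=30),
    retry_jitter_factor=0.5,  # jitter spreads retries out instead of re-firing in lockstep
)
def call_flaky_service():
    ...

@flow(retries=1, retry_delay_seconds=120)  # a single flow-level retry as a last resort
def my_flow():
    call_flaky_service()

Bounded retries with growing delays keep a single run from emitting rapid bursts of state-change events, which is what lengthens the event chain.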
If you can share:
- Your Prefect version (server and agents)
- Whether you’re using Automations/Triggers
- Whether you see repeated state flips (e.g., Cancelling/Cancelled) or unusually long-running “stuck” runs
…I can suggest more targeted steps.
José Agretti
09/28/2025, 3:01 PM
Marvin
09/28/2025, 3:01 PM
Marvin
09/28/2025, 3:05 PM
Re-scheduling the same crashed flow run from an automation keeps extending the causal chain of events on that one run; once the chain is deep enough the server raises MaxDepthExceeded and can cause automations on that event to be skipped. See the code path here: ordering.py and a related investigation: GitHub issue #15399.
Recommended adjustments
- Prefer “run a new deployment” over “change state to Scheduled”:
- On Crashed, trigger an action that starts a fresh flow run (Run deployment) and optionally cancels the crashed run. This breaks the long chain on a single resource because each retry is a new flow run.
- Add guardrails to avoid loops:
- Add a cooldown/delay before the restart so you don’t immediately re-crash.
- Cap restarts (e.g., only restart up to N times per window). If your automation can’t express that directly, consider moving to “Run deployment” with fixed parameters and enforce max-attempts in your flow code or via deployment parameters (a sketch follows the notes below).
- Reduce event thrash:
- If many runs are competing for capacity (1000 runs, limit 50), consider smoothing submissions or slightly adjusting concurrency so runs don’t repeatedly bounce.
- Fix the root crash:
- Since you’re on Vertex AI work pools, crashes are often infra-related (preemptions, OOM, image mismatches). Given you’ve already hit architecture mismatch issues in Docker, also check:
- Memory/CPU on the Vertex AI worker spec
- Node preemption settings
- Image consistency (Python/Prefect versions match worker/server)
Notes
- The max depth is a server safety limit; the practical way to avoid it is to prevent very long causal chains on one run.
- Orchestration continues, but automations tied to those deep-chain events can be skipped when the cap is hit.
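To make the “cap restarts” guardrail concrete, here is a rough sketch that enforces a maximum number of restarts inside the flow via a deployment parameter. The restart_attempt parameter and MAX_RESTARTS value are hypothetical; whatever restarts the run (e.g., a “Run deployment” action) would pass an incremented value:

from prefect import flow, get_run_logger

MAX_RESTARTS = 3  # assumed cap; tune for your workload

@flow
def my_flow(restart_attempt: int = 0):
    # restart_attempt is a hypothetical parameter that the restarting
    # automation or caller would increment on each restart after a crash.
    logger = get_run_logger()
    if restart_attempt >= MAX_RESTARTS:
        logger.warning("Reached max restarts (%s); stopping instead of looping.", MAX_RESTARTS)
        return
    ...  # actual work here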
If you can share:
- Your Prefect version (server and workers)
- A snapshot of how the automation is configured
- Whether the same runs are repeatedly crashing (vs. lots of unique runs crashing once)
…I can suggest an exact automation configuration (e.g., switch to “Run deployment” + optional delay + cancel) that will minimize event-chain depth and avoid this error.
José Agretti
09/28/2025, 3:41 PM
{
  "type": "event",
  "match": {
    "prefect.run-count": ["0", "1"],
    "prefect.resource.id": "prefect.flow-run.*"
  },
  "match_related": {
    "prefect.resource.id": "prefect.deployment.*",
    "prefect.resource.name": "my-deployment"
  },
  "after": [],
  "expect": ["prefect.flow-run.Crashed"],
  "for_each": ["prefect.resource.id"],
  "posture": "Reactive",
  "threshold": 1,
  "within": 30
}
Ideally, if I could use Run deployment instead of Change flow run's state, that would be better.
José Agretti
09/28/2025, 3:42 PM
Marvin
09/28/2025, 3:42 PM
Marvin
09/28/2025, 3:43 PM
Yes, that works: the automations actions API exposes RunDeployment, ChangeFlowRunState, and CancelFlowRun (see the Automations actions docs linked below), so you can swap the state-change action for a Run deployment action. A few recommendations:
- Add guardrails to prevent loops:
- Put a reasonable upper bound on restarts (e.g., move the retry policy into the flow via retries and backoff). Bounded retries keep the per-run event chain short.
- If you must restart via automation, limit the scope (specific deployment, tag, or flow) and consider a human notification after N restarts.
- Fix the root crash cause:
- Since you’re on Vertex AI work pools, double‑check worker image consistency, resource limits (CPU/memory), and preemption. Your past architecture mismatch issues are a common crash source.
Concrete automation change
- Current: Trigger = flow-run enters Crashed; Action = Change state to Scheduled (on the same run) → causes long chains.
- Proposed: Trigger = flow-run enters Crashed; Actions:
1) Cancel flow run
2) Run deployment with source = inferred (i.e., “run the same deployment again”)
Example shape (conceptual)
{
  "trigger": {
    "event": "prefect.flow-run.Crashed"
  },
  "filters": {
    "resource": {
      "kind": "prefect.flow-run"
    }
  },
  "actions": [
    { "type": "cancel-flow-run" },
    {
      "type": "run-deployment",
      "source": "inferred",
      "parameters": { "restart": true }
    }
  ]
}
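For reference, here is roughly the same idea expressed with the Prefect 3.x Python automation objects. Treat it as a sketch: the automation name and the restart_attempt parameter are placeholders, and the trigger mirrors the JSON you shared earlier:

from datetime import timedelta

from prefect.automations import Automation
from prefect.events.actions import CancelFlowRun, RunDeployment
from prefect.events.schemas.automations import EventTrigger, Posture

automation = Automation(
    name="restart-crashed-runs",  # placeholder name
    trigger=EventTrigger(
        expect={"prefect.flow-run.Crashed"},
        match={
            "prefect.run-count": ["0", "1"],
            "prefect.resource.id": "prefect.flow-run.*",
        },
        match_related={
            "prefect.resource.id": "prefect.deployment.*",
            "prefect.resource.name": "my-deployment",
        },
        for_each={"prefect.resource.id"},
        posture=Posture.Reactive,
        threshold=1,
        within=timedelta(seconds=30),
    ),
    actions=[
        CancelFlowRun(),
        # source="inferred" re-runs the deployment associated with the
        # triggering flow run, so each restart is a brand-new flow run
        # instead of another state change on the crashed one.
        RunDeployment(source="inferred", parameters={"restart_attempt": 1}),
    ],
)
automation.create()  # registers the automation with the API (Prefect 3.x)

The UI form can express the same thing; the key point is that both actions target a fresh run rather than re-scheduling the crashed one.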
Useful references
- Where the error is raised: events/ordering.py
- Related investigation that ties long event chains to loops: GitHub issue #15399
- Automation actions API overview: Automations actions (docs)
- Automations model endpoints: Automations models (docs)
If you share:
- Your Prefect version (server and workers)
- A quick screenshot or JSON export of the automation
- Whether the same runs repeatedly crash vs many different runs
…I can suggest an exact configuration (including scoping/filters and an example “Run deployment (inferred)” setup) that will minimize event-chain depth and stop the MaxDepthExceeded spam.
José Agretti
09/28/2025, 3:58 PM
Marvin
09/28/2025, 3:58 PM
📊 Tools used:
• research_prefect_topic (1x)
José Agretti
09/29/2025, 6:54 PM
Marvin
09/29/2025, 6:54 PM
Marvin
09/29/2025, 6:57 PM
- You can see which runs currently hold slots for a tag with:
prefect concurrency-limit inspect <TAG>
The active_slots field in this output is a list of TaskRun IDs currently using the limit (a programmatic version is sketched after these bullets).
- Pending/queued/waiting runs do not consume a slot. Only runs that have successfully acquired a slot do. When they exit to a terminal state, the slot is returned.
- If a slot appears stuck (e.g., a run died unexpectedly), you can force-clear with:
prefect concurrency-limit reset <TAG>
- Work pools (including Vertex AI) don’t change how active slots work—they’re enforced by the server regardless of where runs execute.
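If you'd rather check from Python than the CLI, here is a small sketch using the client (the tag name is a placeholder):

import asyncio

from prefect import get_client

async def show_active_slots(tag: str) -> None:
    async with get_client() as client:
        limit = await client.read_concurrency_limit_by_tag(tag)
        # active_slots is the list of task run IDs currently occupying the limit
        print(f"{tag}: {limit.concurrency_limit} slots, {len(limit.active_slots)} in use")
        for task_run_id in limit.active_slots:
            print(f"  task run {task_run_id}")

asyncio.run(show_active_slots("my-tag"))  # "my-tag" is a placeholder tag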
Docs:
- CLI reference (note the active_slots description): prefect concurrency-limit CLI
- Guide: Global concurrency limits
Would you like help inspecting a specific tag or clearing stuck slots?
José Agretti
09/29/2025, 7:17 PM
Marvin
09/29/2025, 7:17 PM
Marvin
09/29/2025, 7:22 PM
1) Check the deployment's own configuration
- Confirm the limit you expect (50) is what's actually configured:
prefect deployment inspect "<flow_name>/<deployment_name>"
2) Check work pool concurrency and worker capacity
- Make sure the pool doesn’t have a lower concurrency cap (e.g., 20):
prefect work-pool inspect "<pool_name>"
Look for concurrency_limit and online_workers. If the pool is capped at 20, that would explain it (a programmatic check of pool/queue caps is sketched after this list).
3) Check work queue concurrency
- Queues can also have their own cap:
prefect work-queue inspect "<queue_name>" -p "<pool_name>"
If concurrency_limit is set to 20 here, you’ll be limited to 20.
4) Inspect the waiting runs’ states
- See what the “other 30” runs are doing:
prefect flow-run ls --deployment "<flow_name>/<deployment_name>" --state-name Pending,AwaitingWorker,AwaitingConcurrencySlot,Scheduled --limit 200
- If you see `AwaitingWorker`: workers/pool can’t pick up more work (pool or queue concurrency; or not enough worker capacity).
- If you see Scheduled and they aren’t moving: worker polling or pool/queue limits again.
- If you see `AwaitingConcurrencySlot`: that’s a global tag-based concurrency limit (task-level usually), not deployment-level.
5) Check for global concurrency limits (less likely for flow count, but worth confirming)
- If you’ve configured global limits by tag:
prefect concurrency-limit ls
and for a specific tag:
prefect concurrency-limit inspect <TAG>
Docs: Global concurrency limits and prefect concurrency-limit CLI
6) Vertex AI-specific checks (since you’re on a Vertex work pool)
- Verify your Vertex AI project/region quotas for concurrent custom jobs; a common default is around 20. If Vertex caps you at 20, Prefect will only have ~20 jobs running while additional runs sit in AwaitingWorker or keep retrying submission.
- Check your worker logs (the process running prefect worker start -p <pool>) for submission errors like quota/resource-exhausted messages.
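For steps 2-3, here is a quick programmatic way to read the pool- and queue-level caps (the pool name is a placeholder; the fields mirror what the inspect commands print):

import asyncio

from prefect import get_client

async def check_caps(pool_name: str) -> None:
    async with get_client() as client:
        pool = await client.read_work_pool(pool_name)
        print(f"pool {pool.name}: concurrency_limit={pool.concurrency_limit}")
        for queue in await client.read_work_queues(work_pool_name=pool_name):
            print(f"  queue {queue.name}: concurrency_limit={queue.concurrency_limit}")

asyncio.run(check_caps("my-vertex-pool"))  # placeholder pool name

A concurrency_limit of 20 at either level would line up with runs capping out at 20.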
If you share:
- The output of prefect work-pool inspect "<pool_name>",
- The output of prefect work-queue inspect "<queue_name>" -p "<pool_name>",
- A snippet of prefect flow-run ls ... showing the waiting states,
…I can pinpoint exactly which limit is capping you at 20 and recommend the fix (raising pool/queue concurrency, adding worker capacity, or adjusting Vertex quotas).