Based on the research and existing issues, I can explain why you're experiencing crashes instead of proper queuing. There are a few potential causes:
1. Zombie Flow Runs: If a worker process is terminated abruptly, flow runs can become "stuck" in a Running state, consuming concurrency slots without doing any work. New flow runs then cannot acquire a slot, which can surface as crashes rather than queuing.
2. Infrastructure vs. Application-Level Concurrency: While work queue concurrency limits should queue flows, they operate at the application level. If your infrastructure (such as Kubernetes or your execution environment) has stricter resource limits, flows may crash before the work queue concurrency mechanism can handle them.
3. Concurrency Slot Release Issues: There is a known issue where concurrency slots may not be released properly when flows are cancelled or terminated unexpectedly, especially in containerized environments.
Here are some recommendations:
1. Implement a Zombie Killer: Set up an automation that detects and handles "zombie" flow runs stuck in a Running state. This is the official recommendation and is documented at https://docs.prefect.io/v3/automate/events/automations-triggers#detect-and-respond-to-zombie-flows. A polling-based alternative is sketched after this list.
2. Graceful Shutdown Handling: Make sure your flows have proper error handling and cleanup, especially in containerized environments where the orchestrator can send SIGTERM at any time (see the shutdown sketch below).
3. Monitor Concurrency Usage: Track how many slots are actually in use versus what you expect. You may have "stuck" slots that need to be cleared manually (see the monitoring sketch below).
To better diagnose your specific situation, it would be helpful to know:
- What infrastructure are you running on (Kubernetes, local machine, etc.)?
- What is your concurrency limit set to?
- Are you seeing any specific error messages when the flows crash?
- Are you using any custom task runners or execution configurations?