# ask-marvin
t
@Marvin several of our deployments have suddenly started cancelling all future scheduled flows. The logs only say the following:

```
Worker 'KubernetesWorker 1109beca-3a04-43fa-b6ef-83c7e01da87e' submitting flow run '06889a14-556c-7732-8000-03abe289257b'
08:49:53 AM Info Running on worker id: 3a1da522-6a89-4bdf-9d40-a815bb4450ef. See worker logs here: https://app.prefect.cloud/account/f4b6eb8f-f4d5-47a2-8cea-c7a5f1eca634/workspace/f5a40bc8-5409-48aa-a8df-3ad0a1baf84b/work-pools/work-pool/low-resource-workpool/worker/3a1da522-6a89-4bdf-9d40-a815bb4450ef
08:49:53 AM Info Aborted submission of flow run '06889a14-556c-7732-8000-03abe289257b': Server returned a non-pending state 'CANCELLED'
```

I am not sure why, and I cannot see much in the worker logs either (i.e. no errors/warnings). It started yesterday around 8 AM.
m
thought for 46.7 seconds
Based on the research and the symptoms you're describing, there are several potential causes we should investigate:

1. Worker Race Conditions: Are you running multiple workers for the same work pool? If yes, there might be race conditions where multiple workers try to pick up the same flow run.
2. Infrastructure Issues: Since you're using Kubernetes workers:
   - Check if there have been any changes to your Kubernetes cluster configuration
   - Verify if there are any resource constraints or quota issues
   - Check if the Kubernetes pods are being terminated prematurely
3. Worker Configuration:
   - Check if there were any changes to your worker or work pool configuration around yesterday 8 AM
   - Verify the worker's permissions and access to the necessary resources

To help diagnose this, could you please provide:

1. How many workers do you have running for this work pool?
2. Have there been any recent changes to your infrastructure or deployment configurations?
3. Can you check the Kubernetes events (using `kubectl get events`) around the time when this started happening?

Also, I recommend checking:

1. The Prefect Cloud UI for any work pool configuration changes
2. Kubernetes pod events and logs for the affected deployments
3. Resource utilization metrics around the time this started happening

This behavior starting suddenly at a specific time (yesterday at 8 AM) suggests there was a change or event that triggered it, so looking at your infrastructure logs and events around that specific time would be particularly valuable. Would you be able to provide any of this additional information to help narrow down the cause?
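One way to act on the Kubernetes-events suggestion above: export the events with `kubectl get events -o json` and filter them down to the incident window. Below is a minimal pure-Python sketch of that filtering step; the sample events are made-up stand-ins for the real export, and the timestamps and reasons are illustrative assumptions.

```python
from datetime import datetime, timezone

# Hypothetical sample shaped like the "items" of `kubectl get events -o json`;
# in practice, load the real export with json.load() instead.
sample = {
    "items": [
        {"lastTimestamp": "2025-07-30T07:55:00Z", "reason": "Scheduled", "message": "pod scheduled"},
        {"lastTimestamp": "2025-07-30T08:01:30Z", "reason": "Killing", "message": "stopping container"},
        {"lastTimestamp": "2025-07-30T09:30:00Z", "reason": "Pulled", "message": "image pulled"},
    ]
}

def events_in_window(events: dict, start: datetime, end: datetime) -> list:
    """Return events whose lastTimestamp falls inside [start, end]."""
    hits = []
    for item in events["items"]:
        # Normalize the trailing 'Z' so fromisoformat() accepts it.
        ts = datetime.fromisoformat(item["lastTimestamp"].replace("Z", "+00:00"))
        if start <= ts <= end:
            hits.append(item)
    return hits

# Window around the reported ~8 AM onset (assumed UTC here).
window_start = datetime(2025, 7, 30, 7, 45, tzinfo=timezone.utc)
window_end = datetime(2025, 7, 30, 8, 30, tzinfo=timezone.utc)
suspicious = events_in_window(sample, window_start, window_end)
for e in suspicious:
    print(e["lastTimestamp"], e["reason"], e["message"])
```

With the sample data, only the two events inside the 07:45-08:30 window are kept; against a real export, events with reasons like `Killing` or `OOMKilling` near the onset time would be the ones to investigate.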
t
@Marvin I can see the cancel message is:
```
"prefect.resource.id": "prefect.flow-run.06889d3d-4863-79bd-8000-d19ee1fe2a40",
    "prefect.resource.name": "imposing-centipede",
    "prefect.run-count": "0",
    "prefect.state-message": "Deployment concurrency limit reached.",
    "prefect.state-name": "Cancelled",
    "prefect.state-timestamp": "2025-07-30T08:12:09.872441+00:00",
    "prefect.state-type": "CANCELLED"
```
But there are no flows running for this deployment, so why are they still being cancelled?
m
thought for 111.0 seconds
1. Check the Deployment's Concurrency Limit: The state message "Deployment concurrency limit reached." indicates the server is cancelling runs because it believes the limit is already consumed. You can check your deployment's current settings using:

```bash
prefect deployment inspect "your-deployment-name"
```

2. Immediate Solutions:
   - Try pausing and then resuming the deployment to reset its state
   - If possible, temporarily increase the concurrency limit to see if it resolves the issue
   - Check if there are any "zombie" flow runs stuck in a transitional state

3. Root Cause Investigation:
   - Look for any flow runs that might have terminated abnormally (like pod OOMKills or sudden terminations) around yesterday 8 AM
   - Check if there were any deployment updates or changes around that time
   - Verify if there are any flow runs marked as "Running" in the UI that aren't actually running

Could you:

1. Share what concurrency limit is set for this deployment (if any)?
2. Check if there are any flow runs stuck in "Running" state in the UI?
3. Confirm if this is happening to all deployments or just specific ones?

This issue typically requires some cleanup of the concurrency tracking state. If you need to reset it completely, you might need to:

1. Pause the deployment
2. Cancel any pending runs
3. Wait a few minutes
4. Resume the deployment

Would you like me to provide more specific steps for any of these approaches?
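To hunt for the "zombie" runs mentioned above programmatically, you can query Prefect's REST API (`POST /flow_runs/filter`) for runs the server still considers `RUNNING` for this deployment. The sketch below only builds the request body; the filter field names follow my understanding of the Prefect 2.x REST API, and the deployment id is a placeholder, so verify both against the API reference before relying on it.

```python
import json

# Placeholder id; substitute the affected deployment's real UUID.
DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"

def stuck_run_filter(deployment_id: str) -> dict:
    """Build a flow-run filter body selecting runs in RUNNING state
    for one deployment (assumed Prefect 2.x filter schema)."""
    return {
        "deployments": {"id": {"any_": [deployment_id]}},
        "flow_runs": {"state": {"type": {"any_": ["RUNNING"]}}},
    }

body = stuck_run_filter(DEPLOYMENT_ID)
print(json.dumps(body, indent=2))
```

POSTing this body to your workspace's `/flow_runs/filter` endpoint (with your API key in the `Authorization` header) should list any runs still holding a concurrency slot; cancelling or deleting those should free the deployment to schedule again.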