<@ULVA73B9P> I'm getting several errors like this ...
# ask-marvin
j
@Marvin I'm getting several errors like this one in our flows
Copy code
Exception in callback amaintain_concurrency_lease.<locals>.handle_lease_renewal_failure(<Task cancell...leases.py:17>>) at /app/.venv/lib/python3.12/site-packages/prefect/concurrency/_leases.py:109
handle: <Handle amaintain_concurrency_lease.<locals>.handle_lease_renewal_failure(<Task cancell...leases.py:17>>) at /app/.venv/lib/python3.12/site-packages/prefect/concurrency/_leases.py:109>
What could be the cause?
m
thought for 113.6 seconds
n
hi @José Agretti - can you share more of the stack trace?
m
Short answer: it’s usually benign. This log appears when the background “concurrency lease renewal” task is cancelled as your run finishes, and Python 3.12 raises a CancelledError that Prefect logs as an “Exception in callback…”. Your flow results aren’t affected.
Details:
- Prefect keeps concurrency “leases” alive in a background task while a run is active. When the run ends (or the process shuts down), Prefect cancels that renewal task.
- On Python 3.12, calling future.exception() on a cancelled future raises CancelledError, which produces the callback error you’re seeing.
- There’s an open report that matches your trace and version pattern: Future.CancelledError when flow finished.
When to worry:
- If you also see messages like “Concurrency lease renewal failed - slots are no longer reserved…”, that indicates an actual renewal failure (e.g., transient API/network issue) during the run. In that case, concurrency limits may be exceeded for the remainder of the run.
Concurrency API docs:
- Renew concurrency lease (Server API)
- Concurrency limits v2 Python client
What you can do:
- Make sure you’re on the latest 3.x (there have been concurrency/logging improvements; even if this specific noise isn’t called out, staying current helps). Full changelog: 3.4.19 → 3.4.20.
- If it’s just noisy logs, you can suppress them:
  - In code:
Copy code
import logging
logging.getLogger("prefect._internal.concurrency").setLevel(logging.CRITICAL)
  - Or via a logging config file and PREFECT_LOGGING_SETTINGS.
- If you suspect real renewal failures (not just end-of-run noise), check for those “lease renewal failed” warnings around the same timestamps and verify worker connectivity to the API (especially relevant if Vertex AI workers are scaling down or preempting pods).
Could you share:
- Your Prefect version and Python version (looks like 3.12 from the path)?
- Whether runs are actually failing or if it’s only this log at shutdown?
This will help confirm if it’s the benign cancellation case or something we should dig into (network/API, clock skew, etc.).
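The cancellation behavior described above can be reproduced with the standard library alone; this sketch is independent of Prefect and only illustrates why a done-callback that calls exception() logs an error when the future was cancelled:
Copy code
from concurrent.futures import CancelledError, Future

fut = Future()   # a pending future
fut.cancel()     # cancel it before it ever runs

try:
    fut.exception()  # the same call the renewal-failure callback makes
except CancelledError:
    print("exception() raises CancelledError on a cancelled future")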
j
Hi Nate, yes, let me see if I can get more details
Just for the sake of the answer @Marvin, I do see
Concurrency lease renewal failed - slots are no longer reserved. Execution will continue, but concurrency limits may be exceeded.
in the flow logs as well
m
thought for 198.2 seconds
Thanks for confirming - that message means it’s not just the benign shutdown noise. It indicates the background lease-renewal call actually failed during the run, so your run kept going but its concurrency slots were released, and subsequent work may exceed your configured limits.
Most common causes:
- Transient network/API errors between your runtime and the Prefect API (timeouts, DNS, NAT exhaustion, brief egress interruptions). This is common on ephemeral/cloud jobs like Vertex AI.
- Lease expired before it could be renewed (e.g., VM pause/preemption, long GC stop-the-world, heavy load delaying the renewal loop).
- Concurrency limit was deleted/renamed/reset while a lease was active.
- Auth issues during the run (rotated/expired API key) or API errors (401/403/5xx/429).
- Clock skew that makes the server think the lease TTL has passed.
References:
- GitHub issue matching your trace: Future.CancelledError when flow finished (also discusses the noisy Python 3.12 cancellation at shutdown)
- REST API: Renew concurrency lease: docs
- How-to: Global concurrency limits: guide
What I recommend:
1) Upgrade Prefect
- Make sure your workers/flow images are on the latest 3.x patch. There have been several concurrency/logging improvements. Changelog sample: 3.4.19 → 3.4.20.
2) Harden your Vertex AI runtime
- Ensure stable outbound internet/NAT. If you run many concurrent jobs, monitor for NAT port exhaustion or aggressive idle timeouts.
- Avoid preemptible instances for runs that must hold leases reliably.
- Keep system time in sync (ensure NTP is running) to avoid lease TTL drift.
3) Don’t modify limits during active runs
- Avoid deleting/renaming concurrency limits that have active holders.
4) If you want strict enforcement
- Have the run fail fast instead of continuing without a lease by using strict=True in the concurrency context. That way, if a renew fails, the run will terminate to prevent over-allocation.
- Async:
Copy code
from prefect import flow
from prefect.concurrency.asyncio import concurrency

@flow
async def my_flow():
    async with concurrency(names="my-limit", strict=True):
        ...
- Sync:
Copy code
from prefect import flow
from prefect.concurrency.sync import concurrency

@flow
def my_flow():
    with concurrency(names="my-limit", strict=True):
        ...
5) Look for the root exception
- Just before the “lease renewal failed…” log, there’s usually a more specific error (e.g., httpx timeout or 40x/50x). That will pinpoint whether you’re hitting auth, network, or server-side issues.
- You can temporarily raise verbosity for the concurrency logger to capture more detail:
Copy code
import logging
logging.getLogger("prefect.concurrency").setLevel(logging.DEBUG)
logging.getLogger("prefect._internal.concurrency").setLevel(logging.DEBUG)
If you can share:
- Your Prefect and Python versions, and whether Vertex AI jobs are preemptible
- A snippet of the logs around the first “lease renewal failed” message (including any httpx/HTTP errors)
I can help narrow this down further and suggest a targeted fix.
j
I can't find the specific logs of that exception, I'm seeing a lot of these in the logs
Copy code
File "/app/.venv/lib/python3.12/site-packages/prefect/_internal/concurrency/calls.py", line 215, in _invoke_callbacks
    callback(self)
  File "/app/.venv/lib/python3.12/site-packages/prefect/concurrency/_leases.py", line 64, in handle_lease_renewal_failure
    exc = future.exception()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 483, in exception
    raise CancelledError()
not sure if it's exactly the same
I just upgraded Prefect to 3.4.20 by the way, only the server for now, I'll upgrade the clients shortly
This is the message
Copy code
16:08:01.748 | ERROR   | MainThread   | prefect._internal.concurrency - exception calling callback for <Future at 0x7f1976ebec30 state=cancelled>"
And I upgraded Prefect because I saw this issue https://github.com/PrefectHQ/prefect/issues/18627
I upgraded the clients to 3.4.20, and now I got this error (only 1 for now)
Copy code
flow_name: "flow"
flow_run_id: "flow_id"
flow_run_name: "flow name"
funcName: "handle_lease_renewal_failure"
lineno: 123
logger: "prefect.flow_runs"
message: "Concurrency lease renewal failed - slots are no longer reserved. Terminating execution to prevent over-allocation."
module: "_leases"
timestamp: "2025-10-03 15:06:43,292"
And the previous log is this one
Copy code
funcName: "_send_single_request"
lineno: 1740
logger: "httpx"
message: "HTTP Request: POST <https://prefect-server-986334413485.us-central1.run.app/api/v2/concurrency_limits/leases/bb237253-830c-4bde-9b03-a1b62899856f/renew> "HTTP/1.1 404 Not Found""
module: "_client"
taskName: "Task-21"
timestamp: "2025-10-03 15:06:43,291"
The Prefect server is running in a Cloud Run service
Actually it just happened again, when I retried the flow
I disabled a global concurrency limit to see if we can get the flows running correctly for now
m
thought for 184.9 seconds
Thanks for the detail. In Prefect 3.x, that pair of logs means:
- The client tried to renew its concurrency lease
- The server returned 404 for that lease ID
- The client then aborted the run to avoid over-allocating slots: "Concurrency lease renewal failed - slots are no longer reserved. Terminating execution to prevent over-allocation."
A 404 on renew almost always means “the lease no longer exists on the server”. Common causes:
- The concurrency limit was deleted or reset while the run was still active (this immediately removes active leases).
- The lease expired before the client could renew (e.g., long pause, container hiccup, clock skew, or transient network delay).
- Server restart or cleanup raced with renewal.
- Known race: runs that have effectively finished can still hit a late renewal and get a 404; see this report for 3.4.17: Completed runs intermittently marked crashed on lease renewal failure.
Your note about disabling the global concurrency limit could explain the 404 for any already-running flows that were still trying to renew their leases. Once the limit is removed/reset, existing lease IDs will 404 and you’ll see exactly this behavior.
Questions to narrow down:
- Is your Prefect Server also on 3.4.20, or only the clients? Version skew can increase the likelihood of lease timing/race issues. If possible, keep server and clients aligned.
- Did you delete/reset the limit right around the timestamp of the error?
- Did the flow actually crash mid-run, or had it already finished most work and then crashed near the end? If it’s the latter, it may match the known race above.
Recommended actions:
- Keep server and clients on the same 3.4.x version. If you can, upgrade Server to match 3.4.20 (or 3.4.21+) and redeploy your Cloud Run service.
- Avoid modifying/deleting concurrency limits while flows using them are in-flight. Drain first if you need to make changes.
- Recreate and verify your limit:
Copy code
prefect concurrency-limit ls
prefect concurrency-limit inspect <your-tag>
- If you didn’t change the limit and still see this, check the server logs around that time for lease expiry/cleanup and network blips from your Vertex AI workers to the Cloud Run URL. Short-lived CPU throttling or pauses can cause missed renewals.
- If you see flows that finish successfully but get marked crashed due to this renewal race, that matches the linked issue. Upgrading to the latest 3.4.x may help; if it persists, we can dig in with logs.
References:
- REST endpoint the client is hitting: Renew concurrency lease
- Related bug report: Completed Flow runs intermittently marked as Crashed on lease renewal failure
If you can share:
- Your Server version and image tag deployed to Cloud Run
- Whether the limit was reset/deleted around 15:06:43
- Whether the run had mostly completed or truly failed mid-execution
…I can help confirm root cause and the best mitigation.
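If you want to confirm the 404 outside of a flow run, the renewal endpoint from the log above can be probed directly. A hedged sketch: the base URL and lease ID are placeholders standing in for the values from your log lines, auth headers and any request body the endpoint may expect are omitted, and note that a successful call would actually renew a live lease:
Copy code
import httpx

# Placeholders - substitute the values from the httpx log lines above
PREFECT_API_URL = "https://<your-prefect-server>/api"
lease_id = "<lease-id-from-the-404-log>"

resp = httpx.post(f"{PREFECT_API_URL}/v2/concurrency_limits/leases/{lease_id}/renew")
# 404 -> the server no longer knows about this lease (expired, reset, or deleted limit);
# other statuses point at auth, networking, or server-side problems instead.
print(resp.status_code, resp.text)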
j
@Marvin is there a way to force a run to start? I have flow runs stuck in AwaitingConcurrencySlot and they can't start. I disabled the concurrency limits
m
thought for 296.7 seconds
Short answer: yes. You can either clear any leftover leases so the server re-admits the runs, or you can force a run to Running and bypass orchestration.
First, make sure no limits are still gating you:
- Tag-based limits
Copy code
prefect concurrency-limit ls
prefect concurrency-limit inspect <tag>
prefect concurrency-limit reset <tag>   # clears stuck/ghost leases
- Global limits (since you disabled them, just verify)
Copy code
prefect global-concurrency-limit ls
- Work pool concurrency (often overlooked)
Copy code
prefect work-pool inspect <pool-name>
prefect work-pool clear-concurrency-limit <pool-name>
Force a specific flow run to start:
- Programmatically, you can override orchestration and move a run to Running:
Copy code
from prefect.client.orchestration import get_client
from prefect.states import Running
import anyio

async def force_start(flow_run_id: str):
    async with get_client() as client:
        await client.set_flow_run_state(flow_run_id=flow_run_id, state=Running(), force=True)

anyio.run(force_start, "<FLOW_RUN_ID>")
This is a blunt tool: it skips concurrency checks for that run.
If they’re still stuck in AwaitingConcurrencySlot after you’ve disabled limits:
- There may be lingering leases that didn’t get released cleanly. Use prefect concurrency-limit reset <tag> for any relevant tags; that immediately frees all slots.
- Optionally nudge the run by re-scheduling it:
Copy code
from datetime import datetime, timezone
from prefect.client.orchestration import get_client
from prefect.states import Scheduled
import anyio

async def reschedule(flow_run_id: str):
    async with get_client() as client:
        state = Scheduled(scheduled_time=datetime.now(timezone.utc))
        await client.set_flow_run_state(flow_run_id=flow_run_id, state=state, force=True)

anyio.run(reschedule, "<FLOW_RUN_ID>")
The server will re-evaluate concurrency on the next transition; with limits cleared, it should proceed to Running.
If you share:
- the tag(s) applied to the runs,
- your work pool name (Vertex AI) and whether it has a pool-level concurrency limit,
- and a sample flow_run_id,
…I can suggest the exact inspect/reset commands to clear the blockage.
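For reference, the runs stuck in AwaitingConcurrencySlot can also be listed programmatically before force-starting them. A minimal sketch using the client's flow-run filters (the filter classes live in prefect.client.schemas.filters; the state name string is assumed to match what the UI shows):
Copy code
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
import anyio

async def list_waiting_runs():
    async with get_client() as client:
        runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(any_=["AwaitingConcurrencySlot"])
                )
            )
        )
        for run in runs:
            print(run.id, run.name)

anyio.run(list_waiting_runs)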
j
@Marvin I removed a concurrency limit from the deployment, and the flows started to run. Are the deployment concurrency limits still valid? Because I see them as global concurrency limits now, and I already reset them previously, but the only way to make it work was to set the deployment to Unlimited
m
thought for 135.3 seconds
Short answer: yes, deployment-level concurrency limits are still valid in Prefect 3.x. What you’re seeing is expected: deployment limits are enforced via a managed “global concurrency limit” entry under the hood, so they show up on the Global Concurrency Limits page/CLI even though they only apply to that deployment.
Why resetting didn’t help earlier:
- The prefect concurrency-limit reset command only affects tag-based limits, not deployment-level limits.
- Deployment-level limits are surfaced as global concurrency limits; you manage them via the Global Concurrency Limits UI/CLI or by editing the deployment.
- When you set the deployment to Unlimited, you removed that managed limit, which is why your runs started.
How to manage these cleanly:
- If you want per-deployment control: set concurrency_limit on the deployment and avoid overlapping tag/global limits for the same workloads.
- If you prefer workspace-wide/global control: set the deployment to Unlimited and manage capacity solely with global or tag limits.
Useful commands:
- List global limits (you should see the deployment-managed entry):
Copy code
prefect global-concurrency-limit ls
- Disable or change the managed entry directly:
Copy code
prefect global-concurrency-limit disable <name-or-id>
prefect global-concurrency-limit update <name-or-id> --limit 5
- Verify no tag-based limits are also gating your runs:
Copy code
prefect concurrency-limit ls
Tip for stuck runs in AwaitingConcurrencySlot:
- After you update/disable the relevant limit, the server should admit runs automatically. If some remain stuck, you can nudge them by rescheduling or forcing Running as we discussed earlier.
If you share the deployment name (and the Global Concurrency Limits entries you see), I can point you to the exact entry to disable/update so you don’t need to keep the deployment at Unlimited unless that’s your goal.
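If you manage deployments from code, here is a hedged sketch of setting (or removing) the deployment-level limit at deploy time. The deployment, work pool, and image names are placeholders, and it assumes a 3.x release recent enough that .deploy() accepts concurrency_limit:
Copy code
from prefect import flow

@flow
def my_flow():
    ...

if __name__ == "__main__":
    # Placeholder names; passing concurrency_limit=None would be the code
    # equivalent of setting the deployment to Unlimited in the UI.
    my_flow.deploy(
        name="my-deployment",
        work_pool_name="my-vertex-pool",
        image="us-docker.pkg.dev/my-project/my-repo/my-image:latest",
        concurrency_limit=5,
    )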
j
@Marvin But when I disabled the Global concurrency limits, the flows didn't run; only when I set the deployment's Concurrency limit to Unlimited did things start to run again
m
thought for 45.4 seconds
You’re right to notice the difference. What you hit is layering:
- Deployment-level concurrency is its own gate. If a deployment has concurrency_limit set, that limit alone can keep runs in AwaitingConcurrencySlot even if:
  - All tag-based limits are reset (prefect concurrency-limit …)
  - Other Global Concurrency Limits are disabled
Why your runs only started when you set the deployment to Unlimited:
- Those runs were blocked by the deployment’s own concurrency limit (a deployment-scoped resource). Resetting tag limits and disabling unrelated global limits won’t touch it.
- In 3.x, deployment concurrency is enforced independently; it’s implemented using the same “global resource” mechanism under the hood, but it’s tied specifically to that deployment. If you don’t disable/update that specific deployment-managed entry (or set the deployment to Unlimited), the block remains.
How to verify and manage the right thing:
- Inspect the deployment to see its configured concurrency:
Copy code
prefect deployment inspect "<flow_name>/<deployment_name>"
If concurrency_limit is present, it’s active and will gate that deployment’s runs.
- Manage the actual entry that enforces it:
  - In the UI, open the deployment and adjust Concurrency Limit (set a higher number or Unlimited).
  - Or redeploy with the desired setting:
    - Keep it: set a numeric concurrency_limit
    - Remove it: set concurrency_limit=None (Unlimited)
- Be aware that:
  - prefect concurrency-limit * controls tag-based task limits only (won’t unblock deployment-level flow gating).
  - prefect global-concurrency-limit * controls workspace-level resource locks; you’d need to identify the specific entry tied to your deployment if you want to manage it there. Disabling other global entries won’t affect a deployment’s own limit.
  - Work pool concurrency can also gate starts:
Copy code
prefect work-pool inspect <pool-name>
prefect work-pool clear-concurrency-limit <pool-name>
If you’d like, share the deployment name and I can help pinpoint the exact managed entry that corresponds to it so you can adjust it without setting the deployment to Unlimited. For reference on global limits, see Global concurrency limits.
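To double-check what a given deployment currently enforces without the CLI, here is a small sketch using the orchestration client; it assumes the deployment response exposes the same concurrency_limit field that prefect deployment inspect prints (None meaning Unlimited):
Copy code
from prefect.client.orchestration import get_client
import anyio

async def show_deployment_limit(name: str):
    # name is "<flow_name>/<deployment_name>", as in the CLI inspect example above
    async with get_client() as client:
        deployment = await client.read_deployment_by_name(name)
        # Assumption: the deployment-level limit is exposed as `concurrency_limit`;
        # None means Unlimited.
        print(deployment.concurrency_limit)

anyio.run(show_deployment_limit, "<flow_name>/<deployment_name>")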