# ask-marvin
j
@Marvin sometimes when we run some flows, at the very end of everything when the process already finished, we get this Prefect error
12:15:56.924 | ERROR   | MainThread   | prefect._internal.concurrency - exception calling callback for <Future at 0x7f58ca4fc140 state=cancelled>
File "/app/.venv/lib/python3.12/site-packages/prefect/_internal/concurrency/calls.py", line 215, in _invoke_callbacks
    callback(self)
  File "/app/.venv/lib/python3.12/site-packages/prefect/concurrency/_leases.py", line 64, in handle_lease_renewal_failure
    exc = future.exception()
          ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/concurrent/futures/_base.py", line 483, in exception
    raise CancelledError()
Do you know what this is?
m
thought for 209.2 seconds
Yes: this is a benign log that can appear at shutdown. What’s happening:
- Prefect maintains a background “lease renewal” task when concurrency limits are in use (global or task-level).
- When your flow finishes, Prefect cancels that background renewal task.
- The callback that runs afterward calls future.exception(), but for a cancelled future that call raises CancelledError, which Prefect then logs as:
prefect._internal.concurrency - exception calling callback for <Future ... state=cancelled>
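You can reproduce the underlying standard-library behavior in isolation; a minimal sketch, nothing Prefect-specific here:

from concurrent.futures import CancelledError, Future

fut = Future()   # stands in for the lease-renewal future
fut.cancel()     # what happens to that future at flow shutdown

try:
    fut.exception()   # for a cancelled future this raises instead of returning
except CancelledError:
    print("exception() raised CancelledError, which the callback then logs")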
This does not affect your flow outcome; it’s just noisy. It’s been reported by others (often on Python 3.12/3.13) and we’re tracking it:
- Future.CancelledError when flow finished
- Exception in callback handle_lease_renewal_failure
How to reduce/suppress the noise:
- Upgrade to the latest Prefect 3.x; we’re improving shutdown behavior around concurrency leases.
- If it still shows up and you want to silence it, you can raise the level for that specific logger.
Option A: in code (early in your process):
import logging
logging.getLogger("prefect._internal.concurrency").setLevel(logging.CRITICAL)
Option B: via a logging config file (set PREFECT_LOGGING_SETTINGS_PATH to this YAML):
version: 1
disable_existing_loggers: false
formatters:
  simple:
    format: '%(asctime)s | %(levelname)s | %(name)s - %(message)s'
handlers:
  console:
    class: logging.StreamHandler
    stream: ext://sys.stdout
    formatter: simple
loggers:
  prefect._internal.concurrency:
    level: CRITICAL
    handlers: [console]
    propagate: false
root:
  level: INFO
  handlers: [console]
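To actually use that file, point the setting at it before Prefect starts; a minimal sketch, assuming the YAML above is shipped at a path of your choosing (the path below is a placeholder), and setting it as a plain environment variable in your deployment spec works just as well:

import os

# Placeholder path: wherever the YAML above lives inside your image or volume.
# Set this before Prefect configures logging (i.e., before any flow run starts).
os.environ["PREFECT_LOGGING_SETTINGS_PATH"] = "/app/config/prefect-logging.yml"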
A couple quick questions to confirm:
- Which Prefect version and Python version are you running?
- Are you using global/tag-based concurrency limits or task concurrency settings in these flows?
That will help me confirm if you’re hitting the known concurrency-lease shutdown path or if there’s something else at play.
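For reference, the lease-renewal machinery in that traceback is only exercised when a concurrency slot is actually acquired; an illustrative sketch, assuming Prefect 3.x and a global concurrency limit named "db" that you would have created on the server (the name and numbers are made up):

from prefect import flow, task
from prefect.concurrency.sync import concurrency

@task
def write_batch(batch: list[int]) -> None:
    # Occupy one slot of the hypothetical "db" limit; Prefect renews the lease
    # in the background until this block exits, and that renewal task is what
    # gets cancelled (and logged) at shutdown.
    with concurrency("db", occupy=1):
        print(f"writing {len(batch)} rows")

@flow
def etl():
    write_batch(list(range(100)))

if __name__ == "__main__":
    etl()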
j
@Marvin great, thanks for the references. And what about this error? It also happens once in a while
ERROR   | prefect.FlowRunCancellingObserver - Error consuming events
at the end of running a flow, followed by
File "/app/.venv/lib/python3.12/site-packages/prefect/runner/_observers.py", line 56, in __aexit__
    await self._consumer_task
  File "/app/.venv/lib/python3.12/site-packages/prefect/runner/_observers.py", line 27, in _consume_events
    async for event in self._events_subscriber:
  File "/app/.venv/lib/python3.12/site-packages/prefect/events/clients.py", line 642, in __anext__
    message = orjson.loads(await self._websocket.recv())
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.venv/lib/python3.12/site-packages/websockets/asyncio/connection.py", line 322, in recv
    raise self.protocol.close_exc from self.recv_exc
websockets.exceptions.ConnectionClosedError: no close frame received or sent
m
thought for 75.2 seconds
Short answer: it’s a harmless shutdown blip.
What it is:
- The FlowRunCancellingObserver listens on a WebSocket for cancellation-related events during your flow run.
- When the flow ends (or the container is shutting down), that WebSocket can be closed abruptly by the runtime/environment without a proper close frame. The consumer task then hits:
websockets.exceptions.ConnectionClosedError: no close frame received or sent
- You’ll see it logged as prefect.FlowRunCancellingObserver - Error consuming events with a traceback through _observers.py and events/clients.py.
This does not affect your flow result and typically occurs right at shutdown.
Why it happens:
- Normal end-of-process teardown closes the socket while the consumer is still awaiting a message.
- Network layers/load balancers/proxies (and managed runtimes like Vertex AI) sometimes drop connections without a close frame when the container exits or on idle timeouts.
- It’s intermittent; you’ll notice it more in short-lived containers or environments that aggressively terminate connections.
Mitigations:
- Upgrade to the latest Prefect 3.x; we’ve been improving shutdown handling around observers/events.
- If it only ever shows at the end and results are fine, it’s safe to ignore.
To silence the noise:
- In code (a more targeted filter-based variant is sketched at the end of this message):
import logging
logging.getLogger("prefect.FlowRunCancellingObserver").setLevel(logging.CRITICAL)
- Or via a logging config (set PREFECT_LOGGING_SETTINGS_PATH):
version: 1
disable_existing_loggers: false
handlers:
  console:
    class: logging.StreamHandler
    stream: ext://sys.stdout
loggers:
  prefect.FlowRunCancellingObserver:
    level: CRITICAL
    handlers: [console]
    propagate: false
root:
  level: INFO
  handlers: [console]
- If you see it mid-run (not just at the very end), check for:
  - Network/proxy timeouts or middleboxes closing WebSockets prematurely (raise idle/connection timeouts if you can).
  - Container runtime sending early SIGTERM/SIGKILL on job completion hooks.
  - Worker or flow run environment losing outbound connectivity.
References:
- Observer implementation: prefect/runner/_observers.py
- Events WebSocket client: prefect/events/clients.py
- WebSockets exception docs: websockets.exceptions.ConnectionClosedError
- Prefect logging customization: Logging customization
Quick checks I’d love to confirm:
- Prefect and Python versions you’re on.
- Whether you’re on Prefect Cloud or a self-hosted Server.
- Any proxies/load balancers in front of your workers (Vertex AI networking, sidecars, etc.).
If you share those, I can suggest tighter remediation if needed.
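If you’d rather not mute the whole logger, a more targeted option is a standard-library logging filter that drops only this one message; a minimal sketch (the substring match is an assumption based on the log line above):

import logging

class DropEventConsumerNoise(logging.Filter):
    """Suppress only the 'Error consuming events' records; keep everything else."""
    def filter(self, record: logging.LogRecord) -> bool:
        # Returning False drops the record; True lets it through unchanged.
        return "Error consuming events" not in record.getMessage()

# Attach early in your process, before flows run.
logging.getLogger("prefect.FlowRunCancellingObserver").addFilter(DropEventConsumerNoise())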
j
@Marvin I'm running a self-hosted Prefect server in Cloud Run, and the p95 request latency metric is 1.24 minutes. That sounds high, and I'm not sure why. I'm running it only as the Prefect API, with the --no-services flag.
m
thought for 226.1 seconds
Short answer: that p95 is almost certainly inflated by long-lived event/websocket subscriptions, not “slow API” handlers.
Why this happens on Cloud Run:
- Prefect API uses WebSockets for event subscriptions (e.g., flow/task run observers, UI live updates). Those connections can stay open for minutes.
- Cloud Run measures request latency from request start until the response completes. For long-lived streaming (SSE) and sometimes WebSocket upgrade/holding, Cloud Run can attribute the full open duration to request latency. That pushes p95 into multi-minute territory even when normal REST endpoints are fast.
- You’re running API-only with --no-services, which is fine; services don’t affect per-request latency. The event subscriptions are part of the API process.
How to confirm:
- In Cloud Logging, look for requests with status=101 (Switching Protocols) for WebSocket upgrades, and/or event/subscription paths that stay open with status=200 for a long time. You’ll likely see many minutes-long requests corresponding to event subscriptions.
- Create a log-based metric or dashboard that excludes status=101 and the event subscription paths, then look at p95; it should drop to a “normal” value. (A filter sketch is at the end of this message.)
What to do:
- Keep those long-lived connections; they’re expected. Instead, build a “real API latency” view by excluding WebSocket/SSE traffic (status=101 or event subscription endpoints) and any streaming endpoints.
- If you need lower contention:
  - Set min instances > 0 and CPU “always allocated” on Cloud Run to reduce cold starts and CPU throttling.
  - Tune container concurrency to a sane value (e.g., 20–40) so long-lived sockets don’t starve CPU for regular REST calls.
  - Ensure your DB is healthy: use a connection pool (Cloud SQL Connector/PgBouncer if on Cloud SQL) and monitor slow queries. If you observe genuinely slow REST calls, enable Postgres slow query logs and check indexes on the flow_runs/task_runs/events tables.
Useful references:
- Cloud Run logging/monitoring overview: Cloud Run monitoring/logging
- Prefect self-hosting: Self-hosted Prefect
- Prefect Server concepts: Server concepts
If you share:
- which specific Cloud Run metric you’re using (and any filters),
- a sample of high-latency log entries (status/path),
- your DB backend and connection method,
I can help you craft the exact Logging filter and a log-based metric to track “true” API latency excluding event streams.
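To make the confirmation step above concrete, here is a hedged sketch that pulls the long-lived WebSocket upgrade requests out of Cloud Logging with the google-cloud-logging client; the project ID and service name are placeholders, and the http_request field names follow Cloud Run's request log format:

from google.cloud import logging as gcl

# Cloud Logging filter: request logs for the Prefect API service where the
# response was a WebSocket upgrade (101 Switching Protocols).
FILTER = '''
resource.type="cloud_run_revision"
resource.labels.service_name="prefect-server"
httpRequest.status=101
'''

client = gcl.Client(project="my-project")
for entry in client.list_entries(filter_=FILTER, order_by=gcl.DESCENDING, page_size=20):
    req = entry.http_request or {}
    # Long latencies here are open event subscriptions, not slow REST handlers;
    # excluding these entries from a log-based metric gives a truer API p95.
    print(entry.timestamp, req.get("latency"), req.get("requestUrl"))

The same FILTER text can be pasted into the Logs Explorer, and inverting the status condition (e.g., httpRequest.status!=101) is a starting point for a log-based metric that tracks only the regular REST calls.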