<@ULVA73B9P> we are getting `asyncio.exceptions.Ti...
# ask-marvin
p
@Marvin we are getting
asyncio.exceptions.TimeoutError: timed out during opening handshake
when starting multiple deployments from a flow:
Copy code
2025-09-29T15:56:43.043+03:00 12:56:43.042 | DEBUG   | prefect.events.clients - Reconnecting...
12:56:43.042 | DEBUG | prefect.events.clients - Reconnecting...
2025-09-29T15:56:53.056+03:00 Traceback (most recent call last):
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/client.py", line 541, in __await_impl__
2025-09-29T15:56:53.056+03:00 self.connection = await self.create_connection()
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/client.py", line 467, in create_connection
2025-09-29T15:56:53.056+03:00 _, connection = await loop.create_connection(factory, **kwargs)
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/asyncio/base_events.py", line 1050, in create_connection
2025-09-29T15:56:53.056+03:00 sock = await self._connect_sock(
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/asyncio/base_events.py", line 961, in _connect_sock
2025-09-29T15:56:53.056+03:00 await self.sock_connect(sock, address)
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/asyncio/selector_events.py", line 500, in sock_connect
2025-09-29T15:56:53.056+03:00 return await fut
2025-09-29T15:56:53.056+03:00 asyncio.exceptions.CancelledError
2025-09-29T15:56:53.056+03:00 During handling of the above exception, another exception occurred:
2025-09-29T15:56:53.056+03:00 Traceback (most recent call last):
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/client.py", line 574, in __await_impl__
2025-09-29T15:56:53.056+03:00 raise SecurityError(f"more than {MAX_REDIRECTS} redirects")
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/async_timeout.py", line 181, in __aexit__
2025-09-29T15:56:53.056+03:00 self._do_exit(exc_type)
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/async_timeout.py", line 268, in _do_exit
2025-09-29T15:56:53.056+03:00 raise asyncio.TimeoutError
m
Your message was too long, here's your message at the allowed limit: we are getting asyncio.exceptions.TimeoutError: timed out during opening handshake when starting multiple deployments from a flow …
thought for 299.7 seconds
p
this is the second part:
Copy code
2025-09-29T15:56:53.056+03:00 asyncio.exceptions.TimeoutError
2025-09-29T15:56:53.056+03:00 The above exception was the direct cause of the following exception:
2025-09-29T15:56:53.056+03:00 Traceback (most recent call last):
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/cli/_utilities.py", line 44, in wrapper
2025-09-29T15:56:53.056+03:00 return fn(*args, **kwargs)
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/cli/_types.py", line 156, in sync_fn
2025-09-29T15:56:53.056+03:00 return asyncio.run(async_fn(*args, **kwargs))
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/asyncio/runners.py", line 44, in run
2025-09-29T15:56:53.056+03:00 return loop.run_until_complete(main)
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/asyncio/base_events.py", line 647, in run_until_complete
2025-09-29T15:56:53.056+03:00 return future.result()
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/cli/flow_run.py", line 412, in execute
2025-09-29T15:56:53.056+03:00 await runner.execute_flow_run(id)
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/runner/runner.py", line 571, in execute_flow_run
2025-09-29T15:56:53.056+03:00 async with context:
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/runner/runner.py", line 1499, in __aenter__
2025-09-29T15:56:53.056+03:00 await self._exit_stack.enter_async_context(
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/contextlib.py", line 575, in enter_async_context
2025-09-29T15:56:53.056+03:00 result = await _cm_type.__aenter__(cm)
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/runner/_observers.py", line 42, in __aenter__
2025-09-29T15:56:53.056+03:00 self._events_subscriber = await self._exit_stack.enter_async_context(
2025-09-29T15:56:53.056+03:00 File "/usr/local/lib/python3.9/contextlib.py", line 575, in enter_async_context
2025-09-29T15:56:53.056+03:00 result = await _cm_type.__aenter__(cm)
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/events/clients.py", line 555, in __aenter__
2025-09-29T15:56:53.056+03:00 await self._reconnect()
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/events/clients.py", line 566, in _reconnect
2025-09-29T15:56:53.056+03:00 self._websocket = await self._connect.__aenter__()
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/client.py", line 587, in __aenter__
2025-09-29T15:56:53.056+03:00 return await self
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/prefect/_internal/websockets.py", line 135, in _proxy_connect
2025-09-29T15:56:53.056+03:00 proto = await self.__await_impl__()
2025-09-29T15:56:53.056+03:00 File "/app/venv_deps/lib/python3.9/site-packages/websockets/asyncio/client.py", line 578, in __await_impl__
2025-09-29T15:56:53.056+03:00 raise TimeoutError("timed out during opening handshake") from exc
2025-09-29T15:56:53.056+03:00 asyncio.exceptions.TimeoutError: timed out during opening handshake
2025-09-29T15:56:53.057+03:00 An exception occurred.
m
Thanks for the stack trace — that’s the Prefect events WebSocket retrying and timing out during the opening handshake. A few quick notes:
- This is about the events channel (used for emitting/observing events, automations, etc.), not the core HTTP API. Your flow runs and deployments generally still work; the client will keep retrying and log “Reconnecting…”.
- The error is typically caused by network conditions (blocked/slow websockets, proxies, NAT saturation) or too many concurrent connections at once.
A few checks and mitigations you can try:
1) Confirm your API URL and network allow websockets
- Make sure `PREFECT_API_URL` includes the scheme and is correct (e.g. `https://api.prefect.cloud/api/accounts/<account_id>/workspaces/<ws_id>`).
- If you’re behind a proxy or egress filter, verify that outbound WSS (port 443) to Prefect’s API is allowed. Some corporate proxies/firewalls block websocket upgrades, which will cause opening-handshake timeouts.
- If you have `HTTPS_PROXY`/`HTTP_PROXY` env vars set, try a run without them to see if the proxy is interfering with WSS.
- Docs: Troubleshoot Prefect Cloud, Settings reference
2) Reduce the number of simultaneous connections
If you’re starting many deployments at once from a parent flow, bursty fan-out can cause lots of concurrent WebSocket dials. Try throttling to a small concurrency (e.g., 5) or running sequentially. Example — launch multiple deployments with a small concurrency:
Copy code
import asyncio

from prefect import flow
from prefect.deployments import run_deployment

@flow
async def launch_many(names: list[str]):
    sem = asyncio.Semaphore(5)  # limit concurrency

    async def launch_one(name: str):
        async with sem:
            # run_deployment waits for the child run to finish by default;
            # if you don't need to wait for each child run, omit waiting logic
            return await run_deployment(name=name)

    # schedule with throttling
    return await asyncio.gather(*[launch_one(n) for n in names])

# names = ["flow-a/dev", "flow-b/dev", ...]
# asyncio.run(launch_many(names))
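If the parent doesn’t need to wait on each child at all, a fire-and-forget variant keeps the burst even smaller. A minimal sketch, assuming `run_deployment`’s `timeout` parameter (passing `timeout=0` returns the created flow run without waiting; double-check the signature on your Prefect version):
Copy code
from prefect import flow
from prefect.deployments import run_deployment

@flow
async def trigger_many(names: list[str]):
    created = []
    for name in names:
        # timeout=0 should return as soon as the child flow run is created,
        # so the parent never holds a long-lived wait per child
        created.append(await run_deployment(name=name, timeout=0))
    return created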
API ref for `run_deployment`: run_deployment
3) Turn down the noise (if runs are otherwise fine)
- Set `PREFECT_LOGGING_LEVEL=INFO` (or WARN) to hide the debug “Reconnecting…” spam.
- You can inspect your current settings with:
Copy code
prefect config view --show-defaults
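You can also do the same check from Python inside the container, which is handy when the CLI isn’t on the PATH. A rough sketch (if the `PREFECT_API_URL` setting object isn’t importable in your version, the CLI command above covers the same ground):
Copy code
import os

from prefect.settings import PREFECT_API_URL

# confirm the runtime sees the API URL (scheme included) you expect
print("PREFECT_API_URL:", PREFECT_API_URL.value())
# proxy variables here can intercept the WSS upgrade; try a run with them unset
print("HTTP_PROXY:", os.environ.get("HTTP_PROXY"))
print("HTTPS_PROXY:", os.environ.get("HTTPS_PROXY"))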
4) Environment specifics that commonly trigger this
- Running inside containers behind a corporate proxy that does not support WebSocket upgrade
- Tight egress rules that only allow plain HTTPS but not WS/WSS
- Very large fan-out (dozens/hundreds of concurrent child deployments) causing transient connection pressure
A few questions to help pinpoint the root cause:
- Are you on Prefect Cloud or self-hosted Prefect Server?
- Which Prefect version are you running?
- Roughly how many deployments do you start concurrently from the flow?
- Are you behind a proxy or strict firewall/NAT?
- Do the deployments themselves succeed and just log these timeouts, or are any runs failing?
If you share those details (and a minimal snippet of how you’re launching the deployments), I can suggest the most targeted fix.
p
@Marvin We are running self-hosted Prefect Server, version 3.4.8. We don't see this happening on all the deployments, just some of them, and the flow fails completely when it does. What could be the bottleneck here, and how can I increase the number of concurrent deployments without hitting this error?
m
thought for 133.9 seconds
Thanks for the details — since you’re on self‑hosted Prefect Server 3.4.8 and only some runs fail with an “opening handshake” timeout, the most common bottleneck is the events WebSocket path through your ingress/proxy layer when you create a burst of connections.
What’s happening
- Each flow process connects a WebSocket for events (used for live UI updates, automations, etc.). When you fan out many deployments at once, you create a burst of concurrent WS upgrade requests. If your ingress/proxy isn’t fully configured for WebSockets or has tight limits/timeouts, some connections will stall, reach the client’s handshake timeout, and raise — which can bubble up and crash your parent flow.
- This is not the core HTTP API; it’s the events client: you’ll see logs like “prefect.events.clients - Reconnecting…”.
Where the bottleneck usually is (self‑hosted)
- Reverse proxy/ingress not forwarding WS Upgrade headers or using short timeouts
- Connection/worker limits on the proxy or API (e.g., too few workers, low max connections)
- Service mesh or corporate proxy interfering with WS
- High fan‑out causing a transient spike of simultaneous connections
Recommended fixes and mitigations
A) Verify your ingress/proxy supports WebSockets and increase timeouts
- NGINX (typical config): ensure these are set on the route(s) to your API:
Copy code
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_read_timeout 600s;
proxy_send_timeout 600s;
- Traefik/ALB/Istio
  - Confirm WS upgrade is allowed and idle/read timeouts are >= 60–300s
  - For ALB, ensure HTTP/1.1 to the target and increase idle timeout (> 60s)
  - If you terminate TLS upstream, ensure the upstream hop is still HTTP/1.1 (pure HTTP/2 upstream can break WS upgrades)
B) Scale API capacity and connection limits
- Increase API replicas (if running the API behind a load balancer)
- Ensure sufficient CPU/memory for the API pod/container
- Raise proxy/server connection limits (e.g., NGINX worker_connections, ulimit -n on hosts)
C) Reduce client‑side burstiness and add retries
Throttle the number of simultaneous `run_deployment` calls and add a retry around them. This both eases the WS handshake burst and makes the orchestration robust if one WS dial transiently fails. Example limiting concurrency to 5 and turning off subflow semantics (to reduce coupling) with retries:
Copy code
import asyncio

from prefect import flow
from prefect.deployments import run_deployment
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=0.5, max=10), reraise=True)
async def run_one(name, params=None):
    # as_subflow=False reduces “child as subflow” linkage; often lowers extra coupling
    return await run_deployment(name=name, parameters=params, as_subflow=False)

@flow
async def launch_many(names: list[str]):
    sem = asyncio.Semaphore(5)  # tune this upward once ingress is fixed

    async def launch(name: str):
        async with sem:
            try:
                return await run_one(name)
            except Exception:
                # log and re-raise if you want the parent flow to fail,
                # or swallow here if you prefer partial success
                raise

    return await asyncio.gather(*[launch(n) for n in names])
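If partial success is acceptable (say 45 of 50 children launch and a couple of handshakes fail), `asyncio.gather(..., return_exceptions=True)` keeps one failed dial from failing the whole parent. A small sketch of that pattern (the helper name is illustrative, not a Prefect API):
Copy code
import asyncio

async def gather_with_partial_success(coros):
    # run everything; exceptions come back as values instead of raising immediately
    results = await asyncio.gather(*coros, return_exceptions=True)
    failures = [r for r in results if isinstance(r, BaseException)]
    successes = [r for r in results if not isinstance(r, BaseException)]
    # decide afterwards: log the failures and continue, or raise to fail the parent
    return successes, failures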
API reference: run_deployment
D) If you don’t need event streaming, disable it (server‑side)
- On the server, set `PREFECT_API_EVENTS_STREAM_OUT_ENABLED=false` to stop serving the events stream. This disables live event streaming/automations from the server; the UI will still show state via polling, but live streams won’t be available.
- Settings reference: Settings reference
- Note: do this only if you’re comfortable without live event stream/automations during high‑concurrency bursts.
E) Make sure your API URL and network are clean
- `PREFECT_API_URL` should include the scheme and be reachable (http/https)
- Avoid proxies that block WS upgrades; unset `HTTP(S)_PROXY` for the Prefect processes if necessary
- You can inspect effective settings with:
Copy code
prefect config view --show-defaults
Next step — a bit of context will let me be more specific:
- How many deployments are you launching at peak (and from how many processes/pods)?
- What ingress/proxy are you using (NGINX, Traefik, ALB, Istio, etc.)?
- Are the failures clustered at high concurrency or seemingly random?
- If you have NGINX/ingress config, can you share the WS-related snippet?
With those details, I can give exact ingress settings and a safe concurrency target to start with, then we can tune up from there.
p
@Marvin I'm running from 10 to 50 deployments and ~10% of those usually fail on the TimeoutError
m
thought for 59.9 seconds
Got it — thanks for the numbers. With 10–50 concurrent `run_deployment` calls and ~10% failing, this almost always points to your self-hosted API/ingress not handling bursts of WebSocket upgrades for the events channel.
What’s likely the bottleneck
- Your ingress/proxy (NGINX/Traefik/ALB/etc.) is either:
  - Not fully configured for WebSocket upgrades (Upgrade/Connection headers)
  - Using short read/connect/idle timeouts
  - Hitting connection/worker limits under burst
- The Prefect API instance(s) may not have enough capacity for the spike.
- The client opens/maintains an events WebSocket; under load, some handshakes time out and bubble up, failing the flow.
How to raise concurrency without the timeouts
Server/ingress hardening (most important)
- Ensure WS upgrade is enabled and timeouts are high enough. For NGINX, add/verify:
Copy code
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_connect_timeout 60s;
proxy_read_timeout 600s;
proxy_send_timeout 600s;
keepalive_timeout 75s;
- If behind a load balancer/service mesh:
  - ALB/NLB: idle timeout ≥ 60s; ensure HTTP/1.1 on the upstream target group.
  - Traefik/Istio: enable WS, increase read/idle timeouts.
- Increase capacity:
  - Scale API replicas (multiple server pods/containers behind LB).
  - Raise proxy connection limits (e.g., NGINX `worker_connections`), and ensure sufficient CPU/memory on API pods.
- Optional but helpful: enable sticky sessions (some proxies handle a flood of WS better when the connection stays on one upstream).
Client-side throttling and resiliency
- Throttle the fan-out and add retries. Start with concurrency 10–20 while you adjust ingress, then scale up.
- Set `as_subflow=False` to decouple child runs from the parent flow’s state.
- Wrap `run_deployment` in retries with exponential backoff. Example:
Copy code
import asyncio

from prefect import flow
from prefect.deployments import run_deployment
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=0.5, max=10), reraise=True)
async def run_one(name, params=None):
    return await run_deployment(name=name, parameters=params, as_subflow=False)

@flow
async def launch_many(names: list[str]):
    sem = asyncio.Semaphore(15)  # tune up after ingress fixes (e.g., 20, 30, ...)

    async def launch(name: str):
        async with sem:
            return await run_one(name)

    return await asyncio.gather(*[launch(n) for n in names])
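On top of the semaphore, staggering the dials slightly also helps: the ingress then sees a trickle of upgrade requests rather than one spike. A minimal sketch (the 0–2 s jitter window is an arbitrary starting point, not a Prefect recommendation):
Copy code
import asyncio
import random

async def launch_staggered(launch, names, max_jitter: float = 2.0):
    # a small random delay per child spreads the WebSocket upgrades over time
    async def delayed(name):
        await asyncio.sleep(random.uniform(0, max_jitter))
        return await launch(name)

    return await asyncio.gather(*[delayed(n) for n in names])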
Upgrade to latest 3.4.x
- You’re on 3.4.8. We’ve shipped stability improvements across 3.4.x. It’s worth upgrading to the latest 3.4.x to pick up resiliency fixes.
- Release notes: 3.4.20 changelog
Optional: temporarily disable events streaming (server)
- If you need an immediate stopgap and can live without live event streams, set this on the server to reduce WS traffic:
Copy code
PREFECT_API_EVENTS_STREAM_OUT_ENABLED=false
- Trade-off: live event streaming/automations are affected; UI continues to work via polling. See settings: Settings reference.
Sanity test for your ingress
- From a pod/host in the same network, try a direct WS connect to your API:
Copy code
# events out stream (read)
wscat -c wss://<your-api-host>/events/out
# events in stream (emit)
wscat -c wss://<your-api-host>/events/in
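If `wscat` isn’t available in the container, roughly the same handshake check can be done with the `websockets` library that’s already in the environment (it appears in your traceback). A sketch; treat the URL as a placeholder, since the path and any auth requirements depend on your ingress setup:
Copy code
import asyncio
import websockets

async def check_handshake(url: str) -> None:
    # exercises only the opening handshake, the step that's timing out above
    async with websockets.connect(url, open_timeout=10):
        print(f"handshake ok: {url}")

asyncio.run(check_handshake("wss://<your-api-host>/events/out"))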
If these hang or fail intermittently under load, it confirms the ingress/LB path.
Helpful references
- `run_deployment` API: run_deployment
- Settings reference: Prefect 3 settings
If you can share:
- Which ingress/LB you’re using (NGINX, Traefik, ALB, Istio, etc.)
- Any relevant ingress config for the Prefect API route
- How you’ve deployed Prefect Server (K8s/Helm, Docker Compose, etc.)
…I can give exact config snippets and a recommended concurrency target to start with (e.g., 20 → 30 → 50) while we ensure the WS path scales cleanly.