@Marvin In the middle of my prefect flow I ge...
# ask-marvin
a
@Marvin In the middle of my prefect flow I get `Unable to connect to 'wss://...'. Please check your network settings to ensure websocket connections to the API are allowed. Otherwise event data (including task run data) may be lost. Reason: timed out during opening handshake. Set PREFECT_DEBUG_MODE=1 to see the full error.`
m
thought for 149.2 seconds
That warning means the runtime couldn't open its websocket back to the Prefect API. Your flow keeps running, but real-time event data (states/logs) may be delayed or dropped until it reconnects.

What usually causes it
- Corporate/firewall/proxy rules blocking websocket (wss/ws) traffic
- A misconfigured reverse proxy/load balancer not passing `Upgrade`/`Connection: upgrade`
- TLS interception/SSL inspection or certificate issues
- Idle/read timeouts that are too short for long-lived websockets
- A wrong `PREFECT_API_URL` (the ws URL is derived from it: https→wss, http→ws)

First checks
- Set `PREFECT_DEBUG_MODE=1` and re-run to see the underlying error (TLS, 403, proxy reset, etc.)
- Confirm the exact URL in the warning is reachable from that environment. If you can, test with a tool like `wscat` from the same network.
- Verify `PREFECT_API_URL` is correct and reachable over HTTPS if you expect `wss://` (a quick way to see the derived websocket URL is sketched below).
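As a quick sanity check, you can reproduce that derivation and eyeball exactly which websocket URL the client will try. A minimal sketch, assuming `PREFECT_API_URL` is set and ends with `/api` (the `/events/in` suffix matches the events endpoint these warnings mention):

```python
# Minimal sketch: derive the expected events websocket URL from PREFECT_API_URL.
# Assumes PREFECT_API_URL ends with /api; the /events/in suffix is illustrative,
# matching the endpoint named in the connection warnings.
import os

api_url = os.environ.get("PREFECT_API_URL", "").rstrip("/")
ws_url = api_url.replace("https://", "wss://").replace("http://", "ws://") + "/events/in"
print(f"API URL:         {api_url or '<not set>'}")
print(f"Expected ws URL: {ws_url}")
```
If the printed host or scheme isn't what you expect, fix `PREFECT_API_URL` before chasing proxy settings.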
If you're on Prefect Cloud
- Ensure outbound egress allows `wss` to `api.prefect.cloud:443` without TLS interception.
- If you must use a proxy, it must support the HTTP/1.1 Upgrade handshake for websockets.

If you're self-hosting behind a reverse proxy (e.g., NGINX)
- You must pass websocket headers and give it generous timeouts. Minimal NGINX location example:

```nginx
location /api/ws/ {
  proxy_pass http://prefect-server;   # your upstream service
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_read_timeout 600s;
  proxy_send_timeout 600s;
}
```
- For NGINX Ingress, add annotations like:

```yaml
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
- For ALB/ELB, increase the idle timeout (e.g., ≥300s). For Cloudflare, ensure WebSocket support is enabled and connections aren't being terminated early.

Can I disable it?
- There isn't a supported "turn off websockets" switch in Prefect 3.x. If you just want to silence the warning, you can lower log verbosity, e.g.:
```
PREFECT_LOGGING_LEVEL=ERROR
```
…but you'll hide other useful warnings too. Better to fix the ws path.

Docs you may find useful
- Server Helm (ingress/websockets)
- Security settings
- Settings reference

If you share:
- Whether you're on Cloud or self-hosted
- The exact `wss://...` URL from the warning
- Any proxy/load balancer in the path
…I can give you a targeted config snippet instead of this thrilling generality.
a
@Marvin I can see the events in the prefect server ui at the beginning of my flow, but it loses connection midway
m
thought for 45.0 seconds
If it connects at the start then drops mid-flow, that's almost always an idle/keepalive timeout somewhere in the path. The websocket stays open for the duration of the run; many proxies/LBs/firewalls kill "quiet" connections after N seconds, then the client tries to reconnect and the handshake times out.

What to check/fix
- Reverse proxy/LB timeouts:
  - NGINX (self-hosted):

```nginx
location /api/ws/ {
  proxy_pass http://prefect-server;   # your upstream
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_read_timeout 600s;   # increase as needed
  proxy_send_timeout 600s;
}
```
  - For Kubernetes NGINX Ingress:

```yaml
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
- AWS ALB/ELB: raise the idle timeout (e.g., ≥ 300s).
- Cloudflare/other CDNs: ensure WebSocket is enabled and not subject to short timeouts.
- Corporate proxy/egress filters: allow wss to your Prefect API domain and exclude it from TLS inspection. Websockets over CONNECT must be permitted.
- Certificate/TLS issues on reconnect: enable debug to confirm.
- Server restarts: if you self-host, ensure the events service stays healthy during the run.

Quick diagnostics
- Set `PREFECT_DEBUG_MODE=1` and rerun to capture the exact disconnect + handshake error.
- Note roughly when it drops (e.g., ~60s, ~120s…) and match that to any proxy/LB idle defaults; a small probe to measure this is sketched below.
- Confirm `PREFECT_API_URL` is correct; the client derives `wss://` from it.
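If you want to pin down that timing, here is a rough probe that opens a websocket to your host and measures how long a completely quiet connection survives. It's a sketch with a placeholder URL: keepalive pings are disabled so intermediaries see real silence, and the server itself may also close an idle or unauthenticated socket, so treat the number as a hint rather than proof.

```python
# Rough idle-timeout probe (sketch): open a websocket, stay silent, and time
# how long it takes before something in the path closes it.
# Assumes the `websockets` library is installed; replace the URL with your own host.
import asyncio
import time

import websockets


async def main() -> None:
    url = "wss://your-prefect-host/api/events/in"  # placeholder
    start = time.monotonic()
    try:
        # ping_interval=None keeps the connection truly quiet so proxy/LB idle
        # timeouts are not masked by protocol-level keepalives.
        async with websockets.connect(url, ping_interval=None) as ws:
            await ws.recv()  # we expect no data; this waits until the socket closes
    except websockets.ConnectionClosed:
        print(f"Connection closed after ~{time.monotonic() - start:.0f}s")
    except Exception as exc:
        print(f"Failed after ~{time.monotonic() - start:.0f}s: {exc!r}")


if __name__ == "__main__":
    asyncio.run(main())
```
If the figure lines up with a round default (60s, 120s, 300s, …), that's very likely the idle timeout to raise.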
A pragmatic workaround (keeps traffic flowing)
- If a single long task is "quiet" for minutes, emit a small heartbeat log periodically:

```python
from prefect import task, get_run_logger
import time

@task
def long_task():
    logger = get_run_logger()
    for i in range(1800):
        # ... your work here ...
        if i % 60 == 0:
            logger.info("still working...")
        time.sleep(1)
```
This won't fix a too-short hard idle timeout, but it avoids "silent" periods that some proxies kill.

Docs you may want
- Server/Ingress notes (websockets)
- Security and networking
- Settings reference

If you share whether you're on Cloud or self-hosted, the exact wss URL, and any proxy/LB in front, I can suggest precise settings instead of waving at the ether.
a
@Marvin What debug logs should I be looking for?
m
thought for 86.8 seconds
Short version: enable debug, then look for the lifecycle of the websocket: connect → 101/connected → idle → drop/close → retry/backoff → reconnect/backfill or error. The exact wording varies by environment, but these patterns are consistent.

How to turn the noise up
- Set:

```
PREFECT_DEBUG_MODE=1
PREFECT_LOGGING_LEVEL=DEBUG
```
- Optional: also enable the underlying libraries so you see the handshake and close codes:

```python
# Early in your script, before importing/starting flows
import logging

for name in ("prefect", "prefect.client", "websockets", "websockets.client", "httpx", "anyio"):
    logging.getLogger(name).setLevel(logging.DEBUG)

# If you don't see output from the third-party loggers, make sure a root handler
# exists, e.g.: logging.basicConfig(level=logging.DEBUG)
```
What to look for (typical patterns)
- Connection attempts
  - "Connecting/opening websocket to wss://…"
  - "GET /api/ws/… 101 Switching Protocols" (from web/proxy logs if you have them)
- Successful connection
  - "Websocket connected/handshake complete"
  - Library traces from `websockets.client` showing the upgrade
- Disconnects and timeouts (your symptom)
  - "Connection closed: code=1006/1001/1000" (1006 is an abnormal close; often an idle timeout)
  - "timed out during opening handshake" (during reconnect)
  - "Read/Write timeout" or "EOF in violation of protocol"
  - "Server closed connection" or "Network is unreachable"
- Retries/backoff
  - "Websocket disconnected; retrying in Xs" (often with exponential backoff)
- Proxy/load balancer problems
  - "InvalidStatusCode: 403/400 while opening websocket"
  - "Missing/invalid Upgrade header" or "Connection: keep-alive" instead of "upgrade"
  - "Proxy CONNECT failed/closed by proxy"
- TLS/cert issues
  - "ssl.SSLError: CERTIFICATE_VERIFY_FAILED"
  - "tlsv1 alert unknown ca" or hostname mismatch
- Backfill after reconnect (if enabled/possible)
  - Mentions of "backfill", "replaying", or "catching up" events since a timestamp

If you're self-hosting, also check server-side logs around the drop:
- Prefect server logs for events/subscriptions (often under "prefect.server" or "subscriptions")
- Reverse proxy/ingress logs for 101 responses on /api/ws/* and any 400/403/499/504 around the same time
- LB/CDN logs for idle timeout hits

Correlate the timestamp when it drops with any fixed timeout (e.g., 60s, 120s, 300s). If you keep seeing ~300s, that's your load balancer's idle timeout politely ejecting you.

Docs, in case you need more bedtime reading:
- Settings: Prefect settings reference
- Networking/security notes: Security settings
- Ingress/Helm (websocket hints): Server Helm
a
@Marvin
```
prefect.events.clients - Unable to connect to 'wss://.../api/events/in'. Please check your network settings to ensure websocket connections to the API are allowed. Otherwise event data (including task run data) may be lost. Reason: timed out during opening handshake. Set PREFECT_DEBUG_MODE=1 to see the full error.
(task, pid=261655) Traceback (most recent call last):
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 541, in __await_impl__
(task, pid=261655)     self.connection = await self.create_connection()
(task, pid=261655)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 467, in create_connection
(task, pid=261655)     _, connection = await loop.create_connection(factory, **kwargs)
(task, pid=261655)                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 1107, in create_connection
(task, pid=261655)     sock = await self._connect_sock(
(task, pid=261655)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 1010, in _connect_sock
(task, pid=261655)     await self.sock_connect(sock, address)
(task, pid=261655)   File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
(task, pid=261655)     return await fut
(task, pid=261655)            ^^^^^^^^^
(task, pid=261655) asyncio.exceptions.CancelledError
(task, pid=261655) 
(task, pid=261655) The above exception was the direct cause of the following exception:
(task, pid=261655) 
(task, pid=261655) Traceback (most recent call last):
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 539, in __await_impl__
(task, pid=261655)     async with asyncio_timeout(self.open_timeout):
(task, pid=261655)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
(task, pid=261655)     raise TimeoutError from exc_val
(task, pid=261655) TimeoutError
(task, pid=261655) 
(task, pid=261655) The above exception was the direct cause of the following exception:
(task, pid=261655) 
(task, pid=261655) Traceback (most recent call last):
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/prefect/events/clients.py", line 311, in _reconnect
(task, pid=261655)     self._websocket = await self._connect.__aenter__()
(task, pid=261655)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 587, in __aenter__
(task, pid=261655)     return await self
(task, pid=261655)            ^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/prefect/_internal/websockets.py", line 135, in _proxy_connect
(task, pid=261655)     proto = await self.__await_impl__()
(task, pid=261655)             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655)   File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 578, in __await_impl__
(task, pid=261655)     raise TimeoutError("timed out during opening handshake") from exc
(task, pid=261655) TimeoutError: timed out during opening handshake
```
m
thought for 58.2 seconds
Thanks for the stack trace; it's helpful. The key bits:
- `prefect._internal.websockets._proxy_connect` appears in the traceback. That means your run is attempting to go through a proxy (an env var like `HTTPS_PROXY`/`HTTP_PROXY`/`ALL_PROXY` is set).
- The failure is `TimeoutError: timed out during opening handshake` while reconnecting. So the TCP/TLS/HTTP Upgrade never completes: classic proxy/LB/firewall behavior, not an app crash.

Likely causes
- A corporate proxy that doesn't support WebSocket upgrades or requires auth
- A proxy/LB idle timeout dropped the first connection; reconnect attempts can't complete the upgrade
- Proxy/TLS inspection interfering with the TLS handshake
- A wrong or missing `NO_PROXY`, so your traffic is forced through a proxy needlessly

Targeted checks/fixes
1) Check whether a proxy is in play
```bash
echo $HTTPS_PROXY
echo $HTTP_PROXY
echo $ALL_PROXY
echo $NO_PROXY
```
- If you can reach the API directly, bypass the proxy for it:

```bash
# include your API host and (if needed) domain
export NO_PROXY="your-api.example.com,.example.com,localhost,127.0.0.1"
# or unset the proxy entirely for this process
unset HTTPS_PROXY HTTP_PROXY ALL_PROXY
```
Then rerun.

2) If you must use a proxy
- Ensure it supports WebSocket (HTTP/1.1 Upgrade) and allows CONNECT to the API host:443.
- If it requires auth, include credentials in `HTTPS_PROXY` (or use a proxy config your security folks approve).
- If there's TLS interception, install the proxy's CA and point to it:

```bash
export SSL_CERT_FILE=/path/to/corp-ca.pem
# often also helpful
export REQUESTS_CA_BUNDLE=$SSL_CERT_FILE
```
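If you're not sure whether TLS interception is happening at all, one quick check is to look at who issued the certificate you actually receive from the API host. This is a stdlib-only sketch with a placeholder host; an unfamiliar or corporate issuer strongly suggests inspection is in the path, and a CERTIFICATE_VERIFY_FAILED error here is itself a useful signal. If all egress must go through the proxy, the direct connection failing is informative too.

```python
# Sketch: inspect the certificate presented by the API host to spot TLS interception.
# Standard library only; replace the host with your Prefect API host.
import socket
import ssl

host = "your-prefect-host.example.com"  # placeholder
context = ssl.create_default_context()

with socket.create_connection((host, 443), timeout=10) as sock:
    # Raises SSLCertVerificationError if the presented chain isn't trusted locally,
    # which is a common symptom of an intercepting proxy.
    with context.wrap_socket(sock, server_hostname=host) as tls:
        cert = tls.getpeercert()
        issuer = dict(rdn[0] for rdn in cert.get("issuer", ()))
        print(f"TLS version: {tls.version()}")
        print(f"Issuer:      {issuer}")
```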
3) Self-hosted behind NGINX/Ingress/LB
- Make sure the events websocket path is upgraded and not subject to short idle timeouts. For NGINX:

```nginx
location /api/events/ {
  proxy_pass http://prefect-server;    # your upstream
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_read_timeout 600s;
  proxy_send_timeout 600s;
}
```
- For NGINX Ingress:

```yaml
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
- For ALB/ELB: increase the idle timeout (≥300s). For Cloudflare/CDN: enable WebSocket and avoid short timeouts.

4) Quick repro without Prefect
- From the same machine/network, try:

```bash
wscat -c wss://<your-host>/api/events/in
```
or a tiny Python probe with a longer open timeout (sketched below). If that times out too, it's definitely the network/proxy path.
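For example, a minimal probe using the `websockets` library; the URL is a placeholder and the generous 30-second `open_timeout` just gives a slow proxy CONNECT a chance to finish. Depending on your environment this direct probe may not traverse the same proxy Prefect's client uses, so it's worth trying both with and without the proxy variables set.

```python
# Sketch: attempt the websocket opening handshake by hand with a generous timeout.
# Assumes the `websockets` library is installed; replace the URL with your host.
import asyncio

import websockets


async def main() -> None:
    url = "wss://your-prefect-host/api/events/in"  # placeholder
    try:
        async with websockets.connect(url, open_timeout=30) as ws:
            print("Handshake completed.")
    except TimeoutError:
        print("Timed out during the opening handshake, same as the Prefect client.")
    except Exception as exc:
        print(f"Handshake failed: {exc!r}")


if __name__ == "__main__":
    asyncio.run(main())
```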
5) Turn up logs to confirm

```
PREFECT_DEBUG_MODE=1 PREFECT_LOGGING_LEVEL=DEBUG
```
Look for:
- Proxy errors (403/400), missing Upgrade/Connection headers
- TLS errors (CERTIFICATE_VERIFY_FAILED)
- A repeatable disconnect timing (e.g., ~300s → LB idle timeout)

Docs for reference
- Security and networking notes
- Server/Ingress (websockets)
- Settings reference

If you can share:
- Whether you're on Prefect Cloud or self-hosted
- The presence/value of `HTTPS_PROXY`/`NO_PROXY`
- Any proxy/LB in front and the approximate drop timing
…I can suggest exact config lines rather than making you play network whack-a-mole.
a
@Marvin Is this an issue with the number of tasks I am creating? This usually happens after I spawn 6k+ tasks
m
thought for 61.4 seconds
Short answer: not inherently. Prefect keeps a single websocket per process; spawning 6k tasks doesn't exceed some "ws limit." But 6k+ tasks usually means:
- Long runtime (more time to hit proxy/LB idle/rolling timeouts)
- A flood of events/logs (which can stress a proxy or server and make reconnects slow)

Your traceback shows `prefect._internal.websockets._proxy_connect`, so a proxy is in play. The handshake is timing out during reconnect: classic proxy/load balancer behavior when under load or with strict timeouts.

What to try first
1) Bypass the proxy for the Prefect API host
```bash
echo $HTTPS_PROXY; echo $HTTP_PROXY; echo $ALL_PROXY; echo $NO_PROXY
# If possible, bypass:
export NO_PROXY="your-api.host,.your-domain,localhost,127.0.0.1"
# or temporarily unset to test
unset HTTPS_PROXY HTTP_PROXY ALL_PROXY
```
If that fixes it, add a proper NO_PROXY permanently. (A quick way to confirm which proxy settings Python will actually pick up is sketched below.)
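This uses only the standard library's `urllib` helpers; the host is a placeholder and the check is advisory, since Prefect's own client may resolve proxies slightly differently:

```python
# Sketch: show which proxy (if any) Python would pick up for the API host,
# and whether NO_PROXY exempts it. Standard library only; replace the host with yours.
from urllib.request import getproxies, proxy_bypass

host = "your-prefect-host.example.com"  # placeholder

proxies = getproxies()          # reads HTTP_PROXY / HTTPS_PROXY / ALL_PROXY, etc.
bypassed = proxy_bypass(host)   # truthy if NO_PROXY (or platform settings) exempts this host

print(f"Proxies detected: {proxies or 'none'}")
print(f"{host} bypasses the proxy: {bool(bypassed)}")
```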
2) If a proxy is required, raise limits/timeouts and support websockets
- Ensure HTTP/1.1 Upgrade is allowed and idle/keepalive timeouts are generous.
- NGINX example:

```nginx
location /api/events/ {
  proxy_pass http://prefect-server;
  proxy_http_version 1.1;
  proxy_set_header Upgrade $http_upgrade;
  proxy_set_header Connection "upgrade";
  proxy_read_timeout 600s;
  proxy_send_timeout 600s;
}
```
- Ingress annotations:

```yaml
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
```
- ALB/ELB: increase the idle timeout (≥300s). Cloudflare/CDN: enable WebSocket and avoid short timeouts/rate limits.

3) Reduce event volume during the 6k surge
- Cut API logging as a test:

```bash
export PREFECT_LOGGING_TO_API=false
export PREFECT_LOGGING_LEVEL=WARNING
```
- Avoid chatty prints; set `log_prints=False` on very busy tasks, and throttle heartbeat logs.
- If you're emitting logs inside tight loops, batch them (see the sketch below).
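A minimal sketch of what "throttle and batch" can look like inside a busy task; `log_prints=False` is a real task option, while the loop, batch size, and messages here are purely illustrative:

```python
# Sketch: a busy task that keeps its log/event volume low.
# log_prints=False stops captured prints from becoming API log records;
# the batching below emits one summary line per chunk instead of one per item.
from prefect import task, get_run_logger


@task(log_prints=False)
def process_items(items: list[str], batch_size: int = 500):
    logger = get_run_logger()
    done = 0
    for item in items:
        # ... your work here ...
        done += 1
        if done % batch_size == 0:
            logger.info("processed %d/%d items", done, len(items))
    logger.info("finished: %d items", done)
```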
4) Sanity check from the same host/network

```bash
wscat -c wss://<your-prefect-host>/api/events/in
```
If the handshake stalls there too (especially under load), it's the path/proxy, not Prefect.

5) If self-hosting, ensure the API/events service isn't saturated
- Scale the API/events pods, check DB load, and watch for 5xx/4xx or slow upstreams around the time of the drop.

Why it correlates with "6k+ tasks"
- Longer runtime crosses fixed idle/rotation windows on proxies/LBs.
- The event/log burst can trigger proxy throttling, connection limits, or resource contention, so the reconnect's HTTP Upgrade can't complete within the websocket library's open_timeout, yielding "timed out during opening handshake".

If you share:
- Cloud vs self-hosted, the wss host, and whether `HTTPS_PROXY`/`NO_PROXY` are set
- The rough wall-clock time when it drops (e.g., ~300s, ~900s)
- Whether it coincides with a big fan-out/map phase (i.e., a burst of logs/events)
…I'll give you exact config knobs instead of generalities.

Refs
- Security/networking notes
- Ingress/WebSocket considerations
- Settings reference