Andrew Xue
08/15/2025, 6:38 PM

Marvin
08/15/2025, 6:38 PM

Marvin
08/15/2025, 6:41 PM
The ws URL is derived from PREFECT_API_URL (https→wss, http→ws).
First checks
- Set PREFECT_DEBUG_MODE=1 and re-run to see the underlying error (TLS, 403, proxy reset, etc.)
- Confirm the exact URL in the warning is reachable from that environment. If you can, test with a tool like wscat from the same network.
- Verify PREFECT_API_URL is correct and reachable over HTTPS if you expect wss://.
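For illustration, a minimal sketch of that derivation (not Prefect's internal code, just the https→wss / http→ws rule stated above; the host is a placeholder):
# Illustration only: how the websocket base URL follows from PREFECT_API_URL.
api_url = "https://your-prefect-host/api"   # placeholder value of PREFECT_API_URL
ws_base = api_url.replace("https://", "wss://", 1).replace("http://", "ws://", 1)
print(ws_base)  # -> wss://your-prefect-host/api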
If you’re on Prefect Cloud
- Ensure outbound egress allows wss to api.prefect.cloud:443 without TLS interception.
- If you must use a proxy, it must support the HTTP/1.1 Upgrade handshake for websockets.
If you’re self-hosting behind a reverse proxy (e.g., NGINX)
- You must pass websocket headers and give it generous timeouts. Minimal NGINX location example:
location /api/ws/ {
    proxy_pass http://prefect-server;  # your upstream service
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
}
- For NGINX Ingress, add annotations like:
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
- For ALB/ELB, increase idle timeout (e.g., ≥300s). For Cloudflare, ensure WebSocket is enabled and not being terminated early.
Can I disable it?
- There isn’t a supported “turn off websockets” switch in Prefect 3.x. If you just want to silence the warning, you can lower log verbosity, e.g.:
PREFECT_LOGGING_LEVEL=ERROR
…but you’ll hide other useful warnings too. Better to fix the ws path.
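If the goal is only to quiet this specific warning, a narrower (unofficial) option is to raise the level of just the events-client logger. This is a sketch assuming the warning is emitted by the prefect.events.clients logger (the name shown in the error quoted later in this thread):
# Sketch: silence only the events-client connection warnings instead of all Prefect logs.
# Assumes the warning comes from the "prefect.events.clients" logger; run this early,
# before starting your flow.
import logging
logging.getLogger("prefect.events.clients").setLevel(logging.ERROR)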
Docs you may find useful
- Server Helm (ingress/websockets)
- Security settings
- Settings reference
If you share:
- Whether you’re on Cloud or self-hosted
- The exact wss://... URL from the warning
- Any proxy/load balancer in the path
…I can give you a targeted config snippet instead of this thrilling generality.

Andrew Xue
08/15/2025, 6:43 PM

Marvin
08/15/2025, 6:43 PM

Marvin
08/15/2025, 6:44 PM
location /api/ws/ {
    proxy_pass http://prefect-server;  # your upstream
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 600s;  # increase as needed
    proxy_send_timeout 600s;
}
For Kubernetes NGINX Ingress:
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
- AWS ALB/ELB: raise the idle timeout (e.g., ≥ 300s).
- Cloudflare/other CDNs: ensure WebSocket is enabled and not subject to short timeouts.
- Corporate proxy/egress filters: allow wss to your Prefect API domain and exclude it from TLS inspection. Websockets over CONNECT must be permitted.
- Certificate/TLS issues on reconnect: enable debug to confirm.
- Server restarts: if you self‑host, ensure the events service stays healthy during the run.
Quick diagnostics
- Set PREFECT_DEBUG_MODE=1 and rerun to capture the exact disconnect + handshake error.
- Note roughly when it drops (e.g., ~60s, ~120s…) and match that to any proxy/LB idle defaults.
- Confirm PREFECT_API_URL is correct; the client derives wss:// from it.
A pragmatic workaround (keeps traffic flowing)
- If a single long task is “quiet” for minutes, emit a small heartbeat log periodically:
from prefect import task, get_run_logger
import time

@task
def long_task():
    logger = get_run_logger()
    for i in range(1800):
        # ... your work here ...
        if i % 60 == 0:
            logger.info("still working...")
        time.sleep(1)
This won’t fix a too‑short hard idle timeout, but it avoids “silent” periods that some proxies kill.
Docs you may want
- Server/Ingress notes (websockets)
- Security and networking
- Settings reference
If you share whether you’re on Cloud or self‑hosted, the exact wss URL, and any proxy/LB in front, I can suggest precise settings instead of waving at the ether.

Andrew Xue
08/15/2025, 7:20 PM

Marvin
08/15/2025, 7:20 PM

Marvin
08/15/2025, 7:21 PM
PREFECT_DEBUG_MODE=1
PREFECT_LOGGING_LEVEL=DEBUG
- Optional: also enable underlying libraries so you see the handshake and close codes:
# Early in your script, before importing/starting flows
import logging
for name in ("prefect", "prefect.client", "websockets", "websockets.client", "httpx", "anyio"):
    logging.getLogger(name).setLevel(logging.DEBUG)
What to look for (typical patterns)
- Connection attempts
  - “Connecting/opening websocket to wss://…”
  - “GET /api/ws/… 101 Switching Protocols” (from web/proxy logs if you have them)
- Successful connection
  - “Websocket connected/handshake complete”
  - Library traces from websockets.client showing the upgrade
- Disconnects and timeouts (your symptom)
  - “Connection closed: code=1006/1001/1000” (1006 is abnormal close; often idle timeout)
  - “timed out during opening handshake” (during reconnect)
  - “Read/Write timeout” or “EOF in violation of protocol”
  - “Server closed connection” or “Network is unreachable”
- Retries/backoff
  - “Websocket disconnected; retrying in Xs” (often with exponential backoff)
- Proxy/load balancer problems
  - “InvalidStatusCode: 403/400 while opening websocket”
  - “Missing/invalid Upgrade header” or “Connection: keep-alive” instead of “upgrade”
  - “Proxy CONNECT failed/closed by proxy”
- TLS/cert issues
  - “ssl.SSLError: CERTIFICATE_VERIFY_FAILED”
  - “tlsv1 alert unknown ca” or hostname mismatch
- Backfill after reconnect (if enabled/possible)
  - Mentions of “backfill”, “replaying”, or “catching up” events since timestamp
If you’re self‑hosting, also check server-side logs around the drop:
- Prefect server logs for events/subscriptions (often under “prefect.server” or “subscriptions”)
- Reverse proxy/ingress logs for 101 responses on /api/ws/* and any 400/403/499/504 around the same time
- LB/CDN logs for idle timeout hits
Correlate the timestamp when it drops with any fixed timeout (e.g., 60s, 120s, 300s). If you keep seeing ~300s, that’s your load balancer’s idle timeout politely ejecting you.
Docs, in case you need more bedtime reading:
- Settings: Prefect settings reference
- Networking/security notes: Security settings
- Ingress/Helm (websocket hints): Server Helm

Andrew Xue
08/15/2025, 7:33 PM
prefect.events.clients - Unable to connect to 'wss://.../api/events/in'. Please check your network settings to ensure websocket connections to the API are allowed. Otherwise event data (including task run data) may be lost. Reason: timed out during opening handshake. Set PREFECT_DEBUG_MODE=1 to see the full error.
(task, pid=261655) Traceback (most recent call last):
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 541, in __await_impl__
(task, pid=261655) self.connection = await self.create_connection()
(task, pid=261655) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 467, in create_connection
(task, pid=261655) _, connection = await loop.create_connection(factory, **kwargs)
(task, pid=261655) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 1107, in create_connection
(task, pid=261655) sock = await self._connect_sock(
(task, pid=261655) ^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/base_events.py", line 1010, in _connect_sock
(task, pid=261655) await self.sock_connect(sock, address)
(task, pid=261655) File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/selector_events.py", line 651, in sock_connect
(task, pid=261655) return await fut
(task, pid=261655) ^^^^^^^^^
(task, pid=261655) asyncio.exceptions.CancelledError
(task, pid=261655)
(task, pid=261655) The above exception was the direct cause of the following exception:
(task, pid=261655)
(task, pid=261655) Traceback (most recent call last):
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 539, in __await_impl__
(task, pid=261655) async with asyncio_timeout(self.open_timeout):
(task, pid=261655) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.local/share/uv/python/cpython-3.12.11-linux-x86_64-gnu/lib/python3.12/asyncio/timeouts.py", line 115, in __aexit__
(task, pid=261655) raise TimeoutError from exc_val
(task, pid=261655) TimeoutError
(task, pid=261655)
(task, pid=261655) The above exception was the direct cause of the following exception:
(task, pid=261655)
(task, pid=261655) Traceback (most recent call last):
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/prefect/events/clients.py", line 311, in _reconnect
(task, pid=261655) self._websocket = await self._connect.__aenter__()
(task, pid=261655) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 587, in __aenter__
(task, pid=261655) return await self
(task, pid=261655) ^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/prefect/_internal/websockets.py", line 135, in _proxy_connect
(task, pid=261655) proto = await self.__await_impl__()
(task, pid=261655) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(task, pid=261655) File "/home/gcpuser/.cache/uv/archive-v0/mfCFK0KR3mRjsoaUT2Vxj/lib/python3.12/site-packages/websockets/asyncio/client.py", line 578, in __await_impl__
(task, pid=261655) raise TimeoutError("timed out during opening handshake") from exc
(task, pid=261655) TimeoutError: timed out during opening handshake
Marvin
08/15/2025, 7:33 PM

Marvin
08/15/2025, 7:34 PM
- prefect._internal.websockets._proxy_connect appears in the traceback. That means your run is attempting to go through a proxy (env like `HTTPS_PROXY`/`HTTP_PROXY`/`ALL_PROXY` is set).
- The failure is TimeoutError: timed out during opening handshake while reconnecting. So the TCP/TLS/HTTP Upgrade never completes — classic proxy/LB/firewall behavior, not an app crash.
Likely causes
- Corporate proxy that doesn’t support WebSocket upgrades or requires auth
- Proxy/LB idle timeout dropped the first connection; reconnect attempts can’t complete the upgrade
- Proxy/TLS inspection interfering with the TLS handshake
- Wrong/no NO_PROXY so your traffic is forced through a proxy needlessly
Targeted checks/fixes
1) Check whether a proxy is in play
echo $HTTPS_PROXY
echo $HTTP_PROXY
echo $ALL_PROXY
echo $NO_PROXY
- If you can reach the API directly, bypass the proxy for it:
# include your API host and (if needed) domain
export NO_PROXY="your-api.example.com,.example.com,localhost,127.0.0.1"
# or unset the proxy entirely for this process
unset HTTPS_PROXY HTTP_PROXY ALL_PROXY
Then rerun.
2) If you must use a proxy
- Ensure it supports WebSocket (HTTP/1.1 Upgrade) and allows CONNECT to the API host:443.
- If it requires auth, include credentials in HTTPS_PROXY (or use a proxy config your security folks approve).
- If there’s TLS interception, install the proxy’s CA and point to it:
export SSL_CERT_FILE=/path/to/corp-ca.pem
# often also helpful
export REQUESTS_CA_BUNDLE=$SSL_CERT_FILE
3) Self-hosted behind NGINX/Ingress/LB
- Make sure the events websocket path is upgraded and not subject to short idle timeouts. For NGINX:
location /api/events/ {
    proxy_pass http://prefect-server;  # your upstream
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
}
- For NGINX Ingress:
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
- For ALB/ELB: increase idle timeout (≥300s). For Cloudflare/CDN: enable WebSocket and avoid short timeouts.
4) Quick repro without Prefect
- From the same machine/network, try:
wscat -c wss://<your-host>/api/events/in
or a tiny Python probe with a longer open timeout (a sketch follows this list). If that times out too, it’s definitely the network/proxy path.
5) Turn up logs to confirm
PREFECT_DEBUG_MODE=1 PREFECT_LOGGING_LEVEL=DEBUG
Look for:
- Proxy errors (403/400), missing Upgrade/Connection headers
- TLS errors (CERTIFICATE_VERIFY_FAILED)
- Repeatable disconnect timing (e.g., ~300s → LB idle timeout)
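A minimal Python probe along those lines, using the third-party websockets package with a generous open_timeout; the host below is a placeholder and the /api/events/in path matches the warning quoted above:
# Sketch: open the events websocket directly (no Prefect) with a long open_timeout.
# A fast 4xx rejection still proves the handshake path works; a hang or timeout
# points at the proxy/LB/firewall instead.
import asyncio
import websockets  # pip install websockets

async def probe(url: str = "wss://your-prefect-host/api/events/in") -> None:
    try:
        async with websockets.connect(url, open_timeout=60):
            print(f"handshake OK: {url}")
    except Exception as exc:
        print(f"handshake failed for {url}: {exc!r}")

if __name__ == "__main__":
    asyncio.run(probe())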
Docs for reference
- Security and networking notes
- Server/Ingress (websockets)
- Settings reference
If you can share:
- Whether you’re on Prefect Cloud or self‑hosted
- The presence/value of `HTTPS_PROXY`/`NO_PROXY`
- Any proxy/LB in front and the approximate drop timing
…I can suggest exact config lines rather than making you play network whack‑a‑mole.

Andrew Xue
08/15/2025, 10:15 PM

Marvin
08/15/2025, 10:15 PM

Marvin
08/15/2025, 10:16 PM
prefect._internal.websockets._proxy_connect appears in the traceback, so a proxy is in play. The handshake is timing out during reconnect — classic proxy/load balancer behavior when under load or with strict timeouts.
What to try first
1) Bypass the proxy for the Prefect API host
echo $HTTPS_PROXY; echo $HTTP_PROXY; echo $ALL_PROXY; echo $NO_PROXY
# If possible, bypass:
export NO_PROXY="your-api.host,.your-domain,localhost,127.0.0.1"
# or temporarily unset to test
unset HTTPS_PROXY HTTP_PROXY ALL_PROXY
If that fixes it, add a proper NO_PROXY permanently.
2) If a proxy is required, raise limits/timeouts and support websockets
- Ensure HTTP/1.1 Upgrade is allowed and idle/keepalive timeouts are generous.
- NGINX example:
location /api/events/ {
    proxy_pass http://prefect-server;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_read_timeout 600s;
    proxy_send_timeout 600s;
}
- Ingress annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "600"
nginx.ingress.kubernetes.io/proxy-send-timeout: "600"
- ALB/ELB: increase idle timeout (≥300s). Cloudflare/CDN: enable WebSocket and avoid short timeouts/rate limits.
3) Reduce event volume during the 6k surge
- Cut API logging as a test:
export PREFECT_LOGGING_TO_API=false
export PREFECT_LOGGING_LEVEL=WARNING
- Avoid chatty prints; set log_prints=False on very busy tasks, and throttle heartbeat logs (a sketch follows this list).
- If you’re emitting logs inside tight loops, batch them.
4) Sanity check from the same host/network
wscat -c wss://<your-prefect-host>/api/events/in
If the handshake stalls there too (especially under load), it’s the path/proxy, not Prefect.
5) If self-hosting, ensure the API/events service isn’t saturated
- Scale the API/events pods, check DB load, and watch for 5xx/4xx or slow upstreams around the time of the drop.
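As a sketch of the “avoid chatty prints” idea from step 3, assuming the log_prints parameter available on @task in recent Prefect releases; the task body and counts are placeholders:
from prefect import task, get_run_logger
import time

# Sketch: a busy task that suppresses print() capture and logs a heartbeat at most
# once a minute, instead of emitting a log line per item (placeholders throughout).
@task(log_prints=False)
def busy_task(n_items: int = 6000):
    logger = get_run_logger()
    last_beat = time.monotonic()
    for i in range(n_items):
        # ... per-item work here ...
        if time.monotonic() - last_beat >= 60:
            logger.info("processed %d/%d items", i, n_items)
            last_beat = time.monotonic()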
Why it correlates with “6k+ tasks”
- Longer runtime crosses fixed idle/rotation windows on proxies/LBs.
- The event/log burst can trigger proxy throttling, connection limits, or resource contention so the reconnect’s HTTP Upgrade cannot complete within the websocket library’s open_timeout, yielding “timed out during opening handshake”.
If you share:
- Cloud vs self-hosted, the wss host, and whether HTTPS_PROXY/NO_PROXY are set
- Rough wall-clock time when it drops (e.g., ~300s, ~900s)
- Whether it coincides with a big fan-out/map phase (i.e., a burst of logs/events)
…I’ll give you exact config knobs instead of generalities.
Refs
- Security/networking notes
- Ingress/WebSocket considerations
- Settings reference