# marvin-ai
s
@Marvin I'm running Prefect on a RunPod instance. When I'm running a task that takes a bit of time (say, 20+ minutes), sometimes I get a worker pool crash:

```
+ Exception Group Traceback (most recent call last):
|   File "/usr/local/lib/python3.11/dist-packages/prefect/cli/_utilities.py", line 44, in wrapper
|     return fn(*args, **kwargs)
|            ^^^^^^^^^^^^^^^^^^^
|   File "/usr/local/lib/python3.11/dist-packages/prefect/cli/_types.py", line 156, in sync_fn
|     return asyncio.run(async_fn(*args, **kwargs))
|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/usr/lib/python3.11/asyncio/runners.py", line 190, in run
|     return runner.run(main)
|            ^^^^^^^^^^^^^^^^
|   File "/usr/lib/python3.11/asyncio/runners.py", line 118, in run
|     return self._loop.run_until_complete(task)
|            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
|   File "/usr/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
|     return future.result()
|            ^^^^^^^^^^^^^^^
|   File "/usr/local/lib/python3.11/dist-packages/prefect/cli/worker.py", line 168, in start
|     await worker.start(
|   File "/usr/local/lib/python3.11/dist-packages/prefect/workers/process.py", line 158, in start
|     async with self as worker:
|   File "/usr/local/lib/python3.11/dist-packages/prefect/workers/process.py", line 320, in __aexit__
|     await super().__aexit__(*exc_info)
|   File "/usr/local/lib/python3.11/dist-packages/prefect/workers/base.py", line 1600, in __aexit__
|     raise exceptions[0] from None
|   File "/usr/lib/python3.11/contextlib.py", line 728, in __aexit__
|     cb_suppress = await cb(*exc_details)
|                   ^^^^^^^^^^^^^^^^^^^^^^
|   File "/usr/local/lib/python3.11/dist-packages/prefect/runner/runner.py", line 1610, in __aexit__
|     await self._exit_stack.__aexit__(*exc_info)
|   File "/usr/lib/python3.11/contextlib.py", line 745, in __aexit__
|     raise exc_details[1]
|   File "/usr/lib/python3.11/contextlib.py", line 728, in __aexit__
|     cb_suppress = await cb(*exc_details)
|                   ^^^^^^^^^^^^^^^^^^^^^^
|   File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
|     raise BaseExceptionGroup(
| ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
+-+---------------- 1 ----------------
  | Exception Group Traceback (most recent call last):
  |   File "/usr/local/lib/python3.11/dist-packages/prefect/workers/process.py", line 162, in start
  |     async with anyio.create_task_group() as loops_task_group:
  |   File "/usr/local/lib/python3.11/dist-packages/anyio/_backends/_asyncio.py", line 736, in __aexit__
  |     raise BaseExceptionGroup(
  | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
  +-+---------------- 1 ----------------
    | Traceback (most recent call last):
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/utilities/services.py", line 64, in critical_service_loop
    |     await workload()
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/workers/base.py", line 1107, in sync_with_backend
    |     await self._update_local_work_pool_info()
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/workers/base.py", line 1004, in _update_local_work_pool_info
    |     work_pool = await self._client.create_work_pool(work_pool=wp)
    |                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/client/orchestration/_work_pools/client.py", line 478, in create_work_pool
    |     response = await self.request(
    |                ^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/client/orchestration/base.py", line 53, in request
    |     return await self._client.send(request)
    |            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/client/base.py", line 379, in send
    |     response.raise_for_status()
    |   File "/usr/local/lib/python3.11/dist-packages/prefect/client/base.py", line 163, in raise_for_status
    |     raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
    | prefect.exceptions.PrefectHTTPStatusError: Client error '404 Not Found' for url 'https://rdzs18c4k9ub52-4200.proxy.runpod.net/api/work_pools/'
    | For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/404
```

What can be causing this?
m
thought for 315.3 seconds
Short version: a 404 on `/api/work_pools` during a long run almost always points to your worker losing a valid route to the Prefect API (proxy/path mismatch, server restart, or DB reset), not to the task itself. On RunPod this is commonly due to the HTTP proxy or the server being restarted or idled, or the API path not being forwarded consistently.

What the trace shows
- The worker's background "sync with backend" loop tried to ensure the work pool exists and attempted `POST https://...-4200.proxy.runpod.net/api/work_pools/`.
- It got a 404 on the collection route itself. That usually means "the Prefect API is not actually behind this path right now" rather than "that pool does not exist".
- This tends to surface after some time, once a proxy/port mapping has recycled or the server has restarted and reset state.

Most common causes on self-hosted behind a proxy
- Proxy path/prefix mismatch: the external URL must map `/api` to the Prefect API. If the proxy is already adding `/api`, setting `PREFECT_API_URL` to `.../api` will produce `/api/api/...` upstream and a 404. If the proxy does not add `/api`, `PREFECT_API_URL` must include it.
- RunPod proxy idle/rotation: idle connections or port routes can be recycled; the next call hits a target that is not your Prefect API, returning 404.
- Server restart with a non-persistent DB: if your Prefect Server (SQLite) restarts without a persistent volume, your work pool is gone. The worker tries to re-create it, but if the proxy is not routing correctly at that moment, you'll see a 404 on the route.
- API URL misconfiguration or version mismatch: clients pointing at the UI or the wrong port/path will sometimes work until a restart, then 404 on API routes.

What to check

1) Verify `PREFECT_API_URL` on the worker
- It should point to your external API root, typically `…proxy.runpod.net/api`
- Show config:
```shell
prefect config view --show-sources --show-defaults
```
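To see the path-prefix pitfall concretely, here is a throwaway helper (my own illustration, not part of Prefect) that collapses a duplicated trailing `/api` so you can compare what you exported against what the proxy actually needs:

```python
# Hypothetical helper, not part of Prefect: normalize a configured API URL so
# it ends with exactly one /api segment, exposing the /api/api double-prefix bug.
def normalize_api_url(url: str) -> str:
    url = url.rstrip("/")
    # Drop any repeated trailing /api segments...
    while url.endswith("/api"):
        url = url[: -len("/api")]
    # ...then append exactly one.
    return url + "/api"

print(normalize_api_url("https://mypod-4200.proxy.runpod.net/api/api"))
# -> https://mypod-4200.proxy.runpod.net/api
```

If the normalized URL differs from the one you exported, the proxy is likely adding the `/api` prefix for you.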
2) Curl the API while a run is in progress
- From the worker machine:
```shell
curl -i https://<your>-4200.proxy.runpod.net/api/work_pools/
```
You should get a 200 with JSON. If it's 404, the proxy is not sending `/api` to the Prefect API.

3) Persist the server DB
- If you're running Prefect Server with SQLite, ensure the DB path is on a persistent volume so a server restart does not drop your work pool.

4) Keep the proxy connection warm
- Lower the worker sync interval to generate more frequent backend calls:
```shell
export PREFECT_WORKER_QUERY_SECONDS=10
```
This can mitigate proxy idle timeouts.

5) Add a healthcheck to observe worker health during long runs
```shell
prefect worker start --pool <pool-name> --type process --with-healthcheck
```
Then hit the local `/health` endpoint to see if the worker goes unhealthy while your task runs.

6) Upgrade to the latest 3.x
- We continually improve network resiliency. Make sure both server and worker are on a current 3.x patch.

If you can share a bit more, I can be more specific:
- Prefect version on both server and worker
- The exact value of `PREFECT_API_URL`
- How your Prefect Server is deployed on RunPod (Docker? bare process?) and whether the DB is persisted
- Any reverse proxy config in front (beyond the RunPod port proxy)

Related docs
- Worker concepts: https://docs-3.prefect.io/v3/concepts/workers
- Worker healthchecks: https://docs-3.prefect.io/v3/advanced/worker-healthchecks
- Server concepts: https://docs-3.prefect.io/v3/concepts/server
- REST API (work pools): https://docs-3.prefect.io/v3/api-ref/rest-api/server/work-pools/read-work-pools

Quick "known good" baseline
- Set and verify:
```shell
export PREFECT_API_URL=https://<your>-4200.proxy.runpod.net/api
export PREFECT_LOGGING_LEVEL=DEBUG
export PREFECT_WORKER_QUERY_SECONDS=10
prefect config view --show-sources
```
- Start a process worker (auto-creates the pool if needed):
```shell
prefect worker start --pool <pool-name> --type process --with-healthcheck
```
If you still see a 404 on /api/work_pools/ after that, it’s almost certainly the RunPod proxy mapping or a server restart wiping state.
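The checks above boil down to a small decision table. As a hypothetical triage helper (my own summary of the reasoning, not anything in Prefect):

```python
# Hypothetical helper: map the status code from
# GET $PREFECT_API_URL/work_pools/ to the likely cause discussed above.
def diagnose_work_pools_status(status: int) -> str:
    if status == 200:
        return "API reachable; routing and the work pools endpoint are fine"
    if status == 404:
        return "proxy not forwarding /api to the Prefect API (path mismatch or remapped route)"
    if status in (502, 503, 504):
        return "server down or restarting behind the proxy"
    return f"unexpected status {status}; check PREFECT_API_URL and server logs"

print(diagnose_work_pools_status(404))
```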
s
@Marvin Assuming it's the RunPod proxy mapping thing, is there a workaround in Prefect to prevent it from crashing when that happens?
m
thought for 113.6 seconds
Short answer: there isn't a single "don't crash on proxy flip" toggle in Prefect today. A 404 at the API root is treated as fatal and the worker exits. But you can make the worker resilient enough that transient RunPod proxy/path flips don't take you down.

Recommended mitigations you can apply in Prefect 3.x:

- Bypass the external proxy if possible
  - If your worker and Prefect Server run on the same RunPod, point the worker at the internal API directly instead of the public proxy:
```shell
export PREFECT_API_URL=http://localhost:4200/api
```
  or use the Docker network service name if using Compose (e.g., `http://prefect-server:4200/api`). This removes the proxy from the path.

- Make the client auto-retry 404s and gateway errors
  - Prefect's client can retry on additional status codes. Add 404 (and the typical gateway errors) and raise the retry limit:
```shell
export PREFECT_CLIENT_MAX_RETRIES=30
export PREFECT_CLIENT_RETRY_EXTRA_CODES=404,502,503,504
export PREFECT_CLIENT_RETRY_JITTER_FACTOR=0.2
```
  Then restart the worker. This usually rides out brief proxy remaps.

- Keep the connection warm to avoid proxy idle timeouts
  - Make the worker touch the API more often:
```shell
export PREFECT_WORKER_QUERY_SECONDS=10
```
  This reduces the chance the proxy closes the idle route.

- Use healthchecks and an automatic restart policy
  - Start the worker with a health endpoint and run it under a supervisor that restarts on exit:
```shell
prefect worker start --pool <pool-name> --type process --with-healthcheck
```
  - Examples:
    - Docker: `--restart unless-stopped`
    - systemd: `Restart=always`
    - Simple loop:

```shell
while true; do prefect worker start --pool <pool-name> --type process; sleep 2; done
```
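For the systemd route, a minimal unit sketch (file path, binary location, and pool name are assumptions to adapt to your setup):

```ini
# /etc/systemd/system/prefect-worker.service -- sketch; adjust paths and pool name
[Unit]
Description=Prefect process worker
After=network-online.target

[Service]
ExecStart=/usr/local/bin/prefect worker start --pool <pool-name> --type process --with-healthcheck
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```

Then `systemctl daemon-reload && systemctl enable --now prefect-worker` picks it up.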
- Persist your server state
  - If the server restarts without a persistent DB, your work pool disappears. Mount the SQLite DB (or use Postgres) so a restart doesn't wipe pools.

If you can't bypass the proxy, the retry + keepalive combo is the most "in-Prefect" workaround to prevent transient 404s from taking down the worker.

References:
- Workers: https://docs-3.prefect.io/v3/concepts/workers
- Worker healthchecks: https://docs-3.prefect.io/v3/advanced/worker-healthchecks
- Server: https://docs-3.prefect.io/v3/concepts/server

If you share your current PREFECT_API_URL and how the server is exposed on RunPod, I can suggest the safest exact values for your setup.
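To illustrate what those retry settings buy you, here is a rough sketch of retry-with-backoff behavior (my own illustration, not Prefect's actual client code; the parameter names merely mirror the env vars above):

```python
import random
import time

# Sketch: re-issue a request while its status is in extra_codes, up to
# max_retries times, sleeping with exponential backoff plus jitter between
# attempts -- roughly what the PREFECT_CLIENT_* settings enable.
def send_with_retries(request, max_retries=30,
                      extra_codes=(404, 502, 503, 504),
                      base_delay=0.1, jitter_factor=0.2):
    for attempt in range(max_retries):
        status = request()
        if status not in extra_codes:
            return status
        delay = base_delay * (2 ** attempt)
        time.sleep(delay * (1 + random.uniform(-jitter_factor, jitter_factor)))
    return request()  # final attempt, returned as-is

# Simulate a proxy that 404s twice before routing correctly again.
responses = iter([404, 404, 200])
print(send_with_retries(lambda: next(responses), base_delay=0.01))  # -> 200
```

The worker survives as long as the proxy flip resolves within the retry budget; a permanent remap still exhausts the retries and surfaces the error.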
s
@Marvin If I do this: `export PREFECT_API_URL=http://localhost:4200/api`, then I can't use the UI from outside the RunPod network (i.e. from my home PC). Is there a way to make the API accessible both via the proxy and via localhost, but force the worker pool to connect via localhost?
m
thought for 112.0 seconds
Yes. `PREFECT_API_URL` is a client-side setting, so you can point your worker at localhost while keeping the UI/API exposed via the RunPod proxy for your browser. The trick is to use a separate Prefect profile (or just per-process env vars) for the worker.

Option A: use a dedicated profile for the worker
- Create and set a profile that uses localhost for the API:
```shell
prefect profile create worker-local
prefect --profile worker-local config set PREFECT_API_URL=http://localhost:4200/api
```
Optional resilience tweaks:
```shell
prefect --profile worker-local config set PREFECT_CLIENT_MAX_RETRIES=30
prefect --profile worker-local config set PREFECT_CLIENT_RETRY_EXTRA_CODES=404,502,503,504
prefect --profile worker-local config set PREFECT_WORKER_QUERY_SECONDS=10
```
- Start the worker with that profile:
```shell
prefect --profile worker-local worker start --pool <pool-name> --type process --with-healthcheck
```
Option B: set env vars only for the worker process
- In the shell or service that launches the worker:
```shell
export PREFECT_API_URL=http://localhost:4200/api
export PREFECT_CLIENT_MAX_RETRIES=30
export PREFECT_CLIENT_RETRY_EXTRA_CODES=404,502,503,504
export PREFECT_WORKER_QUERY_SECONDS=10

prefect worker start --pool <pool-name> --type process --with-healthcheck
```
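The same per-process scoping works from Python if you launch the worker from a supervisor script. This sketch (my own, nothing Prefect-specific) overrides `PREFECT_API_URL` only for the child process:

```python
import os
import subprocess
import sys

# Launch a command with PREFECT_API_URL overridden only for that process,
# leaving the parent environment (and your UI/proxy config) untouched.
def run_with_local_api(cmd, **kwargs):
    env = os.environ.copy()
    env["PREFECT_API_URL"] = "http://localhost:4200/api"  # worker-only override
    return subprocess.run(cmd, env=env, **kwargs)

# Demo with a stand-in command; for a real worker you would pass
# ["prefect", "worker", "start", "--pool", "<pool-name>", "--type", "process"].
proc = run_with_local_api(
    [sys.executable, "-c", "import os; print(os.environ['PREFECT_API_URL'])"],
    capture_output=True, text=True,
)
print(proc.stdout.strip())  # -> http://localhost:4200/api
```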
- Everywhere else (your laptop, other machines), continue using the proxy URL (e.g., `https://<your-runpod-proxy>/api`) for `PREFECT_API_URL`, or just access the UI via the proxy in your browser.

Notes
- Prefect does not support multiple fallback API URLs in a single process; per-process profiles/env is the supported way.
- Ensure your Prefect Server listens on `0.0.0.0:4200` inside RunPod so it's reachable by both the proxy and localhost on the pod.
- If you're using Docker Compose, set these environment variables only on the worker service, while keeping the server/UI service configured for the proxy.

Docs for reference:
- Workers: https://docs-3.prefect.io/v3/concepts/workers
- Worker healthchecks: https://docs-3.prefect.io/v3/advanced/worker-healthchecks
- Server: https://docs-3.prefect.io/v3/concepts/server