# marvin-ai
@Marvin prefect often returns 503 service unavailable, up to 30% of flow runs
thought for 541.6 seconds
Thanks for flagging this. A 30% 503 rate is not expected, so let's triage and mitigate.

A few quick questions to narrow it down:
- Are you on Prefect Cloud or self-hosted Prefect Server?
- What Prefect version and worker type (e.g., Kubernetes, Docker, Process)?
- When do you see 503s (creating runs, setting state, sending logs, listing runs)?
- Roughly how many concurrent flow runs, and how chatty are your logs?
- Any proxy/NAT/load balancer between your workers and the Prefect API?

Immediate mitigations you can try:
- Increase client retries and timeouts (helps smooth transient 5xx):
  - `PREFECT_CLIENT_MAX_RETRIES=10`
  - `PREFECT_CLIENT_RETRY_EXTRA_CODES=502,503,504` (ensures these 5xx codes are retried)
  - `PREFECT_API_REQUEST_TIMEOUT=120`
  - Optional: `PREFECT_API_ENABLE_HTTP2=true` (can reduce connection overhead)
- Reduce API chatter from logs:
  - Temporarily disable to test: `PREFECT_LOGGING_TO_API_ENABLED=false`
  - Or batch more conservatively:
    - `PREFECT_LOGGING_TO_API_BATCH_INTERVAL=5`
    - `PREFECT_LOGGING_TO_API_BATCH_SIZE=1000000`
    - `PREFECT_LOGGING_TO_API_MAX_LOG_SIZE=200000`
- Smooth concurrency spikes:
  - Cap pool concurrency so runs don't all start at once: `prefect work-pool set-concurrency-limit "<pool>" N`
  - Stagger large backfills or bursts of scheduled runs.
- Check for network intermediaries:
  - Corporate proxies/NATs terminating connections or aggressive idle timeouts can surface as 503s; try keeping connections short-lived or enabling HTTP/2, and verify the proxy logs (see the probe sketch below for a quick check).
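To check that last point from where it matters, you can probe the API from the same network your workers use. Here's a minimal sketch using the async client's `api_healthcheck()` helper; the `probe` function name and its parameters are just illustrative, and Cloud's health endpoint behavior may differ slightly from a self-hosted server:

```python
import asyncio
import time

from prefect import get_client


async def probe(attempts: int = 50, pause: float = 2.0) -> None:
    """Hit the Prefect API repeatedly and report how many probes fail."""
    failures = 0
    async with get_client() as client:
        for _ in range(attempts):
            # api_healthcheck() returns the exception it hit, or None on success
            exc = await client.api_healthcheck()
            if exc is not None:
                failures += 1
                print(f"{time.strftime('%H:%M:%S')} healthcheck failed: {exc!r}")
            await asyncio.sleep(pause)
    print(f"{failures}/{attempts} probes failed")


if __name__ == "__main__":
    asyncio.run(probe())
```

Run it on the worker host (or inside a pod in the same namespace) so it traverses the same proxies and load balancers your flow runs do; a failure rate anywhere near 30% there points at the network path or the API itself rather than client-side load.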
How to apply settings
- As environment variables (recommended for workers/infra):

```bash
export PREFECT_CLIENT_MAX_RETRIES=10
export PREFECT_CLIENT_RETRY_EXTRA_CODES=502,503,504
export PREFECT_API_REQUEST_TIMEOUT=120
export PREFECT_API_ENABLE_HTTP2=true
export PREFECT_LOGGING_TO_API_ENABLED=false
```
- Or in code for a single run:
```python
from prefect.settings import (
    temporary_settings,
    PREFECT_CLIENT_MAX_RETRIES,
    PREFECT_CLIENT_RETRY_EXTRA_CODES,
    PREFECT_API_REQUEST_TIMEOUT,
    PREFECT_API_ENABLE_HTTP2,
    PREFECT_LOGGING_TO_API_ENABLED,
)

with temporary_settings({
    PREFECT_CLIENT_MAX_RETRIES: 10,
    PREFECT_CLIENT_RETRY_EXTRA_CODES: {502, 503, 504},
    PREFECT_API_REQUEST_TIMEOUT: 120.0,
    PREFECT_API_ENABLE_HTTP2: True,
    PREFECT_LOGGING_TO_API_ENABLED: False,
}):
    # run your flow
    ...
```
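One caveat: `temporary_settings` only affects the process it runs in, so for worker-launched infrastructure (Kubernetes jobs, Docker containers) the environment variables need to be set on that infrastructure. A quick way to confirm they actually propagated is to log the effective values from inside a flow run; here's a minimal sketch, where the `settings_check` flow is just an illustrative name:

```python
from prefect import flow, get_run_logger
from prefect.settings import (
    PREFECT_CLIENT_MAX_RETRIES,
    PREFECT_CLIENT_RETRY_EXTRA_CODES,
    PREFECT_API_REQUEST_TIMEOUT,
    PREFECT_API_ENABLE_HTTP2,
    PREFECT_LOGGING_TO_API_ENABLED,
)


@flow
def settings_check():
    # Log the effective client settings inside the run's own process to
    # confirm that env vars set on the worker/job infrastructure propagated.
    logger = get_run_logger()
    for setting in (
        PREFECT_CLIENT_MAX_RETRIES,
        PREFECT_CLIENT_RETRY_EXTRA_CODES,
        PREFECT_API_REQUEST_TIMEOUT,
        PREFECT_API_ENABLE_HTTP2,
        PREFECT_LOGGING_TO_API_ENABLED,
    ):
        logger.info("%s=%r", setting.name, setting.value())


if __name__ == "__main__":
    settings_check()
```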
Helpful references
- Prefect settings overview: Settings
- Logging to API and batching: Logging
- Work pools and concurrency: Work Pools
- Prefect Cloud status (in case there's an ongoing incident): Status Page

If you can share:
- A couple of failing flow run IDs and timestamps (the sketch below shows one way to pull these)
- Whether disabling log upload reduced 503s
- Your Prefect version and region (for Cloud)

…I can dig deeper and help pinpoint whether it's client-side load, network, or something we should escalate on our side.
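If it's useful for gathering those failing run IDs and timestamps, here's a minimal sketch with the async client; `recent_failures` is an illustrative name, and it assumes the `FlowRunSort` enum from `prefect.client.schemas.sorting` for ordering by start time:

```python
import asyncio

from prefect import get_client
from prefect.client.schemas.sorting import FlowRunSort


async def recent_failures(limit: int = 100) -> None:
    # Pull the most recent flow runs and print the ones that ended badly,
    # so their IDs and timestamps can be shared for deeper investigation.
    async with get_client() as client:
        flow_runs = await client.read_flow_runs(
            limit=limit, sort=FlowRunSort.START_TIME_DESC
        )
        for fr in flow_runs:
            if fr.state_type and fr.state_type.value in ("FAILED", "CRASHED"):
                print(fr.id, fr.name, fr.state_name, fr.start_time)


if __name__ == "__main__":
    asyncio.run(recent_failures())
```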