# marvin-ai
@Marvin prefect often returns 503 service unavailable, up to 30% of flow runs
thought for 541.6 seconds
Thanks for flagging this. A 30% 503 rate is not expected, so let's triage and mitigate.

A few quick questions to narrow it down:
- Are you on Prefect Cloud or self-hosted Prefect Server?
- What Prefect version and worker type (e.g., Kubernetes, Docker, Process)?
- When do you see 503s (creating runs, setting state, sending logs, listing runs)?
- Roughly how many concurrent flow runs, and how chatty are your logs?
- Any proxy/NAT/load balancer between your workers and the Prefect API?

Immediate mitigations you can try:
- Increase client retries and timeouts (helps smooth transient 5xx):
  - `PREFECT_CLIENT_MAX_RETRIES=10`
  - `PREFECT_CLIENT_RETRY_EXTRA_CODES=502,503,504` (ensures these 5xx codes are retried)
  - `PREFECT_API_REQUEST_TIMEOUT=120`
  - Optional: `PREFECT_API_ENABLE_HTTP2=true` (can reduce connection overhead)
- Reduce API chatter from logs:
  - Temporarily disable to test: `PREFECT_LOGGING_TO_API_ENABLED=false`
  - Or batch more conservatively:
    - `PREFECT_LOGGING_TO_API_BATCH_INTERVAL=5`
    - `PREFECT_LOGGING_TO_API_BATCH_SIZE=1000000`
    - `PREFECT_LOGGING_TO_API_MAX_LOG_SIZE=200000`
- Smooth concurrency spikes:
  - Cap pool concurrency so runs don't all start at once: `prefect work-pool set-concurrency-limit "<pool>" N`
  - Stagger large backfills or bursts of scheduled runs.
- Check for network intermediaries:
  - Corporate proxies/NATs terminating connections or aggressive idle timeouts can surface as 503s; try keeping connections short-lived or enabling HTTP/2, and verify the proxy logs (see the probe sketch below for a quick check).
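To check that last point from where it matters, you can probe the API from the same network your workers use. Here's a minimal sketch using the async client's `api_healthcheck()` helper; the `probe` function name and its parameters are just illustrative, and Cloud's health endpoint behavior may differ slightly from a self-hosted server:

```python
import asyncio
import time

from prefect import get_client


async def probe(attempts: int = 50, pause: float = 2.0) -> None:
    """Hit the Prefect API repeatedly and report how many probes fail."""
    failures = 0
    async with get_client() as client:
        for _ in range(attempts):
            # api_healthcheck() returns the exception it hit, or None on success
            exc = await client.api_healthcheck()
            if exc is not None:
                failures += 1
                print(f"{time.strftime('%H:%M:%S')} healthcheck failed: {exc!r}")
            await asyncio.sleep(pause)
    print(f"{failures}/{attempts} probes failed")


if __name__ == "__main__":
    asyncio.run(probe())
```

Run it on the worker host (or inside a pod in the same namespace) so it traverses the same proxies and load balancers your flow runs do; a failure rate anywhere near 30% there points at the network path or the API itself rather than client-side load.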
How to apply settings
- As environment variables (recommended for workers/infra):

```bash
export PREFECT_CLIENT_MAX_RETRIES=10
export PREFECT_CLIENT_RETRY_EXTRA_CODES=502,503,504
export PREFECT_API_REQUEST_TIMEOUT=120
export PREFECT_API_ENABLE_HTTP2=true
export PREFECT_LOGGING_TO_API_ENABLED=false
```
- Or in code for a single run:
```python
from prefect.settings import (
    temporary_settings,
    PREFECT_CLIENT_MAX_RETRIES,
    PREFECT_CLIENT_RETRY_EXTRA_CODES,
    PREFECT_API_REQUEST_TIMEOUT,
    PREFECT_API_ENABLE_HTTP2,
    PREFECT_LOGGING_TO_API_ENABLED,
)

with temporary_settings({
    PREFECT_CLIENT_MAX_RETRIES: 10,
    PREFECT_CLIENT_RETRY_EXTRA_CODES: {502, 503, 504},
    PREFECT_API_REQUEST_TIMEOUT: 120.0,
    PREFECT_API_ENABLE_HTTP2: True,
    PREFECT_LOGGING_TO_API_ENABLED: False,
}):
    # run your flow
    ...
```
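One caveat: `temporary_settings` only affects the process it runs in, so for worker-launched infrastructure (Kubernetes jobs, Docker containers) the environment variables need to be set on that infrastructure. A quick way to confirm they actually propagated is to log the effective values from inside a flow run; here's a minimal sketch, where the `settings_check` flow is just an illustrative name:

```python
from prefect import flow, get_run_logger
from prefect.settings import (
    PREFECT_CLIENT_MAX_RETRIES,
    PREFECT_CLIENT_RETRY_EXTRA_CODES,
    PREFECT_API_REQUEST_TIMEOUT,
    PREFECT_API_ENABLE_HTTP2,
    PREFECT_LOGGING_TO_API_ENABLED,
)


@flow
def settings_check():
    # Log the effective client settings inside the run's own process to
    # confirm that env vars set on the worker/job infrastructure propagated.
    logger = get_run_logger()
    for setting in (
        PREFECT_CLIENT_MAX_RETRIES,
        PREFECT_CLIENT_RETRY_EXTRA_CODES,
        PREFECT_API_REQUEST_TIMEOUT,
        PREFECT_API_ENABLE_HTTP2,
        PREFECT_LOGGING_TO_API_ENABLED,
    ):
        logger.info("%s=%r", setting.name, setting.value())


if __name__ == "__main__":
    settings_check()
```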
Helpful references
- Prefect settings overview: Settings
- Logging to API and batching: Logging
- Work pools and concurrency: Work Pools
- Prefect Cloud status (in case there's an ongoing incident): Status Page

If you can share:
- A couple of failing flow run IDs and timestamps (the sketch below shows one way to pull these)
- Whether disabling log upload reduced 503s
- Your Prefect version and region (for Cloud)

…I can dig deeper and help pinpoint whether it's client-side load, network, or something we should escalate on our side.
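If it's useful for gathering those failing run IDs and timestamps, here's a minimal sketch with the async client; `recent_failures` is an illustrative name, and it assumes the `FlowRunSort` enum from `prefect.client.schemas.sorting` for ordering by start time:

```python
import asyncio

from prefect import get_client
from prefect.client.schemas.sorting import FlowRunSort


async def recent_failures(limit: int = 100) -> None:
    # Pull the most recent flow runs and print the ones that ended badly,
    # so their IDs and timestamps can be shared for deeper investigation.
    async with get_client() as client:
        flow_runs = await client.read_flow_runs(
            limit=limit, sort=FlowRunSort.START_TIME_DESC
        )
        for fr in flow_runs:
            if fr.state_type and fr.state_type.value in ("FAILED", "CRASHED"):
                print(fr.id, fr.name, fr.state_name, fr.start_time)


if __name__ == "__main__":
    asyncio.run(recent_failures())
```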