@Marvin I get 500 Internal Server error fro...
# ask-marvin
@Marvin I get a 500 Internal Server Error from workers and they can't connect to the server work pool. I use 2 server replicas and Redis; on the server I have the following error:
thought for 152.5 seconds
```
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/prefect/server/models/deployments.py", line 1110, in mark_deployments_ready
    await session.execute(
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/ext/asyncio/session.py", line 463, in execute
    result = await greenlet_spawn(
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 203, in greenlet_spawn
    result = context.switch(value)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2365, in execute
    return self._execute_internal(
```
Thanks for the report: it sounds like your workers are getting 500s from the Prefect API, and your Redis instance is logging an error. With two API replicas, a few deployment/config issues can surface. Can you share:
- The exact Redis error message/log snippet
- A stack trace or error line from the Prefect API pods around the time of the 500
- Your deployment details: Prefect version, Helm chart version (if using Helm), values for Postgres/Redis (bundled vs external), Ingress/LoadBalancer, and the workers' environment (where they run, image tag)
- The value of `PREFECT_API_URL` in your worker environment

In the meantime, here's a focused checklist to isolate the problem.

Quick isolation steps
- Scale to one API replica and test: if the 500s disappear, it points to multi-replica configuration (shared DB, load balancer, or Redis).
- Hit the health endpoint on each replica. From inside the cluster:
```
kubectl port-forward svc/<your-api-service> 4200:4200
curl -i http://localhost:4200/api/health
```
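Note that port-forwarding the Service picks a single backing pod, so to check each replica explicitly you can port-forward the pods directly. A minimal sketch; the label selector is the one used later in this thread and the pod name is a placeholder:
```
# list the API pods, then target one replica at a time
kubectl get pods -l app.kubernetes.io/name=prefect-server
kubectl port-forward pod/<api-pod-name> 4200:4200
curl -i http://localhost:4200/api/health
```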
You should get 200 OK. If any replica fails, grab its pod logs.
- Verify workers' API URL: ensure `PREFECT_API_URL` points to the correct address your workers can reach (scheme/host/port), e.g. `https://prefect.example.com/api` or `http://prefect-api.default.svc.cluster.local:4200/api`. A mismatched or unreachable URL can lead to intermittent errors masked as 500s when going through an Ingress.

Common multi-replica pitfalls to check
- Shared database for all API replicas: all replicas must use the same Postgres. If any replica is using a local/ephemeral DB (e.g., SQLite) or a different Postgres, you will see inconsistent state and 500s. Check API logs for database errors like connection failures, pool exhaustion, or migration errors.
- Database connection limits: with more replicas/workers you may exhaust Postgres connections. Symptoms: 500s with "too many connections" or "OperationalError". Increase Postgres `max_connections` and/or tune the API DB pool, and verify migrations have run on startup without error (see the psql check at the end of this message).
- Ingress/load balancer config: sticky sessions are generally not required for API calls, but make sure path routing is correct and not rewriting `/api` unexpectedly, idle timeouts aren't too low (long polling from workers can be cut off), and TLS termination is correct; avoid HTTP→HTTPS redirect loops.
- Version mismatch: ensure your API image tag and worker image tag are the same Prefect version. Mismatches can lead to unexpected 500s due to schema/API changes.

Redis-specific checks
- Confirm the Redis connection details your API uses and that both API replicas can reach Redis on the same host/port with the correct authentication/TLS.
- Check for common Redis errors: authentication failures (WRONGPASS), connection timeouts/refused connections, and memory pressure (evictions) or the maxmemory policy kicking in.
- From an API pod, exec in and test connectivity:
```
kubectl exec -it deploy/<your-api-deployment> -- sh
# If redis-cli is available
redis-cli -h <redis-host> -p <port> ping
```
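If redis-cli isn't installed in the API image, a plain Python socket check is enough to prove TCP reachability from the pod. A minimal sketch; the host and port are placeholders and nothing here is Prefect-specific:
```
# run inside the API pod; succeeds only if the Redis host/port is reachable
python -c "import socket; socket.create_connection(('<redis-host>', 6379), timeout=5); print('reachable')"
```
This only tests network reachability, not authentication or TLS.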
- If you're using the Prefect Helm chart's Redis, verify its values and service name and that it's not being scheduled with insufficient resources.

How to gather the right logs quickly
- API pods (including both replicas):
```
kubectl get pods -l app.kubernetes.io/name=prefect-server
kubectl logs <api-pod-name>
```
- Redis:
```
kubectl logs <redis-pod-name>
```
- Worker logs around the time of the 500s.

If you can paste the Redis error and one API stack trace, I can pinpoint the root cause much faster.

References
- Self-hosted overview: Self-hosted Prefect (3.x)
- Helm deployment guide: Prefect Server Helm
- Health check endpoint: /api/health docs
- Redis integration note: prefect-redis (if you're using Redis with Prefect components)

Assuming you're on Prefect 3.x; if not, let me know and I'll adjust guidance.
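Following up on the connection-limits item above, here is a quick way to compare current connection usage against the Postgres ceiling. A sketch only; the connection string is a placeholder for whatever your API pods use:
```
# how many connections exist now vs. the configured ceiling
psql "postgresql://<user>:<password>@<postgres-host>:5432/<db>" \
  -c "SHOW max_connections;" \
  -c "SELECT count(*) AS current_connections FROM pg_stat_activity;"
```
If current_connections is near max_connections, raise the limit or shrink the pool size per API replica.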
@Marvin on worker I have:
```
  File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 361, in send
    response.raise_for_status()
  File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 162, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://prefect-server.prefect.svc.cluster.local:4200/api/work_pools/'
Response: {'exception_message': 'Internal Server Error'}
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
```
thought for 124.7 seconds
Thanks for the details; the URL in your traceback is a strong clue.

Your worker is calling: http://prefect-server.prefect.svc.cluster.local:4200/api/work_pools/
In Prefect 3.x, the endpoint is hyphenated: http://…/api/work-pools

Underscore vs hyphen is the giveaway. This usually means you have a version mismatch: a Prefect 2.x worker (underscore endpoints) talking to a Prefect 3.x server (hyphen endpoints). That mismatch will produce 404/500s on core endpoints like work pools.

What to do next
1) Confirm versions on both sides
- In the worker container:
```
prefect version
python -c "import prefect; print(prefect.__version__)"
```
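For the server side, one quick way to read the image tag that is actually deployed; the deployment name is a placeholder for your release:
```
# prints the container image(s) of the API deployment
kubectl get deploy <your-api-deployment> -o jsonpath='{.spec.template.spec.containers[*].image}'
```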
- For the server, check the image tag you deployed (Helm values, the deployment manifest, or the kubectl query above) and/or the API pod logs on startup (it prints the Prefect version).

2) Align versions (recommended: 3.x everywhere)
- Ensure your server and all workers use the same major version (ideally the same minor/patch too).
- If you're using Helm, set the server image tag to a 3.x release and rebuild/redeploy workers on the same 3.x version.
- If you must remain on 2.x for now, make sure the server is also 2.x and point workers at that 2.x API.

3) Quick validation after alignment
- Health check:
```
curl -i http://prefect-server.prefect.svc.cluster.local:4200/api/health
```
- Work pools (3.x):
```
curl -i http://prefect-server.prefect.svc.cluster.local:4200/api/work-pools
```
If you still see underscores in any requests after upgrading, the worker did not pick up the new image/lib.

Docs for reference
- Work pools API (3.x): /api/work-pools endpoints

About Redis
The Redis error you mentioned could be a separate issue or a side-effect of the API failing. Once versions are aligned, if you still see Redis errors, please paste the exact Redis log line(s) and I'll help triage. Also share:
- Prefect versions (server and worker)
- Where workers run (image tag/requirements)
- The value of `PREFECT_API_URL` used by workers

My bet is the version mismatch is the root cause of the 500s you're seeing on the work pool calls.
@Marvin when we disabled Redis, the 500 errors from workers went away and workers started joining correctly
thought for 231.6 seconds
Got it, that's a very useful data point. If disabling Redis makes the 500s disappear and workers can join, it strongly suggests the API pods are erroring when they try to use Redis.

Likely root causes
- Connectivity/auth/TLS issues to Redis: API replicas can't reach Redis, wrong password, or TLS mismatch (redis:// vs rediss://).
- Integration not installed: your server image may not have the prefect-redis integration installed even though it's configured to use Redis, leading to an ImportError at request time.
- Misconfigured feature toggle: if you enabled a Redis-backed feature (e.g., distributed locks/caching/rate limits), the server may touch Redis on most requests, causing unrelated endpoints (like work pools) to 500 when Redis fails.
- Multi-replica inconsistency: replicas pointing at different Redis URLs or env settings can lead to intermittent 500s via the load balancer.

How to pinpoint and fix
1) Capture the server-side error
- Re-enable Redis and set API logging to DEBUG so we can see the stack trace. If using Helm, add an env var to the API pods like:
```
PREFECT_LOGGING_LEVEL=DEBUG
```
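For a quick test without editing Helm values, you can also patch the env var directly on the deployment (Helm will revert this on the next upgrade). The deployment name is a placeholder:
```
# set DEBUG logging on the API pods and wait for the rollout to finish
kubectl set env deployment/<your-api-deployment> PREFECT_LOGGING_LEVEL=DEBUG
kubectl rollout status deployment/<your-api-deployment>
```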
- Then watch the API pod logs around a worker join attempt. Please paste the stack trace you see.

2) Verify Redis from inside an API pod
- Exec into an API pod and test connectivity:
```
# If redis-cli available:
redis-cli -h <host> -p <port> ping

# Or from Python:
python - <<'PY'
import os, redis
r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://<host>:<port>"))
print(r.ping())
PY
```
- Check that:
  - The URL scheme matches your server (redis:// for plaintext, rediss:// for TLS).
  - Auth is included if required (password/user).
  - Network policies and service names resolve from the API namespace.

3) Ensure the integration is installed in the server image
- If you configured Prefect to use Redis, the API image needs the integration:
```
pip install prefect-redis
```
- Rebuild/redeploy your Prefect Server image with this dependency if it's missing.

4) Make configuration consistent across replicas
- Confirm both API replicas have identical Redis-related settings (same URL/creds/TLS).
- If you're using a Secret for the URL, make sure it's mounted in both replicas.
- Scale to 1 replica, test, then scale back to 2 to isolate LB/replica issues.

5) Sanity checks
- Hit the health endpoint while Redis is enabled:
```
curl -i http://prefect-server.prefect.svc.cluster.local:4200/api/health
```
If that 500s, the server likely fails on a global dependency init that touches Redis.
- Verify server/worker Prefect versions match; mismatches can surface as confusing errors.

If Redis isn't required for your use case, it's perfectly fine to run Prefect 3.x without it. If you do want Redis-backed features, share:
- The Redis error lines from your Redis pod and the API pod stack trace with Redis enabled
- How you wired Redis into the server (Helm values/envs or code)
- Whether your server image includes the prefect-redis package

I can then point to the exact fix.

References:
- Self-hosted overview: Self-hosted Prefect (3.x)
- Health check endpoint: /api/health
- Redis integration: prefect-redis
@Marvin here is what we added/disabled:
```
# - name: PREFECT_MESSAGING_BROKER
#   value: "prefect_redis.messaging"
# - name: PREFECT_MESSAGING_CACHE
#   value: "prefect_redis.messaging"
# - name: PREFECT_REDIS_MESSAGING_HOST
#   value: "valkey"
# - name: PREFECT_REDIS_MESSAGING_PORT
#   value: "6379"
# - name: PREFECT_REDIS_MESSAGING_DB
#   value: "0"
# - name: PREFECT_SERVER_EVENTS_CAUSAL_ORDERING
#   value: "prefect_redis.ordering"
```
(edited)
thought for 253.9 seconds
Thanks, this helps a lot. A few things stand out in what you toggled.

What looks right
- PREFECT_MESSAGING_BROKER=prefect_redis.messaging
- PREFECT_MESSAGING_CACHE=prefect_redis.messaging
- PREFECT_SERVER_EVENTS_CAUSAL_ORDERING=prefect_redis.ordering

These are the correct module paths to switch Prefect's messaging/cache and event ordering to the Redis-backed implementations from prefect-redis.

What likely caused the 500s
1) Missing integration in the server image
- If your Prefect Server image does not include the redis integration, importing prefect_redis.* will fail and can surface as 500s on otherwise unrelated endpoints.
- Fix: ensure your API image installs the integration:
```
pip install "prefect[redis]"
# or
pip install prefect-redis
```
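To check whether the image currently running already has it, an import check of the exact module paths your env vars reference is enough. A sketch; the deployment name is a placeholder:
```
# exits non-zero with an ImportError if prefect-redis is missing from the image
kubectl exec deploy/<your-api-deployment> -- \
  python -c "import prefect_redis.messaging, prefect_redis.ordering; print('prefect-redis import OK')"
```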
Then restart the API pods.

2) Redis/Valkey connection settings
- Your HOST value snippet looks like it included a Slack-formatted link: `"<http://stg-prefect-...amazonaws.com|v>alkey"`. In Kubernetes envs, this must be just the hostname, with no scheme and no Slack markup. For example:
```
PREFECT_REDIS_MESSAGING_HOST=stg-prefect-valkey-pnjrp2.serverless.use2.cache.amazonaws.com
```
- TLS: AWS ElastiCache Serverless for Redis/Valkey typically requires TLS, so you must enable SSL in the client. With prefect-redis, that means setting SSL true. Assuming the settings map to env vars:
```
PREFECT_REDIS_MESSAGING_SSL=true
```
- If auth is enabled on your Valkey/ElastiCache, set:
```
PREFECT_REDIS_MESSAGING_USERNAME=<user-if-required>
PREFECT_REDIS_MESSAGING_PASSWORD=<token-or-password>
```
- Port/DB look fine:
```
PREFECT_REDIS_MESSAGING_PORT=6379
PREFECT_REDIS_MESSAGING_DB=0
```
- Networking: ensure the API pods can resolve and reach the Valkey endpoint (VPC/network policy, DNS).

Recommended minimal config (Kubernetes env vars) to test
```
- name: PREFECT_MESSAGING_BROKER
  value: "prefect_redis.messaging"
- name: PREFECT_MESSAGING_CACHE
  value: "prefect_redis.messaging"
- name: PREFECT_SERVER_EVENTS_CAUSAL_ORDERING
  value: "prefect_redis.ordering"

- name: PREFECT_REDIS_MESSAGING_HOST
  value: "stg-prefect-valkey-pnjrp2.serverless.use2.cache.amazonaws.com"
- name: PREFECT_REDIS_MESSAGING_PORT
  value: "6379"
- name: PREFECT_REDIS_MESSAGING_DB
  value: "0"
- name: PREFECT_REDIS_MESSAGING_SSL
  value: "true"
# Uncomment if your Valkey/ElastiCache requires auth
# - name: PREFECT_REDIS_MESSAGING_USERNAME
#   valueFrom: { secretKeyRef: { name: redis-creds, key: username } }
# - name: PREFECT_REDIS_MESSAGING_PASSWORD
#   valueFrom: { secretKeyRef: { name: redis-creds, key: password } }
```
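After redeploying, it's worth confirming the settings actually landed in both replicas, since inconsistent env across replicas is one of the failure modes above. A sketch; the label selector matches the one used earlier in this thread and may differ in your chart:
```
# dump the Redis/messaging-related env vars from every API pod
for p in $(kubectl get pods -l app.kubernetes.io/name=prefect-server -o name); do
  echo "== $p"
  kubectl exec "$p" -- env | grep -E '^PREFECT_(MESSAGING|REDIS_MESSAGING|SERVER_EVENTS)'
done
```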
Connectivity test from an API pod
```
kubectl exec -it deploy/<your-api-deployment> -- sh -lc '
python - <<PY
import os
from prefect_redis.client import get_async_redis_client
import anyio

async def main():
    r = await get_async_redis_client(
        host=os.getenv("PREFECT_REDIS_MESSAGING_HOST"),
        port=int(os.getenv("PREFECT_REDIS_MESSAGING_PORT", "6379")),
        db=int(os.getenv("PREFECT_REDIS_MESSAGING_DB", "0")),
        username=os.getenv("PREFECT_REDIS_MESSAGING_USERNAME"),
        password=os.getenv("PREFECT_REDIS_MESSAGING_PASSWORD"),
        ssl=os.getenv("PREFECT_REDIS_MESSAGING_SSL", "false").lower() == "true",
        health_check_interval=30,
    )
    print(await r.ping())
    await r.close()
anyio.run(main)
PY
'
```
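If that helper isn't available in your prefect-redis version, plain redis-py (a dependency of prefect-redis) can run the same check against the same env vars. A sketch, not the integration's own API:
```
# run inside an API pod; prints True if Redis/Valkey answers PING
python - <<'PY'
import os
import redis  # redis-py, installed alongside prefect-redis

client = redis.Redis(
    host=os.getenv("PREFECT_REDIS_MESSAGING_HOST"),
    port=int(os.getenv("PREFECT_REDIS_MESSAGING_PORT", "6379")),
    db=int(os.getenv("PREFECT_REDIS_MESSAGING_DB", "0")),
    username=os.getenv("PREFECT_REDIS_MESSAGING_USERNAME"),
    password=os.getenv("PREFECT_REDIS_MESSAGING_PASSWORD"),
    ssl=os.getenv("PREFECT_REDIS_MESSAGING_SSL", "false").lower() == "true",
)
print(client.ping())
PY
```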
You should see `True`. If this fails, the same failure is likely causing the 500s.

Staged rollout suggestion
- First enable broker/cache only; keep ordering in memory:
```
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
# Do NOT set PREFECT_SERVER_EVENTS_CAUSAL_ORDERING yet
```
Verify workers can join and API endpoints work.
- Then enable PREFECT_SERVER_EVENTS_CAUSAL_ORDERING=prefect_redis.ordering and retest.

This helps isolate which Redis-backed component is failing.

Also, please double-check versions
- Run `prefect version` in the worker container and confirm the API image tag. Both should be the same Prefect major (ideally same minor/patch too).
- Earlier your stack trace showed an underscore endpoint (`/api/work_pools/`), which is 2.x style; 3.x uses hyphens (`/api/work-pools`). If the Redis fix makes things work, great, but let's still confirm versions to avoid a later surprise.

If you can re-enable Redis and share:
- The API pod stack trace when the 500 occurs (set `PREFECT_LOGGING_LEVEL=DEBUG`)
- Whether your server image has `prefect-redis` installed
- Whether Valkey requires auth on your endpoint

…I can zero in on the exact fix.

References
- Settings reference: Prefect 3 settings
- Redis integration package: prefect-redis