Denys Y
09/02/2025, 10:22 AM
Marvin
09/02/2025, 10:22 AM
Denys Y
09/02/2025, 10:22 AM
Traceback (most recent call last):
File "/usr/local/lib/python3.12/site-packages/prefect/server/models/deployments.py", line 1110, in mark_deployments_ready
await session.execute(
File "/usr/local/lib/python3.12/site-packages/sqlalchemy/ext/asyncio/session.py", line 463, in execute
result = await greenlet_spawn(
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sqlalchemy/util/_concurrency_py3k.py", line 203, in greenlet_spawn
result = context.switch(value)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2365, in execute
return self._execute_internal(
Marvin
09/02/2025, 10:24 AM
First, double-check PREFECT_API_URL in your worker environment.
In the meantime, here’s a focused checklist to isolate the problem:
Quick isolation steps
- Scale to one API replica and test:
- If 500s disappear, it points to multi-replica configuration (shared DB, load balancer, or Redis).
- Hit the health endpoint on each replica:
- From inside the cluster:
kubectl port-forward svc/<your-api-service> 4200:4200
curl -i http://localhost:4200/api/health
You should get 200 OK. If any replica fails, grab its pod logs.
- Verify workers’ API URL:
- Ensure PREFECT_API_URL points to the correct address your workers can reach (scheme/host/port), e.g. https://prefect.example.com/api or http://prefect-api.default.svc.cluster.local:4200/api.
- A mismatched or unreachable URL can lead to intermittent errors masked as 500s when going through an Ingress.
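If it helps, here’s a quick way to check that from inside a worker pod; a minimal sketch (the deployment name is a placeholder, and it assumes curl is available in the worker image):
kubectl exec -it deploy/<your-worker-deployment> -- sh -lc '
  echo "PREFECT_API_URL=$PREFECT_API_URL"
  # should print an HTTP 200 status line if the API is reachable from the worker
  curl -si "$PREFECT_API_URL/health" | head -n 1
'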
Common multi-replica pitfalls to check
- Shared database for all API replicas:
- All replicas must use the same Postgres. If any replica is using a local/ephemeral DB (e.g., SQLite) or a different Postgres, you will see inconsistent state and 500s.
- Check API logs for database errors like connection failures, pool exhaustion, or migration errors.
- Database connection limits:
- With more replicas/workers you may exhaust Postgres connections. Symptoms: 500s with “too many connections” or “OperationalError”.
- Increase Postgres max_connections and/or tune the API DB pool (see the psql sketch after this list). Also verify migrations have run on startup without error.
- Ingress/load balancer config:
- For API calls, sticky sessions are generally not required, but make sure:
- Path routing is correct and not rewriting /api unexpectedly
- Idle timeouts aren’t too low (long polling from workers can be cut off)
- TLS termination is correct; avoid HTTP→HTTPS redirect loops
- Version mismatch:
- Ensure your API image tag and worker image tag are the same Prefect version. Mismatches can lead to unexpected 500s due to schema/API changes.
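For the connection-limit item above, a quick way to see how close you are to the ceiling is to query Postgres directly; a minimal sketch from any machine or pod that can reach the database (connection string is a placeholder):
# configured limit vs. connections currently open
psql "postgresql://<user>@<postgres-host>:5432/<prefect-db>" -c "SHOW max_connections;"
psql "postgresql://<user>@<postgres-host>:5432/<prefect-db>" -c "SELECT count(*) AS open_connections FROM pg_stat_activity;"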
Redis-specific checks
- Confirm the Redis connection details your API uses and that both API replicas can reach Redis on the same host/port with the correct authentication/TLS.
- Check for common Redis errors:
- Authentication failures (WRONGPASS)
- Connection timeouts/refused connections
- Memory pressure (evictions), or maxmemory policy kicking in
- From an API pod, exec in and test connectivity (if redis-cli isn’t baked into the image, see the throwaway-pod sketch after this list):
kubectl exec -it deploy/<your-api-deployment> -- sh
# If redis-cli is available
redis-cli -h <redis-host> -p <port> ping
- If you’re using the Prefect Helm chart’s Redis, verify its values and service name and that it’s not being scheduled with insufficient resources.
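If redis-cli isn’t available inside the API image, one alternative is a short-lived pod running the stock Redis image; a rough sketch (image tag and host are placeholders; add --tls and -a <password> if your Redis requires TLS or auth):
kubectl run redis-test --rm -it --restart=Never --image=redis:7 -- \
  redis-cli -h <redis-host> -p 6379 ping
# expected output: PONG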
How to gather the right logs quickly
- API pods (including both replicas):
kubectl get pods -l app.kubernetes.io/name=prefect-server
kubectl logs <api-pod-name>
- Redis:
kubectl logs <redis-pod-name>
- Worker logs around the time of the 500s.
If you can paste the Redis error and one API stack trace, I can pinpoint the root cause much faster.
References
- Self-hosted overview: Self-hosted Prefect (3.x)
- Helm deployment guide: Prefect Server Helm
- Health check endpoint: /api/health docs
Marvin
09/02/2025, 10:24 AM
Denys Y
09/02/2025, 10:26 AM
File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 361, in send
response.raise_for_status()
File "/usr/local/lib/python3.11/site-packages/prefect/client/base.py", line 162, in raise_for_status
raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Server error '500 Internal Server Error' for url 'http://prefect-server.prefect.svc.cluster.local:4200/api/work_pools/'
Response: {'exception_message': 'Internal Server Error'}
For more information check: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/500
Marvin
09/02/2025, 10:31 AM
Marvin
09/02/2025, 10:33 AM
1) Confirm the versions in use
- On a worker, run:
prefect version
python -c "import prefect; print(prefect.__version__)"
- For the server, check the image tag you deployed (Helm values or deployment manifest) and/or the API pod logs on startup (it prints the Prefect version).
2) Align versions (recommended: 3.x everywhere)
- Ensure your server and all workers use the same major version (ideally the same minor/patch too).
- If you’re using Helm, set the server image tag to a 3.x release and rebuild/redeploy workers on the same 3.x version.
- If you must remain on 2.x for now, make sure the server is also 2.x and point workers at that 2.x API.
3) Quick validation after alignment
- Health check:
curl -i http://prefect-server.prefect.svc.cluster.local:4200/api/health
- Work pools (3.x):
curl -i http://prefect-server.prefect.svc.cluster.local:4200/api/work-pools
If you still see underscores in any requests after upgrading, the worker did not pick up the new image/lib.
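If it’s useful to see which path style the running server actually responds to, you can hit both and compare status codes; a small sketch using the in-cluster hostname from the traceback above (run it from a pod that can reach the service):
# print just the HTTP status code returned for each path style
curl -s -o /dev/null -w "/api/work-pools/ -> %{http_code}\n" http://prefect-server.prefect.svc.cluster.local:4200/api/work-pools/
curl -s -o /dev/null -w "/api/work_pools/ -> %{http_code}\n" http://prefect-server.prefect.svc.cluster.local:4200/api/work_pools/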
Docs for reference
- Work pools API (3.x): /api/work-pools endpoints
About Redis
The Redis error you mentioned could be a separate issue or a side-effect of the API failing. Once versions are aligned, if you still see Redis errors, please paste the exact Redis log line(s) and I’ll help triage. Also share:
- Prefect versions (server and worker)
- Where workers run (image tag/requirements)
- The value of PREFECT_API_URL used by workers
My bet is that the version mismatch is the root cause of the 500s you’re seeing on the work pool calls.
Denys Y
09/02/2025, 2:12 PM
Marvin
09/02/2025, 2:12 PM
Marvin
09/02/2025, 2:16 PM
1) Enable debug logging on the API: set PREFECT_LOGGING_LEVEL=DEBUG
- Then watch the API pod logs around a worker join attempt. Please paste the stack trace you see.
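One way to flip that on without editing manifests is kubectl set env; a quick sketch (deployment name is a placeholder, and note this rolls the API pods):
kubectl set env deployment/<your-api-deployment> PREFECT_LOGGING_LEVEL=DEBUG
kubectl rollout status deployment/<your-api-deployment>
kubectl logs -f deployment/<your-api-deployment>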
2) Verify Redis from inside an API pod
- Exec into an API pod and test connectivity:
# If redis-cli available:
redis-cli -h <host> -p <port> ping
# Or from Python:
python - <<'PY'
import os, redis
r = redis.Redis.from_url(os.environ.get("REDIS_URL", "redis://<host>:<port>"))
print(r.ping())
PY
- Check that:
- The URL scheme matches your server (redis:// for plaintext, rediss:// for TLS).
- Auth is included if required (password/user).
- Network policies and service names resolve from the API namespace.
3) Ensure the integration is installed in the server image
- If you configured Prefect to use Redis, the API image needs the integration:
pip install prefect-redis
- Rebuild/redeploy your Prefect Server image with this dependency if it’s missing.
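A rough sketch of what that rebuild can look like, assuming you extend the official image (the base tag and registry are placeholders; adjust to the Prefect version you’re pinning):
cat > Dockerfile <<'EOF'
FROM prefecthq/prefect:3-latest
RUN pip install prefect-redis
EOF
docker build -t <your-registry>/prefect-server-redis:3 .
docker push <your-registry>/prefect-server-redis:3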
4) Make configuration consistent across replicas
- Confirm both API replicas have identical Redis-related settings (same URL/creds/TLS).
- If you’re using a Secret for the URL, make sure it’s mounted in both replicas.
- Scale to 1 replica, test, then scale back to 2 to isolate LB/replica issues.
5) Sanity checks
- Hit the health endpoint while Redis is enabled:
curl -i http://prefect-server.prefect.svc.cluster.local:4200/api/health
If that 500s, the server likely fails on a global dependency init that touches Redis.
- Verify server/worker Prefect versions match; mismatches can surface as confusing errors.
If Redis isn’t required for your use case, it’s perfectly fine to run Prefect 3.x without it. If you do want Redis-backed features, share:
- The Redis error lines from your Redis pod and the API pod stack trace with Redis enabled
- How you wired Redis into the server (Helm values/envs or code)
- Whether your server image includes the prefect-redis package
I can then point to the exact fix. References:
- Self-hosted overview: Self-hosted Prefect (3.x)
- Health check endpoint: /api/health
- Redis integration: prefect-redis
Denys Y
09/02/2025, 2:17 PM
Marvin
09/02/2025, 2:17 PM
Marvin
09/02/2025, 2:21 PM
1) Make sure the server image has the Redis integration installed:
pip install "prefect[redis]"
# or
pip install prefect-redis
Then restart the API pods.
2) Redis/Valkey connection settings
- Your HOST value snippet looks like it included a Slack-formatted link: "<http://stg-prefect-...amazonaws.com|v>alkey". In Kubernetes envs, this must be just the hostname, no scheme and no Slack markup. For example:
PREFECT_REDIS_MESSAGING_HOST=stg-prefect-valkey-pnjrp2.serverless.use2.cache.amazonaws.com
- TLS: AWS ElastiCache Serverless for Redis/Valkey typically requires TLS. You must enable SSL in the client.
- With prefect-redis, that means setting SSL true. Assuming the settings map to env vars:
PREFECT_REDIS_MESSAGING_SSL=true
- If auth is enabled on your Valkey/ElastiCache, set:
PREFECT_REDIS_MESSAGING_USERNAME=<user-if-required>
PREFECT_REDIS_MESSAGING_PASSWORD=<token-or-password>
- Port/DB look fine:
PREFECT_REDIS_MESSAGING_PORT=6379
PREFECT_REDIS_MESSAGING_DB=0
- Networking: ensure the API pods can resolve and reach the Valkey endpoint (VPC/network policy, DNS).
Recommended minimal config (Kubernetes env vars) to test
- name: PREFECT_MESSAGING_BROKER
value: "prefect_redis.messaging"
- name: PREFECT_MESSAGING_CACHE
value: "prefect_redis.messaging"
- name: PREFECT_SERVER_EVENTS_CAUSAL_ORDERING
value: "prefect_redis.ordering"
- name: PREFECT_REDIS_MESSAGING_HOST
value: "stg-prefect-valkey-pnjrp2.serverless.use2.cache.amazonaws.com"
- name: PREFECT_REDIS_MESSAGING_PORT
value: "6379"
- name: PREFECT_REDIS_MESSAGING_DB
value: "0"
- name: PREFECT_REDIS_MESSAGING_SSL
value: "true"
# Uncomment if your Valkey/ElastiCache requires auth
# - name: PREFECT_REDIS_MESSAGING_USERNAME
# valueFrom: { secretKeyRef: { name: redis-creds, key: username } }
# - name: PREFECT_REDIS_MESSAGING_PASSWORD
# valueFrom: { secretKeyRef: { name: redis-creds, key: password } }
Connectivity test from an API pod
kubectl exec -it deploy/<your-api-deployment> -- sh -lc '
python - <<PY
import os
from prefect_redis.client import get_async_redis_client
import anyio
async def main():
r = await get_async_redis_client(
host=os.getenv("PREFECT_REDIS_MESSAGING_HOST"),
port=int(os.getenv("PREFECT_REDIS_MESSAGING_PORT", "6379")),
db=int(os.getenv("PREFECT_REDIS_MESSAGING_DB", "0")),
username=os.getenv("PREFECT_REDIS_MESSAGING_USERNAME"),
password=os.getenv("PREFECT_REDIS_MESSAGING_PASSWORD"),
ssl=os.getenv("PREFECT_REDIS_MESSAGING_SSL", "false").lower() == "true",
health_check_interval=30,
)
print(await r.ping())
await r.close()
anyio.run(main)
PY
'
You should see True. If this fails, the same failure is likely causing the 500s.
Staged rollout suggestion
- First enable broker/cache only; keep ordering in memory:
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
# Do NOT set PREFECT_SERVER_EVENTS_CAUSAL_ORDERING yet
Verify workers can join and API endpoints work.
- Then enable PREFECT_SERVER_EVENTS_CAUSAL_ORDERING=prefect_redis.ordering and retest.
This helps isolate which Redis-backed component is failing.
Also, please double-check versions
Marvin
09/02/2025, 2:21 PM
- Run prefect version in the worker container and confirm the API image tag. Both should be the same Prefect major (ideally same minor/patch).
- Earlier your stack trace showed an underscore endpoint (/api/work_pools/), which is 2.x style; 3.x uses hyphens (/api/work-pools). If the Redis fix makes things work, great — but let’s still confirm versions to avoid a later surprise.
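A quick way to compare both sides; a sketch assuming worker and server run in the same cluster (deployment names and namespace are placeholders):
# Prefect version inside the worker container
kubectl exec -it deploy/<your-worker-deployment> -- prefect version
# image tag the API pods are actually running
kubectl -n <prefect-namespace> get deploy <your-api-deployment> \
  -o jsonpath='{.spec.template.spec.containers[0].image}{"\n"}'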
If you can re-enable Redis and share:
- The API pod stack trace when the 500 occurs (set PREFECT_LOGGING_LEVEL=DEBUG)
- Whether your server image has prefect-redis installed
- Whether Valkey requires auth on your endpoint
…I can zero in on the exact fix.
References
- Settings reference: Prefect 3 settings
- Redis integration package: prefect-redis