@Marvin I have a self-hosted Prefect server i...
# ask-marvin
r
@Marvin I have a self-hosted Prefect server in a Kubernetes cluster, and my flows are completing successfully, but all the tasks are stuck in a 'pending' state for about 15 minutes after flow completion. I can see from the logs that the tasks are actually running. I noticed in the worker logs that there are many 'unconfirmed events' piling up; I'm not sure if these are related, but what is causing the tasks to be stuck in the 'pending' state?
m
thought for 350.3 seconds
n
somehow the worker's websocket client might not be emitting the events... not sure why
sometimes people have intermittent cnx issues with websockets - how many tasks are we talking? are any
task-run.Completed
events coming through?
m
Short version: your workers aren’t getting their task state events acknowledged by your Prefect server. The UI depends on those events to move tasks out of Pending. When acks don’t arrive (WebSocket issues, base-path misconfig, etc.), the worker logs “unconfirmed events” piling up and the UI leaves tasks Pending until a background reconciliation finally catches up ~10–15 minutes later. What’s going on - “Unconfirmed events” = the worker sent events over the events WebSocket but never received the server’s ack. No ack, no UI updates. - Common culprits in self‑hosted K8s: - Ingress doesn’t support or times out WebSockets. - API URL/base path misconfigured (especially when serving Prefect under a subpath). - Known bugs fixed in recent 3.x releases affecting base-path handling and event services. How to fix (checklist) 1) Verify Prefect versions and upgrade - Make sure server and workers are on the latest 3.x. There were fixes around base-path and services recently. - Release notes: Latest Prefect releases (see 3.4.13 bugfix: “Use server base path configuration when creating internal client”). 2) Check your API URL from the worker’s perspective -
PREFECT_API_URL
must be reachable from worker pods and correct for your deployment. Typical shape is
http(s)://<host>/api
(include the base path if you serve Prefect under one, e.g.
http://<host>/prefect/api
). - If you’re on a subpath, ensure the server/base path settings are applied consistently so Prefect derives the correct WebSocket URL from the API URL. 3) Make sure your ingress permits WebSockets and long-lived connections - NGINX Ingress (example annotations): -
nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
-
nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
-
nginx.ingress.kubernetes.io/proxy-buffering: "off"
- ALB/Traefik: enable/allow WebSockets and increase idle timeouts similarly. - After fixing, you should see “Websocket connected” and fewer/no “unconfirmed events” in worker logs. 4) If you use Redis for messaging - There’s an open thread where tasks/subflows vanished from the UI with Redis messaging due to recorder not receiving events: Issue #18540. If that’s your setup, compare your config and logs to that issue. 5) Check server logs for the event services - Look for the task run recorder service starting and processing events; errors here point to the root cause. - Docs: Server (Helm) docs for configuring services and base paths. If you can share: - Prefect versions (server and workers) - Your
PREFECT_API_URL
and whether you’re serving under a subpath - Which ingress controller you’re using (NGINX/ALB/Traefik) and any relevant annotations - Whether you’ve enabled Redis messaging …I can point you to the exact knobs to turn. Otherwise, yes: it ran; the UI is just waiting for the news to arrive. Patience is a virtue. Unfortunately.
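For reference, a minimal sketch of an NGINX Ingress that applies the WebSocket-friendly annotations above to the Prefect API. The host, namespace, service name, and port are illustrative assumptions, not values from this thread; adjust them to your Helm release.
Copy code
# Hypothetical Ingress for a self-hosted Prefect API behind ingress-nginx.
# Host, namespace, and service/port names are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: prefect-api
  namespace: prefect
  annotations:
    # Keep long-lived WebSocket connections (events) from being cut off.
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
    - host: prefect.example.com        # assumed hostname
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: prefect-server   # service name from your Helm release (assumed)
                port:
                  number: 4200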
r
@Nate yeah I was having websocket issues show up in the logs when I had basic auth enabled; after disabling basic auth the issues disappeared from the logs, at least. I also tried using both the local k8s cluster service address and the external UI api address and updating the DNS as needed, but no luck. It's just the workers that are talking to the api server, not the flow pods as well, right?
The flow I'm testing this with has 12 tasks, I wasn't able to see any of the task-run.Completed events coming through, but the UI does update after about 15 minutes to show they completed
n
so flow run pods will indeed need to talk to the api. the engine (runs in the flow run execution environment) does all the chatting with the API / emitting task state change events. the worker just emits a couple events around its own health and when it submits a scheduled flow run
I wasn't able to see any of the task-run.Completed events coming through, but the UI does update after about 15 minutes to show they completed
hm, it sounds like they eventually showed up then?
r
@Nate I'd assume they came through at some point, but I wasn't able to track down the log entries from the prefect server; I wasn't sure if they just timed out at 15 minutes and were marked as completed since the flow completed successfully (it's almost always exactly 15 minutes). For the baseJobTemplate I originally had the dnsConfig set up to use my domain controllers for search/nameservers, but I added the internal kubernetes prefect.svc.cluster.local and associated internal ip as well; not sure if that was effective though. Does the worker pass the address for the api to the flow run pod, and/or can I set that manually in the env variable for the pod?
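On that last question: the Kubernetes worker normally injects its own API settings (including PREFECT_API_URL) into the flow-run pod's environment, and you can also set or override it explicitly via the work pool's base job template or a deployment's job_variables. A minimal prefect.yaml sketch, assuming a hypothetical work pool named k8s-pool and an in-cluster service URL (both placeholders):
Copy code
# prefect.yaml sketch: pass PREFECT_API_URL to flow-run pods through job_variables.
# Deployment name, work pool name, and URL are illustrative assumptions.
deployments:
  - name: my-flow
    entrypoint: flows/my_flow.py:my_flow
    work_pool:
      name: k8s-pool
      job_variables:
        env:
          PREFECT_API_URL: "http://prefect-server.prefect.svc.cluster.local:4200/api"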
@Marvin I tried this on prefect 3.4.10 for the server, workers, and flow run image, as well as prefect 3.4.13. It was working fine on both with the basic helm install, but the issue arose when I converted to high availability setup with 3 api servers, separate background services, a redis cluster in kubernetes, and a separate postgres cluster with patroni outside of kubernetes. For redis I tried using the Redis Enterprise Cluster setup as well as the bitnami Redis master/slave setup. Both seemed to function aside from this issue. I am using an nginx ingress for accessing the UI, but all api server connections should be using internal kubernetes service addresses without a 3rd party ingress.
m
thought for 218.5 seconds
Thanks — that narrows it down. Given it worked with the basic Helm install and only broke after you switched to HA with Redis + split background services, this is almost certainly your events pipeline in HA not behaving — i.e., workers’ event acks aren’t being processed, so tasks stay Pending until the server’s reconciliation service eventually catches up. Key HA-specific culprits to check - Redis messaging config consistency - Ensure ALL server components (every API pod and every background service pod) have the same messaging settings and point to the same Redis instance/db: -
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
-
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
-
PREFECT_REDIS_MESSAGING_HOST
,
PREFECT_REDIS_MESSAGING_PORT
,
PREFECT_REDIS_MESSAGING_DB
, and auth/SSL vars as needed - Verify the prefect-redis integration is installed in the images running API and background services. If it isn’t, Prefect can silently fall back to in-memory messaging on some pods, which breaks acks in HA. - Redis mode support - Prefect’s Redis messaging uses Redis Streams semantics. Redis “cluster” mode can be… uncooperative with streams unless keys are carefully hash-tagged to the same slot. If you’re on Redis Enterprise Cluster or OSS cluster, try: - Pointing Prefect at a proxy/single endpoint that presents a non-sharded view, or - Testing with a single-writer Redis (Bitnami primary or Enterprise with a proxy endpoint) to confirm cluster sharding is the cause. - There’s an open report where Redis messaging caused tasks/subflows to go missing in the UI: Issue #18540. Your symptoms are very similar. - TaskRunRecorder running and connected - In HA, the TaskRunRecorder service must be running in your background services and reading from the same messaging broker as the API pods. If it’s not running or cannot connect to Redis, task events won’t be recorded promptly (hence Pending in UI). - Check logs in the background services for: - “TaskRunRecorder started” - Connection errors to Redis - Backlogs/consumer errors - Prefect server services overview: Server (Helm) docs - API WebSocket path and acks - Even though you’re not using ingress for API, confirm worker logs show “Websocket connected” with stable connections to the API Service and not constant reconnects. - Since you have multiple API pods, make sure all API pods are identically configured for messaging. If some API pods run with in-memory messaging while others use Redis, acks will be inconsistent. Quick isolations to prove the cause - Temporarily disable Redis messaging (let Prefect default to in-memory) across all pods in HA. If the issue disappears, Redis messaging config or Redis mode is the culprit. - Keep Redis messaging on but point to a single non-cluster Redis (Bitnami primary only) and retest. If the problem vanishes, it’s Redis Cluster behavior with streams. Also, stay current - You tested 3.4.10 and 3.4.13. Keep 3.4.13+ for the base-path/internal client fixes and other services reliability improvements: - Latest releases: Release notes If you can share: - The exact env vars on an API pod and on a background service pod (the
PREFECT_MESSAGING_*
and
PREFECT_REDIS_*
set) - Whether your Redis endpoint is true Cluster mode vs a proxy/single endpoint - Logs from the TaskRunRecorder pod during a run …I can call out the exact misconfiguration. But right now, my money’s on mixed messaging backends across pods or Redis Cluster stream behavior. Delightful, I know.
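As a minimal sketch, the messaging block below would be applied identically to every API pod and every background-services pod; the setting names come from the message above, while the host and Secret name are illustrative assumptions.
Copy code
# Same block on API and background-services Deployments (values are examples).
env:
  - name: PREFECT_MESSAGING_BROKER
    value: "prefect_redis.messaging"
  - name: PREFECT_MESSAGING_CACHE
    value: "prefect_redis.messaging"
  - name: PREFECT_REDIS_MESSAGING_HOST
    value: "redis-master.prefect.svc.cluster.local"   # illustrative in-cluster host
  - name: PREFECT_REDIS_MESSAGING_PORT
    value: "6379"
  - name: PREFECT_REDIS_MESSAGING_DB
    value: "0"
  - name: PREFECT_REDIS_MESSAGING_PASSWORD
    valueFrom:
      secretKeyRef:
        name: prefect-redis-auth   # assumed Secret name
        key: password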
r
@Marvin I'm not seeing any mention of TaskRunRecorder in the background-services pod, even with debug mode on. This is all I see in the logs:
15:58:14.924 | DEBUG | prefect.profiles - Using profile 'ephemeral'
Starting services... Press CTRL+C to stop
15:58:16.383 | DEBUG | prefect.prefect_redis.messaging - Consumer group already exists: BUSYGROUP Consumer Group name already exists
15:59:18.289 | DEBUG | prefect.prefect_redis.messaging - No active consumer groups found for stream actions
I just redeployed redis in single master instance mode with no replicas and also placed it within the prefect namespace as a trial, but still no luck. The same issues persist. Here are the env variables on the server and background-services:
env:
  - name: PREFECT_API_DATABASE_MIGRATE_ON_START
    value: "false"
  - name: PREFECT_MESSAGING_BROKER
    value: "prefect_redis.messaging"
  - name: PREFECT_MESSAGING_CACHE
    value: "prefect_redis.messaging"
  - name: PREFECT_REDIS_MESSAGING_HOST
    value: "redis-master.prefect.svc.cluster.local"
  - name: PREFECT_REDIS_MESSAGING_PORT
    value: "6379"
  - name: PREFECT_REDIS_MESSAGING_USERNAME
    value: "default"
  - name: PREFECT_REDIS_MESSAGING_PASSWORD
    value: "redacted"
m
thought for 272.0 seconds
Got it — thanks for the details. The giveaway is your background-services pod not logging “TaskRunRecorder started” (or anything about task_run_recorder) and only showing Redis stream housekeeping. In HA, that means your recorder service isn’t actually running, so worker-emitted task events never get recorded — hence tasks linger in Pending until the eventual reconciliation pass. What to check/fix - Force-enable the recorder (and friends) - Add these env vars to the background-services deployment and restart: -
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true
-
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true
-
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=true
- After restart, you should see logs like “TaskRunRecorder started” in that pod. If not, it’s not being launched. - Ensure the background-services pod has the same DB config as API pods - It must write to the same Postgres as the API. If the DB URL isn’t set there, the recorder can’t persist runs. - Confirm your Postgres env (e.g.,
PREFECT_API_DATABASE_CONNECTION_URL
) is present in the background-services pod. - Redis messaging config consistency - You’ve set broker/cache to Redis — good. Explicitly add
PREFECT_REDIS_MESSAGING_DB
(e.g.,
0
) so every pod uses the same DB. - The log “No active consumer groups found for stream actions” suggests no consumer is running for that stream — i.e., your background services that should consume from Redis aren’t actually up (again pointing at services not starting). - Make sure the background-services image includes the redis integration - The log line from
prefect.prefect_redis.messaging
implies it’s there, but double-check the image build. Official images need the
prefect-redis
extra or package installed (e.g.,
pip install 'prefect-redis>=0.2'
or use
prefect[redis]
) for all server images, not just workers. - Sanity check WebSocket connectivity from workers - In worker logs you should see “Websocket connected” and not a constant reconnect loop. Unconfirmed events usually mean the server isn’t acking — because the recorder/event services aren’t alive. Useful references - Settings reference (service enable flags, DB, messaging): Settings reference - Helm/server overview: Server (Helm) docs - Related Redis messaging issue (symptoms match): Issue #18540 If you share: - The background-services Deployment spec (image, command/args, env) - Whether
PREFECT_API_DATABASE_CONNECTION_URL
is set in that pod - A few minutes of its logs after restart …I can pinpoint why the recorder isn’t starting. But right now, the fix is to explicitly enable TaskRunRecorder and ensure that pod can reach both Redis and your Postgres. Without the recorder running, the UI will continue to contemplate “Pending” for far too long, much like the rest of us.
r
@Marvin a redeploy of the server with all redis settings removed cleared it up, but obviously not ideal. Is there a recommended redis setup for prefect HA?
m
thought for 52.2 seconds
Short answer: use non-clustered Redis with persistence and a single stable endpoint. Prefect’s events/messaging pipeline uses Redis Streams; cluster mode tends to cause grief (hash slots, multi-stream ops) and leads to exactly the symptoms you saw. Recommended HA patterns for Prefect 3 - Topology - Preferred: a non-clustered Redis with persistence (AOF) and HA failover - Options: - Managed: AWS ElastiCache (cluster mode disabled), GCP Memorystore (Standard), Azure Cache for Redis (non-clustered). - Self-hosted: Bitnami Redis with Sentinel (primary + replicas + automatic failover). - Avoid: Redis Cluster (cluster mode enabled). Prefect does not guarantee stream keys land in the same slot; multi-stream consumer behavior may trigger MOVED errors or partial consumption. If you must use Enterprise/Cluster, use a proxy/single endpoint that abstracts sharding. - Redis configuration - Persistence: enable AOF (appendonly yes) so consumer groups and offsets survive restarts. - Memory policy: use noeviction (or at least do not evict stream keys). Evicting streams will cause dropped events and Pending tasks. - Keepalive/timeouts: sane defaults (e.g., tcp-keepalive 300). - Security/TLS: as required for your environment. - Prefect configuration (all API pods + background-services pods) - Ensure the redis integration is installed in those images (e.g., prefect-redis or prefect[redis]). - Set the same messaging environment in every server-facing pod: -
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
-
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
-
PREFECT_REDIS_MESSAGING_HOST=redis.svc.cluster.local
-
PREFECT_REDIS_MESSAGING_PORT=6379
-
PREFECT_REDIS_MESSAGING_DB=0
-
PREFECT_REDIS_MESSAGING_USERNAME=default
(if applicable) -
PREFECT_REDIS_MESSAGING_PASSWORD=...
- Make sure background-services is actually running the recorder: -
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true
- Also keep scheduler/repossessor enabled in HA: -
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true
-
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=true
- And that background-services has the same Postgres URL as API pods. - What “healthy” looks like - Background-services logs should show the recorder starting (not just Redis group housekeeping). - Worker logs should show “Websocket connected” and minimal “unconfirmed events”. - Redis should show consumer groups on expected streams: -
XINFO GROUPS events
-
XINFO GROUPS actions
-
XPENDING events
should not trend upward indefinitely. - Kubernetes notes - Service endpoints only; you don’t need ingress for API-to-Redis. - If you use Sentinel, point Prefect at the sentinel-aware endpoint (managed by your chart) so failover is automatic but the hostname remains stable. - Keep Redis in the same cluster/region/namespace to minimize latency. If you want, I can share a minimal Bitnami Redis + Sentinel values sketch or check your background-services Deployment to confirm the recorder and messaging settings are wired correctly. References: - Settings reference: Settings reference - Server/Helm overview: Server (Helm) docs - Related Redis messaging issue: Issue #18540 I realize “disable Redis and it works” is tempting. For HA, though, a single endpoint, non-clustered Redis with persistence and the recorder service actually running is the boring, reliable option—my favorite kind.
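Picking up the offer of a "minimal Bitnami Redis + Sentinel values sketch", one possible shape is below. The keys follow the Bitnami redis chart's general layout but should be verified against the chart version you deploy; password handling is deliberately simplified here.
Copy code
# values.yaml sketch for the Bitnami "redis" chart: replication + Sentinel,
# AOF persistence, and no eviction of stream keys. Treat key names as assumptions
# to check against your chart version.
architecture: replication
auth:
  enabled: true
  password: "change-me"          # prefer existingSecret in real deployments
sentinel:
  enabled: true                  # clients connect via the sentinel-aware service
replica:
  replicaCount: 2
  persistence:
    enabled: true
master:
  persistence:
    enabled: true
commonConfiguration: |-
  appendonly yes
  maxmemory-policy noeviction
  tcp-keepalive 300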
n
you will have consistency issues if you run many servers without a distributed message broker
r
awesome thanks for your time @Nate, I'll give it another shot with the redis sentinel settings
@Marvin I set up a single redis master node in my prefect namespace for testing, but it still isn't working with these settings:
env:
  - name: PREFECT_API_DATABASE_MIGRATE_ON_START
    value: "false"
  - name: PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED
    value: "true"
  - name: PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED
    value: "true"
  - name: PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED
    value: "true"
  - name: PREFECT_MESSAGING_BROKER
    value: "prefect_redis.messaging"
  - name: PREFECT_MESSAGING_CACHE
    value: "prefect_redis.messaging"
  - name: PREFECT_REDIS_MESSAGING_HOST
    value: "redis-master.prefect.svc.cluster.local"
  - name: PREFECT_REDIS_MESSAGING_PORT
    value: "6379"
  - name: PREFECT_REDIS_MESSAGING_PASSWORD
    value: "..."
m
thought for 89.3 seconds
Thanks — that helps. Given you’ve enabled the services and switched to a single Redis master, but it still behaves the same, we need to prove two things: 1) The API pods can publish events to Redis 2) The background-services pod is actually consuming them (TaskRunRecorder running) Right now your logs suggest Redis is reachable, but nothing is consuming “events” (and workers keep piling up “unconfirmed events” because the API isn’t acking/persisting). Do these quick checks: - Verify effective settings inside both an API pod and the background-services pod - Run this in each pod:
Copy code
prefect config view --show-defaults --show-sources
- Confirm these are present and identical in both: -
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
-
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
-
PREFECT_REDIS_MESSAGING_HOST=redis-master.prefect.svc.cluster.local
-
PREFECT_REDIS_MESSAGING_PORT=6379
-
PREFECT_REDIS_MESSAGING_DB=0
← you’re missing this, set it explicitly -
PREFECT_REDIS_MESSAGING_USERNAME
(set to
default
if your Redis uses ACLs) -
PREFECT_REDIS_MESSAGING_PASSWORD=...
- If any API pod is missing these, it’ll publish to in-memory and never reach Redis. - Confirm the recorder is running in background-services - With
DEBUG
logging, you should see something like “TaskRunRecorder started”. - If you don’t, the service isn’t starting. Keep: -
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true
-
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true
-
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=true
- Check Redis for streams and groups - From a shell with redis-cli access:
Copy code
XINFO STREAM events
    XINFO GROUPS events
    XINFO STREAM actions
    XINFO GROUPS actions
- You should see consumer groups for Prefect. If “events” has no groups while “actions” does, your recorder isn’t attached to events, or the distributor isn’t pushing. - Look for publish/consumer errors in API and background logs - API pods: search for errors from
prefect.server.events.messaging
when creating publishers or publishing (if they can’t publish, they won’t ack workers → “unconfirmed events”). - Background-services: search for
TaskRunRecorder
and any consumer errors. - Ensure prefect-redis is installed in your server images - Inside an API and the background pod:
Copy code
python -c "import prefect.prefect_redis; print('ok')"
- If that import fails, the env vars will be ignored and you’ll silently fall back to in-memory on that pod. Install the extra in the image (e.g., pip install 'prefect-redis>=0.2' or use a Prefect image built with the redis extra). Minimal env set I recommend on all server-side pods (API + background-services)
Copy code
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=true

PREFECT_MESSAGING_BROKER=prefect_redis.messaging
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
PREFECT_REDIS_MESSAGING_HOST=redis-master.prefect.svc.cluster.local
PREFECT_REDIS_MESSAGING_PORT=6379
PREFECT_REDIS_MESSAGING_DB=0
# If Redis ACLs are enabled:
# PREFECT_REDIS_MESSAGING_USERNAME=default
PREFECT_REDIS_MESSAGING_PASSWORD=...
If, after this, you still don’t see “TaskRunRecorder started”, paste the first 100 lines of the background-services logs after a restart. If “events” has no consumer groups, paste
XINFO GROUPS events
. We’ll zero in on whether the API isn’t publishing or the recorder isn’t consuming. Yes, it’s a bit of “everything must match.” HA is thrilling like that.
Your message was too long, here's your message at the allowed limit: I see lots of activity via the redis-cli but still no tasks in the prefect ui:
1755714088.778886 [0 192.168.241.233:42780] XAUTOCLAIM actions actions actions-prefect-server-5bbcd44c5b-nns8z-7-a70ec10b 5000 0-0 COUNT 1
1755714088.779271 [0 192.168.225.156:33064] XAUTOCLAIM events event-persister event-persister-prefect-server-5bbcd44c5b-nsf4p-7-dde07edc 5000 0-0 COUNT 1
1755714088.779337 [0 192.168.61.229:47252] XAUTOCLAIM events reactive-triggers reactive-triggers-prefect-server-5bbcd44c5b-fx8lq-7-decb7016 5000 0-0 COUNT 1
1755714088.779461 [0 192.168.241.233:42780] XREADGROUP GROUP actions actions-prefect-server-5bbcd44c5b-nns8z-7-a70ec10b COUNT 1 BLOCK 1000 STREAMS actions >
1755714088.779998 [0 192.168.225.156:33064] XREADGROUP GROUP event-persister event-persister-prefect-server-5bbcd44c5b-nsf4p-7-dde07edc COUNT 1 BLOCK 1000 STREAMS events >
1755714088.780034 [0 192.168.61.229:39250] XAUTOCLAIM events task-run-recorder task-run-recorder-prefect-server-5bbcd44c5b-fx8lq-7-f3e9b917 5000 0-0 COUNT 1
1755714088.780048 [0 192.168.61.229:47252] XREADGROUP GROUP reactive-triggers reactive-triggers-prefect-server-5bbcd44c5b-fx8lq-7-decb7016 COUNT 1 BLOCK 1000 STREAMS events >
1755714088.780565 [0 192.168.61.229:39250] XREADGROUP GROUP task-run-recorder task-run-recorder-prefect-server-5bbcd44c5b-fx8lq-7-f3e9b917 COUNT 1 BLOCK 1000 STREAMS events >
1755714088.987128 [0 192.168.241.234:43272] XAUTOCLAIM events task-run-recorder task-run-recorder-prefect-server-background-services-765b67498b-hjvz7-7-df704d82 5000 0-0 COUNT 1
1755714088.987193 [0 192.168.241.233:42798] XAUTOCLAIM events reactive-triggers reactive-triggers-prefect-server-5bbcd44c5b-nns8z-7-1a456ebe 5000 0-0 COUNT 1
1755714088.987313 [0 192.168.241.234:43252] XAUTOCLAIM events event-persister event-persister-prefect-server-background-services-765b67498b-hjvz7-7-195adb67 5000 0-0 COUNT 1
1755714088.987514 [0 192.168.241.233:42798] XREADGROUP GROUP reactive-triggers reactive-triggers-prefect-server-5bbcd44c5b-nns8z-7-1a456ebe COUNT 1 BLOCK 1000 STREAMS events >
1755714088.987560 [0 192.168.241.234:43272] XREADGROUP GROUP task-run-recorder task-run-recorder-prefect-server-background-services-765b67498b-hjvz7-7-df704d82 COUNT 1 BLOCK 1000 STREAMS events >
1755714088.987685 [0 192.168.225.156:33070] XAUTOCLAIM events task-run-recorder task-run-recorder-prefect-server-5bbcd44c5b-nsf4p-7-61fbe48a 5000 0-0 COUNT 1
1755714088.987701 [0 192.168.61.229:39238] XAUTOCLAIM events event-persister event-persister-prefect-server-5bbcd44c5b-fx8lq-7-5a6f6553 5000 0-0 COUNT 1
1755714088.987785 [0 192.168.225.156:33098] XAUTOCLAIM events reactive-triggers reactive-triggers-prefect-server-5bbcd44c5b-nsf4p-7-df036a9e 5000 0-0 COUNT 1
1755714088.987806 [0 192.168.241.234:43252] XREADGROUP GROUP event-persister event-persister-prefect-server-background-services-765b67498b-hjvz7-7-195adb67 COUNT 1 BLOCK 1000 STREAMS events >
1755714088.988285 [0 192.168.61.229:39238] XREADGROUP GROUP event-persister event-persister-prefect-server-5bbcd44c5b-fx8lq-7-5a6f6553 COUNT 1 BLOCK 1000 STREAMS events >
1755714088.988295 [0 192.168.225.156:33070] XREADGROUP GROUP task-run-recorder task-run-recorder-prefect-server-5bbcd44c5b-nsf4p-7-61fbe48a COUNT 1 BLOCK 1000 STREAMS events >
1755714088.988480 [0 192.168.225.156:33098] XREADGROUP GROUP reactive-triggers reactive-triggers-prefect-server-5bbcd44c5b-nsf4p-7-df036a9e COUNT 1 BLOCK 1000 STREAMS events >
1755714089.088651 [0 192.168.61.229:39220] XAUTOCLAIM actions actions actions-prefect-server-5bbcd44c5b-fx8lq-7-7e4bbc63 5000 0-0 COUNT 1
1755714089.089135 [0 192.168.61.229:39220] XREADGROUP GROUP actions actions-prefect-server-5bbcd44c5b-fx8lq-7-7e4bbc63 COUNT 1 BLOCK 1000 STREAMS actions >
1755714089.189259 [0 192.168.225.156:33054] XAUTOCLAIM actions actions actions-prefect-server-5bbcd44c5b-nsf4p-7-e0a0d368 5000 0-0 COUNT 1
1755714089.189866 [0 192.168.225.156:33054] XREADGROUP GROUP actions actions-prefect-server-5bbcd44c5b-nsf4p-7-e0a0d368 COUNT 1 BLOCK 1000 STREAMS actions >
1755714089.289747 [0 192.168.241.234:43268] XAUTOCLAIM actions actions actions-prefect-server-background-services-765b67498b-hjvz7-7-eded7241 5000 0-0 COUNT 1
1755714089.290232 [0 192.168.241.234:43268] XREADGROUP GROUP actions actions-prefect-server-background-services-765b67498b-hjvz7-7-eded7241 COUNT 1 BLOCK 1000 STREAMS actions >
1755714089.423014 [0 127.0.0.1:49806] AUTH (redacted)
1755714089.423127 [0 127.0.0.1:49806] ping
1755714089.592683 [0 192.168.241.233:42792] XAUTOCLAIM events
r
@Marvin I see lots of activity via the redis-cli, but still no tasks in the prefect ui: 1755714088.778886 [0 192.168.241.233:42780] "XAUTOCLAIM" "actions" "actions" "actions-prefect-server-5bbcd44c5b-nns8z-7-a70ec10b" "5000" "0-0" "COUNT" "1" 1755714088.779271 [0 192.168.225.156:33064] "XAUTOCLAIM" "events" "event-persister" "event-persister-prefect-server-5bbcd44c5b-nsf4p-7-dde07edc" "5000" "0-0" "COUNT" "1" 1755714088.779337 [0 192.168.61.229:47252] "XAUTOCLAIM" "events" "reactive-triggers" "reactive-triggers-prefect-server-5bbcd44c5b-fx8lq-7-decb7016" "5000" "0-0" "COUNT" "1" 1755714088.779461 [0 192.168.241.233:42780] "XREADGROUP" "GROUP" "actions" "actions-prefect-server-5bbcd44c5b-nns8z-7-a70ec10b" "COUNT" "1" "BLOCK" "1000" "STREAMS" "actions" ">" 1755714088.779998 [0 192.168.225.156:33064] "XREADGROUP" "GROUP" "event-persister" "event-persister-prefect-server-5bbcd44c5b-nsf4p-7-dde07edc" "COUNT" "1" "BLOCK" "1000" "STREAMS" "events" ">" 1755714088.780034 [0 192.168.61.229:39250] "XAUTOCLAIM" "events" "task-run-recorder" "task-run-recorder-prefect-server-5bbcd44c5b-fx8lq-7-f3e9b917" "5000" "0-0" "COUNT" "1"
m
thought for 280.6 seconds
Those Redis entries just mean the server’s loop services are busily polling the streams. They don’t guarantee anything made it into Postgres or that you’re actually looking at the same API the UI/clients are talking to. In other words: lots of chatter, no party. What those consumers are - events: the event bus stream - event-persister: reads “events” and writes them to Postgres - task-run-recorder: turns run state changes into events - reactive-triggers: listens to events and fires triggers - actions: orchestrates action events If the UI is empty, the usual culprits are: - API URL mismatch: your workers/flows and your UI are pointed at different APIs - Workspace/tenant mismatch (Cloud) or wrong profile - DB migrations not applied or Postgres unreachable - UI configured to call the wrong API endpoint behind your ingress/proxy Quick, targeted checks Client side (where you run flows/workers) 1) Confirm the API you’re talking to
Copy code
prefect config view --show-sources
Ensure
PREFECT_API_URL
is your self-hosted Server URL. If this points elsewhere, you’re writing runs to a different API than the UI. 2) Do runs actually exist at that API?
Copy code
prefect flow-run ls --limit 10
- If you see runs here but not in the UI, the UI is pointed at the wrong API. - If this is empty, you haven’t created runs at this API yet. 3) Are events flowing?
Copy code
prefect events stream --format text
You should see events for flow/task state changes when you trigger a run. Server side 4) Migrations
Copy code
prefect server database upgrade
If migrations weren’t applied, the UI will be quite… minimal. 5) Services
Copy code
prefect server services ls
You should see event-persister, task-run-recorder, etc. If you’re running in Kubernetes, also check pod logs:
Copy code
kubectl logs deploy/prefect-server -c event-persister
kubectl logs deploy/prefect-server -c task-run-recorder
Look for DB errors or backoffs. Pay special attention to
PREFECT_API_DATABASE_CONNECTION_URL
. UI/API alignment 6) Verify what API the UI is calling - In your browser dev tools, Network tab, filter for
/api/
– do those requests succeed, and what host are they hitting? - In your UI container/Helm values, verify the API URL env (commonly
PREFECT_API_URL
or a
VITE_*
API URL) matches your server API endpoint. A mismatch here = empty UI. Sanity test 7) Minimal flow against your Server - On a machine that can reach your Server, set
PREFECT_API_URL
to your Server’s API. - Run:
Copy code
from prefect import flow

@flow
def hello():
    return "hi"

if __name__ == "__main__":
    hello()
You should see a flow run in
prefect flow-run ls
and in the UI within a few seconds. A few notes on your Redis entries - XREADGROUP/XAUTOCLAIM from those consumer groups are normal. They indicate the services are polling/claiming messages; they do not confirm persistence into Postgres or that flow/task runs exist. Docs and references - Server services and event persister API reference: Event persister - Managing settings (e.g., PREFECT_API_URL): Manage settings - Settings reference: Settings reference - Workspaces (if you’re on Cloud): Cloud workspaces If you share: - How you’re running Prefect (Server vs Cloud, Helm chart/version) - The
PREFECT_API_URL
from both your worker/flow environment and the UI - The output of
prefect flow-run ls --limit 10
- Any errors from the event-persister/task-run-recorder logs …I can point you to the exact loose screw. I know, thrilling.
r
@Marvin I ran a series of tests to deduce the failure point, so far:
• API server replicas: 1, Separate Redis Pod, Separate background services - did not work
• API server replicas: 1, Separate Redis Pod, Background services in same pod as api - works
• API server replicas: 3, Separate Redis Pod, Separate background services - did not work
• API server replicas: 3, In-memory messaging, Separate background services - works
What do you gather from these results? It seems inconsistent
m
thought for 170.2 seconds
n
do your background services also have the redis messaging settings configured?
r
yes, I copy and pasted the env variables to both sections of the helm chart (not including the task run recorder, scheduler, and repossessor enable flags, which were in background services only). I also verified that these were present inside each of the pods
m
Short version: your Redis-backed setup is suffering from “split-brain” config between the API and the background services. When they share a pod, they share env and it works. When they’re separate, something diverges (Redis URL/db index, TLS, or Postgres creds/migrations), so events get read but never make it into Postgres — hence no task runs in the UI. With in-memory messaging you accidentally keep the pieces that need to talk in the same process, so it also “works”. What your matrix suggests - Separate services + Redis (fails with 1 or 3 API replicas): background services can read Redis (you saw XREADGROUP/XAUTOCLAIM) but cannot persist to Postgres or are listening to a different Redis database/namespace than the API is publishing to. - Services co-located with API + Redis (works): identical env → same Redis URL/index and DB creds → persistence succeeds. - Separate services + in-memory (works): either those services are in-process with the API (same broker) or your manifests for this case gave the services the correct DB env while the Redis case did not. In-memory also avoids any cross-pod broker mismatch. Most likely root causes - Redis mismatch between API and services: - Different database index (redis://host:6379/0 vs /1), password, or TLS scheme (redis:// vs rediss://) - Different broker settings so API publishes to one place and services consume from another - Postgres not writable from the services pod: - Missing/wrong
PREFECT_API_DATABASE_CONNECTION_URL
- Migrations not applied in the environment the services write to - Double-running services in different places with conflicting configs Targeted checks 1) Compare env between API and services pods - Print and diff: - Redis URL (host, port, db index, password, scheme) -
PREFECT_API_DATABASE_CONNECTION_URL
- Any settings you use to pick the events broker (ensure both sides pick Redis in the Redis test) - In Kubernetes:
Copy code
kubectl exec -it <api-pod> -- env | sort
kubectl exec -it <services-pod> -- env | sort
2) Verify both sides see the same Redis stream - From the API pod and the services pod:
Copy code
redis-cli -u "$REDIS_URL" XINFO STREAM events
redis-cli -u "$REDIS_URL" XINFO GROUPS events
You should see the same stream metadata and groups, including
event-persister
and
task-run-recorder
. If one side shows a different DB index (empty stream) or can’t auth, that’s the smoking gun. 3) Check background service logs for DB errors
Copy code
kubectl logs <services-pod> -c event-persister
kubectl logs <services-pod> -c task-run-recorder
Look for psycopg connection failures, permission errors, or “relation … does not exist” (migrations). 4) Confirm migrations
Copy code
prefect server database upgrade
Run this against the same DB the services use. A missing alembic migration will make the persister/recorder unhappy and the UI very quiet. 5) Ensure only one broker is configured for a given test - If the API is configured for Redis but the services default to in-memory (or vice versa), they will never meet. 6) With multiple API replicas - It’s fine for services to run as a single separate process; just don’t run them both in the API and as a separate deployment with conflicting settings. CLI and docs that help - View current settings:
prefect config view --show-sources
- Server services:
prefect server services ls
/
prefect server services start
- DB migrations:
prefect server database upgrade
- Event persister API ref: Event persister - Settings reference: Settings reference If you can share: - The exact Redis URL used by API vs services (including db index and scheme) - The Postgres URL env in the services pod - Any errors from event-persister/task-run-recorder logs …I can point to the precise misconfiguration. I know, the suspense is unbearable.
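One way to rule out the config drift ("split-brain") described above is to keep the shared messaging settings in a single ConfigMap (and credentials in a Secret) consumed by both Deployments via envFrom, so the API and background services cannot diverge. A sketch with illustrative names and values:
Copy code
# Shared settings consumed by BOTH the API Deployment and the background-services
# Deployment. Names and values are illustrative, not from this thread.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prefect-shared-settings
  namespace: prefect
data:
  PREFECT_MESSAGING_BROKER: "prefect_redis.messaging"
  PREFECT_MESSAGING_CACHE: "prefect_redis.messaging"
  PREFECT_REDIS_MESSAGING_HOST: "redis-master.prefect.svc.cluster.local"
  PREFECT_REDIS_MESSAGING_PORT: "6379"
  PREFECT_REDIS_MESSAGING_DB: "0"
# In each Deployment's pod spec, reference the same sources:
#   containers:
#     - name: prefect
#       envFrom:
#         - configMapRef:
#             name: prefect-shared-settings
#         - secretRef:
#             name: prefect-shared-secrets   # assumed Secret holding DB URL and Redis password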
r
@Marvin here are the full env variables for the api server:
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=true
PREFECT_REDIS_MESSAGING_PORT=6379
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
PREFECT_DEBUG_MODE=true
PREFECT_SERVER_API_BASE_PATH=/api
PREFECT_SERVER_LOGS_STREAM_OUT_ENABLED=true
PREFECT_API_DATABASE_MIGRATE_ON_START=false
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
PREFECT_REDIS_MESSAGING_PASSWORD=redispassword
PREFECT_REDIS_MESSAGING_DB=0
PREFECT_LOGGING_SERVER_LEVEL=DEBUG
PREFECT_UI_API_URL=https://prefect.domain.com/api
PREFECT_UI_STATIC_DIRECTORY=/ui_build
PREFECT_API_DATABASE_CONNECTION_URL=postgresql+asyncpg://prefect:password@10.42.7.140:5432/prefect
PREFECT_REDIS_MESSAGING_HOST=redis-master.prefect.svc.cluster.local
PREFECT_SERVER_API_HOST=0.0.0.0
PREFECT_SERVER_API_PORT=4200
PREFECT_SERVER_PORT=tcp://10.97.57.97:4200
PREFECT_SERVER_PORT_4200_TCP=tcp://10.97.57.97:4200
PREFECT_SERVER_SERVICE_HOST=10.97.57.97
PREFECT_SERVER_SERVICE_PORT=4200
PREFECT_SERVER_SERVICE_PORT_SERVER_SVC_PORT=4200
PREFECT_SERVER_PORT_4200_TCP_PROTO=tcp
PREFECT_SERVER_PORT_4200_TCP_ADDR=10.97.57.97
PREFECT_SERVER_PORT_4200_TCP_PORT=4200
and these are the env variables for the background services:
PREFECT_LOGGING_SERVER_LEVEL=DEBUG
PREFECT_UI_ENABLED=false
PREFECT_API_DATABASE_CONNECTION_URL=postgresql+asyncpg://prefect:password@10.42.7.140:5432/prefect
PREFECT_API_DATABASE_MIGRATE_ON_START=false
PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true
PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true
PREFECT_REDIS_MESSAGING_PORT=6379
PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=true
PREFECT_MESSAGING_CACHE=prefect_redis.messaging
PREFECT_SERVER_LOGS_STREAM_OUT_ENABLED=true
PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=true
PREFECT_MESSAGING_BROKER=prefect_redis.messaging
PREFECT_REDIS_MESSAGING_HOST=redis-master.prefect.svc.cluster.local
PREFECT_REDIS_MESSAGING_PASSWORD=redispassword
PREFECT_REDIS_MESSAGING_DB=0
PREFECT_DEBUG_MODE=true
PREFECT_SERVER_SERVICE_PORT_SERVER_SVC_PORT=4200
PREFECT_SERVER_PORT_4200_TCP_ADDR=10.97.57.97
PREFECT_SERVER_PORT=tcp://10.97.57.97:4200
PREFECT_SERVER_PORT_4200_TCP_PROTO=tcp
PREFECT_SERVER_PORT_4200_TCP_PORT=4200
PREFECT_SERVER_SERVICE_HOST=10.97.57.97
PREFECT_SERVER_SERVICE_PORT=4200
PREFECT_SERVER_PORT_4200_TCP=tcp://10.97.57.97:4200
I did not see any errors from the event-persister or task run recorder logs
m
thought for 163.9 seconds
Good news: your envs give away the culprit. In your background services pod you never enable the event persister. You’ve set: - PREFECT_SERVER_SERVICES_TASK_RUN_RECORDER_ENABLED=true - PREFECT_SERVER_SERVICES_SCHEDULER_ENABLED=true - PREFECT_SERVER_SERVICES_REPOSSESSOR_ENABLED=true - PREFECT_SERVER_SERVICES_EVENT_LOGGER_ENABLED=true …but not: - PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=true When you co-locate services with the API, the default set of services tends to be enabled together and persistence happens — so it “works.” When you split them out, you’ve explicitly enabled some services and accidentally left the persister off — so events are read from Redis but never flushed to Postgres, and the UI remains barren. What to do - Add this to the background services env and redeploy:
Copy code
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=true
- Then confirm services are actually running:
Copy code
prefect server services ls
You should see Event Persister, Task Run Recorder, Scheduler, etc. “running.” Sanity checks (just in case) - Verify both pods use the same image/tag (so they share the same messaging plugin and settings). - Confirm DB connectivity from the services pod. If you’ve never applied migrations in this env:
Copy code
prefect server database upgrade
- Ensure both API and services agree on Redis settings: - PREFECT_MESSAGING_BROKER=prefect_redis.messaging - PREFECT_REDIS_MESSAGING_HOST/PORT/PASSWORD/DB - If you want to be thorough, from the services pod:
Copy code
redis-cli -h redis-master.prefect.svc.cluster.local -a redispassword -n 0 XINFO GROUPS events
You should see the consumer groups created by the event persister and task-run recorder. References - Settings reference (service enable flags): Settings reference - Event Persister service API ref: Event persister Flip that one flag and your “inconsistent” matrix should become consistent, in the satisfying way that makes everyone else yawn.
r
@Marvin I had already enabled ALL services on the api server and background services according to 'prefect server services ls', per your instructions should some of them have been disabled on one or the other?
m
🔧 Using: check_cli_command
📊 Tools used: check_cli_command (3x) • display_callable_signature (1x) • explore_module_offerings (1x) • research_prefect_topic (3x) • search_github_discussions (1x)
r
@Marvin event persister had already been enabled on the background services, and also on the api server by default
m
thought for 263.4 seconds
n
hi @Ryan Sharp it's possible that marvin forgets this thread..... i understand that'd be unfortunate, just heads up
tldr i had to redeploy a thing
if you're still encountering issues i think we could codify this thread as a discussion so we can investigate any rough edges and others can discover it more easily
m
You’re right — in Prefect 3.x the Event Persister is enabled by default for both the API server and the background services. Quick reference: - What it does: Consumes event messages (e.g., state changes) and persists them to the Prefect database for querying/automations/audit. - Defaults: Enabled on both API server and background services out of the box. How to configure or disable: - Toggle on/off - Set
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED=false
to disable. - Example (Docker Compose):
Copy code
services:
      prefect-server:
        environment:
          PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED: "false"
- Example (Kubernetes):
Copy code
env:
    - name: PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED
      value: "false"
- Tuning knobs (if you need to adjust throughput/latency): -
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_BATCH_SIZE
-
PREFECT_SERVER_SERVICES_EVENT_PERSISTER_FLUSH_INTERVAL
- These control batching and flush cadence for event writes. Where events are stored: - In your Prefect Server database (PostgreSQL for production). The server’s event storage layer writes/reads via the events database helpers (e.g.,
write_events
,
read_events
). How to verify it’s running: - Logs: Set
PREFECT_LOGGING_LEVEL=DEBUG
on the API/background services; you should see the Event Persister start and periodic batches being processed. - API probe: Hit the events endpoint to confirm you’re receiving data
Copy code
curl -s "$PREFECT_API_URL/events?limit=1"
If your
PREFECT_API_URL
already ends with
/api
, just append
/events
. Relevant docs: - Event Persister (API reference): Event Persister service (API ref) - Settings reference (env vars): Settings reference - Events API endpoints: Events API (server) If you’re looking to disable it for a specific service or tune batch/flush settings, tell me how you’re deploying Prefect (Docker Compose, Helm, bare processes) and I can share exact snippets.
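For the batching/flush knobs mentioned above, a Kubernetes env sketch; the setting names come from the message above, while the numbers are arbitrary placeholders to tune, not recommendations:
Copy code
# Example only: adjust how many events are written per batch and how often
# the persister flushes. Values are placeholders.
env:
  - name: PREFECT_SERVER_SERVICES_EVENT_PERSISTER_ENABLED
    value: "true"
  - name: PREFECT_SERVER_SERVICES_EVENT_PERSISTER_BATCH_SIZE
    value: "20"
  - name: PREFECT_SERVER_SERVICES_EVENT_PERSISTER_FLUSH_INTERVAL
    value: "5"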
r
@Nate no worries, the context was definitely getting lengthy for the bot, I'll take another crack at this tomorrow with a couple different redis configurations
@Marvin I was able to trace the lifecycle of a task through the redis logs and redis is tracking the task successfully and all 3 instances of the API server and the background services are acknowledging each state change from Pending > Running > Completed. However, I found that the task_run is not updated until 15 minutes afterwards in my Postgres db, and that is what the UI must be reading from. Why is there a 15 minute delay and what service is writing the task state to Postgres specifically?
m
🔧 Using: explore_module_offerings
📊 Tools used: display_callable_signature (1x) • explore_module_offerings (3x) • research_prefect_topic (1x)