Ryan Sharp
08/21/2025, 6:44 PM
Marvin
08/21/2025, 6:44 PM
Brendan Dalpe
08/21/2025, 7:16 PM
Marvin
08/21/2025, 7:16 PM
Marvin
08/21/2025, 7:16 PM
Brendan Dalpe
08/21/2025, 7:23 PM
I was able to trace the lifecycle of a task through the Redis logs, and Redis is tracking the task successfully; all 3 instances of the API server and the background services acknowledge each state change from Pending > Running > Completed. However, I found that the task_run is not updated until 15 minutes afterwards in my Postgres DB, causing task states to not update as expected in the UI. Why might there be a 15 minute delay, and what service specifically writes the task state to Postgres?
Marvin
08/21/2025, 7:23 PM
Tools used:
• explore_module_offerings (1x)
• read_github_issues (1x)
• research_prefect_topic (3x)
Ryan Sharp
08/21/2025, 7:47 PM
Marvin
08/21/2025, 7:47 PM
Ryan Sharp
08/21/2025, 7:47 PM
Brendan Dalpe
08/21/2025, 7:50 PM
Marvin
08/21/2025, 7:51 PM
…task_run state promptly, the UI will look "stale" until the database has the update.
Given your symptoms (Redis sees state transitions immediately; API/background services acknowledge; the Postgres row doesn't reflect the change for ~15 minutes), the most common root causes are:
- Your API is reading from a Postgres read replica that has replication lag
- Classic case if you're on RDS/Aurora and pointed the API (or your own SQL client) at a "reader endpoint" or a load-balancer/proxy that routes SELECTs to a replica
- The UI calls the API; if the API reads from a lagging replica, it will serve stale data
- PgBouncer in transaction pooling mode or a misconfigured DB proxy
- Transaction pooling can break session semantics and sometimes leads to surprising read-your-writes issues
- Prefect expects read-your-writes consistency; use session pooling for PgBouncer
- An HTTP proxy/CDN caching API responses
- Less likely based on your note about the Postgres row itself not changing, but if your SQL checks are also going to a replica, this could still be part of the picture
- Database pressure or locks causing commit delays
- Check DB metrics for long transactions, lock waits, or slow checkpoints; this is rarer to line up at exactly ~15 minutes, but worth ruling out
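To rule out that last point, a rough query against pg_stat_activity can surface long-running transactions and lock waits (illustrative only; the one-minute threshold is arbitrary):
-- Run on the primary the Prefect API writes to
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       now() - xact_start AS xact_age,
       left(query, 80)    AS query_preview
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
  AND now() - xact_start > interval '1 minute'
ORDER BY xact_age DESC;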
What to check next
1) Verify where you're reading/writing (see the sketch after this list)
- Confirm the database URL used by the Prefect API points to the writer/primary
- From the DB used by the API, run:
- SELECT pg_is_in_recovery(); → should be false on the writer
- On replicas, SELECT now() - pg_last_xact_replay_timestamp(); → shows replication lag
- If you use Aurora/RDS Proxy or any LB that splits reads/writes, make sure the Prefect API is not hitting a replica for reads
2) If using PgBouncer
- Set pool_mode = session
- Ensure server reset/transaction settings are sane for SQLAlchemy apps
- Restart PgBouncer and Prefect API pods after changes
3) Rule out HTTP caching in front of the API
- Ensure your reverse proxy/CDN isn't caching GET responses to Prefect API endpoints (Cache-Control headers, default TTLs like 900s/15m)
- Temporarily bypass the proxy and hit the API directly to compare behavior
4) Turn on more logging
- Enable debug logging on the API to confirm state write operations are succeeding immediately
- Watch DB slow query/lock logs around the time of state changes
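A rough sketch of steps 1 and 3 from a shell, assuming psql and curl are available; WRITER_DSN, REPLICA_DSN, and PREFECT_API_URL are placeholders for the connection strings and API URL your deployment actually uses:
# Step 1: the API's database should be the writer (pg_is_in_recovery() returns false on the primary)
psql "$WRITER_DSN" -c "SELECT pg_is_in_recovery();"
# On a replica, measure how far behind it is
psql "$REPLICA_DSN" -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
# Step 3: look for Cache-Control/Age headers injected by a proxy or CDN in front of the API
curl -sI "$PREFECT_API_URL/health"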
To confirm, could you share:
- Your Prefect version (assuming 3.x) and how you're deploying the API (k8s, Docker, etc.)
- The exact Postgres topology (managed service? primary/replicas? proxies like RDS Proxy or PgBouncer?)
- The database connection string that the Prefect API uses (redact secrets), and whether any read/write splitting is in play
- Whether your own SQL checks are run against the primary or a replica
Docs for self-hosting and tuning Prefect are here: https://docs.prefect.io
Once we know a bit more about your DB/proxy setup, we can pinpoint whether it's replica lag, pooling mode, or something else.
Ryan Sharp
08/21/2025, 8:01 PM
Marvin
08/21/2025, 8:01 PM
Marvin
08/21/2025, 8:17 PM
…task_run state. Redis is used for ephemeral coordination; Postgres is the durable source the API/UI read from.
- A consistent ~15 minute delay on every task strongly suggests a server-side batching/consumer issue, not a DB commit/lock problem.
What this usually means in Prefect 3.x
- Prefect's server emits task/flow run state events and has a background service (TaskRunRecorder) that consumes those events and persists them to Postgres.
- If the real-time consumer is not running, a periodic "catch-up" process will reconcile events at an interval; 15 minutes aligns exactly with that kind of backfill cadence. The result looks like: Redis shows immediate state changes, the API logs "ack", but the task_run rows only change in Postgres at a fixed interval.
How to confirm quickly
1) Check the services status
- From a host/container that can reach your Prefect server environment:
prefect server services list-services
You should see TaskRunRecorder (and other services) listed and running.
2) If services arenât running, start them
- Run them alongside your API containers or as a separate process/pod:
prefect server services start-services --background
- In k8s, many users run the API (uvicorn/gunicorn) in one deployment and a separate "services" deployment running the command above so the consumers always run (see the sketch after this list).
3) Check server logs
- Look for logs like "Starting TaskRunRecorder" and "consuming" vs only seeing periodic "processing followers"/"backfill" messages every ~15 minutes.
- If you see only periodic processing, the live consumer likely isn't running or isn't connected to the same Redis.
4) Sanity-check DB timing
- If rows bunch into groups every ~900 seconds, you're seeing the periodic backfill:
SELECT date_trunc('minute', updated) AS minute_bucket, count(*)
FROM task_run
WHERE updated > now() - interval '2 hours'
GROUP BY 1
ORDER BY 1 DESC;
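For step 2, a minimal sketch of a dedicated Kubernetes Deployment for the services process; the name, image tag, and the secret holding your DB/Redis settings are placeholders, not your actual manifests:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prefect-background-services
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefect-background-services
  template:
    metadata:
      labels:
        app: prefect-background-services
    spec:
      containers:
        - name: services
          image: prefecthq/prefect:3-latest            # pin to the same version as your API pods
          command: ["prefect", "server", "services", "start-services"]
          envFrom:
            - secretRef:
                name: prefect-server-env               # same Postgres/Redis settings as the API pods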
Other things to double-check
- All API pods and the services process must point to the same Redis and the same Postgres (your VIP). A mismatch can look like "acknowledged" in logs but writes never land until the backfill job runs.
- If you're launching the API with a custom uvicorn/gunicorn command, that does not automatically run services; you need a separate prefect server services start-services process.
- Only one services process is fine (it coordinates work), but running multiple is okay as well; they use consumer groups to share the load.
If TaskRunRecorder is already running
- Please share a few lines of its logs around a state change; we'll look for errors, retries, or configuration mismatches.
- Also confirm the Redis and Postgres connection strings used by your API pods and the services process (redact secrets); they must match.
Next step
- Can you run prefect server services list-services in your environment and paste the output (or confirm whether TaskRunRecorder is running)? If it's not, start the services and the UI should update in near real-time again.
Ryan Sharp
08/21/2025, 8:30 PM
Marvin
08/21/2025, 8:30 PM
Marvin
08/21/2025, 8:33 PM
1) Check the Redis event stream and consumer groups
- Use XINFO STREAM and XINFO GROUPS for the events stream, and make sure consumer lag isn't exploding and pending entries aren't stuck for ~15m (see the sketch after this list).
2) Sync clocks everywhere
- Ensure NTP/chrony is running on workers, all API servers, and the background services host. Even small skews across many publishers increase out-of-order rates.
3) Reduce moving parts as a test
- Temporarily scale the API down to a single pod and keep one services process running. If the 15-minute lag disappears, we've confirmed cross-pod ordering is contributing.
- If it persists with one API pod, focus on Redis stream retention/evictions and clocks.
4) Double-check all services are running and seeing the same events
- You've confirmed TaskRunRecorder is running. Also confirm EventPersister is running (you have logs from it, which is good).
- Look for repeated "arrived before the event it follows" messages for the same run across time; if you see them repeatedly until exactly when the 15-min backfill runs, that matches this diagnosis.
5) Upgrade to the latest 3.4.x
- You're on 3.4.12; please try the latest patch (3.4.14 at the time of writing) in case there are stability improvements around services/events. Full changelog: 3.4.12–3.4.14
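A quick sketch of the checks in steps 1 and 2, assuming redis-cli access to the same Redis the API pods use; the stream key and group names are placeholders (inspect your instance, e.g. with SCAN, to find what prefect-redis actually created):
# Step 1: stream and consumer-group health
redis-cli -u "$REDIS_URL" XINFO STREAM <events-stream-key>
redis-cli -u "$REDIS_URL" XINFO GROUPS <events-stream-key>
redis-cli -u "$REDIS_URL" XPENDING <events-stream-key> <consumer-group>
# Step 2: clock sanity on each API/worker/services host (chrony shown; use timedatectl elsewhere)
chronyc tracking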
Why the UI is "stuck" until Postgres updates
- Yes: the UI reads via the API, and the API uses Postgres as the durable source of truth for run state. Redis is only for ephemeral coordination/events. If the recorder defers a state change because a follower arrived before its predecessor, you won't see the update in the UI until it's persisted to Postgres.
If you can share:
- The Redis connection string used by all API pods and the services process (redact secrets)
- Whether your Redis has a maxmemory policy or explicit stream trimming
- Confirmation that NTP/chrony is active and offset is near zero across API/workers
- Whether the problem goes away with a single API pod
Marvin
08/21/2025, 8:33 PM
Ryan Sharp
08/22/2025, 2:49 PM
Marvin
08/22/2025, 2:49 PM
Marvin
08/22/2025, 2:59 PM
- PREFECT_SERVER_EVENTS__MESSAGING_BROKER=prefect_redis.messaging
- PREFECT_SERVER_EVENTS__MESSAGING_CACHE=prefect_redis.messaging
- Make sure the Redis connection settings are available in every pod that needs them and that the prefect-redis package is installed in those images.
- The Redis plugin uses its own settings class (prefect-redis). If you're using a URL, the simplest pattern is to provide a single URL consistently to all pods (for example via a secret/env var such as PREFECT_REDIS__URL=redis://:password@redis:6379/0). If you're breaking it into host/port/user/pass, apply the same values everywhere.
- If you can share what Redis env vars you're using, I can sanity-check them. The plugin settings are documented here: https://docs-3.prefect.io/integrations/prefect-redis/
- If you're running background services in a separate pod, confirm they are disabled on the API pods. Otherwise, you'll have multiple consumers competing on the same topics.
- In Helm or your manifests, that typically means the API deployment runs only the API, while the background-services deployment runs only the services. If you have a custom setup, I can help verify the correct enable/disable toggles for each service.
- All pods (API replicas and background services) must be on the same Prefect and prefect-redis versions.
A minimal Kubernetes env example (apply to BOTH API and background-services pods)
- name: PREFECT_SERVER_EVENTS__MESSAGING_BROKER
  value: prefect_redis.messaging
- name: PREFECT_SERVER_EVENTS__MESSAGING_CACHE
  value: prefect_redis.messaging
- name: PREFECT_REDIS__URL
  valueFrom:
    secretKeyRef:
      name: prefect-redis
      key: url
# Optional, but useful while validating
- name: PREFECT_LOGGING_LEVEL
  value: DEBUG
Why this matches your matrix
- 1 API, no separate background services, Redis → Works
- Even if the cache stayed in-memory, both the API and services are in the same pod/process, so the memory cache accidentally "works".
- 1 API, separate background services, Redis → Fails
- If one pod is on in-memory cache and the other uses Redis, they won't share dedup/order state; consumers will interfere and messages can get lost or mis-ordered.
- 3 API, separate background services, Redis → Fails
- The mismatch multiplies with more replicas.
- 3 API, separate background services, In-Memory → Works
- In-memory keeps each pod isolated; some internal server paths don't require a cross-pod broker/cache for basic orchestration. But it's not a scalable/HA setup.
If you can share:
- Prefect version and prefect-redis version
- The exact env vars you have on the API and background-services pods for messaging and Redis
- What "No" looks like (e.g., automations don't fire, scheduler doesn't enqueue, logs/events stop, etc.)
- A few DEBUG logs from the background-services pod (triggers/scheduler) around the time of failure
…I can help pinpoint further.
References
- Settings reference (server.events, including messaging_broker and messaging_cache): https://docs-3.prefect.io/v3/api-ref/settings-ref
- Redis integration docs: https://docs-3.prefect.io/integrations/prefect-redis/
Ryan Sharp
08/25/2025, 1:56 PM
Marvin
08/25/2025, 1:56 PM
Brendan Dalpe
08/25/2025, 1:58 PM
Pass --no-services in the command. This is already done if you set backgroundServices.runAsSeparateDeployment=true
https://github.com/PrefectHQ/prefect-helm/blob/68cf7c8e6909323a69905645708a8126fc056859/charts/prefect-server/templates/deployment.yaml#L76-L78
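For reference, a minimal values.yaml sketch of that toggle for the prefect-server chart (verify the key against your chart version):
backgroundServices:
  runAsSeparateDeployment: true   # API pods then start with --no-services automatically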
Marvin
08/25/2025, 2:00 PM
# API pods: disable services
server:
  api:
    extraEnv:
      - name: PREFECT_SERVER__SERVICES__SCHEDULER__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__FOREMAN__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__CANCELLATION_CLEANUP__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__LATE_RUNS__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__PAUSE_EXPIRATIONS__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__REPOSSESSOR__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__TRIGGERS__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__TASK_RUN_RECORDER__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__EVENT_PERSISTER__ENABLED
        value: "false"
      - name: PREFECT_SERVER__SERVICES__EVENT_LOGGER__ENABLED
        value: "false"

# Services pod: enable the ones you want
server:
  services:
    extraEnv:
      - name: PREFECT_SERVER__SERVICES__SCHEDULER__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__FOREMAN__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__CANCELLATION_CLEANUP__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__LATE_RUNS__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__PAUSE_EXPIRATIONS__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__REPOSSESSOR__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__TRIGGERS__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__TASK_RUN_RECORDER__ENABLED
        value: "true"
      - name: PREFECT_SERVER__SERVICES__EVENT_PERSISTER__ENABLED
        value: "true"
      # Event logger is optional; keep false unless you specifically need it
      - name: PREFECT_SERVER__SERVICES__EVENT_LOGGER__ENABLED
        value: "false"
Docs for reference:
- Settings reference (lists all services and fields): Prefect 3 Settings reference
- Helm/Kubernetes guidance: Running Prefect Server with Helm
If you can share which Helm chart (and version) you're using, I can tailor the exact values keys (e.g., where to put extraEnv for the API vs services deployments) to match your chart's structure.
Ryan Sharp
08/25/2025, 3:01 PM
Ryan Sharp
08/25/2025, 3:12 PM
Ryan Sharp
08/25/2025, 3:14 PM
Brendan Dalpe
08/25/2025, 4:59 PM
PREFECT_SERVER_LOGGING_LEVEL=DEBUG prefect server start --no-services
___ ___ ___ ___ ___ ___ _____
| _ \ _ \ __| __| __/ __|_ _|
| _/ / _|| _|| _| (__ | |
|_| |_|_\___|_| |___\___| |_|
Configure Prefect to communicate with the server with:
prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api
View the API reference documentation at http://127.0.0.1:4200/docs
Check out the dashboard at http://127.0.0.1:4200
INFO: Started server process [91831]
INFO: Waiting for application startup.
11:57:09.705 | DEBUG | prefect.server.services.task_run_recorder - TaskRunRecorder started
11:57:09.706 | DEBUG | prefect.server.events.services.event_persister - Event persister started
11:57:09.706 | DEBUG | prefect.server.services.proactivetriggers - Starting ProactiveTriggers
11:57:09.706 | DEBUG | prefect.server.services.proactivetriggers - About to run ProactiveTriggers...
11:57:09.706 | DEBUG | prefect.server.services.proactivetriggers - Finished running ProactiveTriggers. Next run at 2025-08-25 16:57:14.706510+00:00
11:57:09.707 | INFO | prefect.server.events.services.triggers - ReactiveTriggers starting with unique consumer name: reactive-triggers-mbp.localdomain-91831-05d39666
11:57:09.707 | DEBUG | prefect.server.events.triggers - Loading automations
11:57:09.731 | INFO | prefect.server.events.actions - Starting action message handler
11:57:09.731 | DEBUG | prefect.server.events.services.actions - Actions started
INFO: Application startup complete.
11:57:09.732 | INFO | prefect.server.services.task_run_recorder - Starting periodically process followers task every 5.0 seconds
11:57:09.732 | INFO | prefect.server.events.triggers - Starting automation change listener