Slackbot
09/26/2025, 2:45 PMMarvin
09/26/2025, 2:45 PMMarvin
09/26/2025, 2:49 PMdeployment
, work_queue
, and the automation_*
tables.
Why this is happening
- These specific tables are tiny but updated very frequently by the Prefect server (e.g., heartbeats, watermarks, poll times). That pattern creates lots of dead tuples quickly.
- If autovacuum isn’t aggressive enough, or if long‑running transactions hold the xmin horizon, dead tuples won’t be reclaimed and bloat accumulates.
- On tiny, hot tables, default autovacuum thresholds can be too high (scale factor never triggers), and frequent updates to the same pages can outpace vacuum.
Step 1 — Check for blockers (critical)
If VACUUM isn’t reducing dead tuples, a long‑running transaction is the usual culprit.
Run these and share results:
-- Any long-running transactions that can block pruning?
SELECT pid, usename, state, xact_start, query_start, backend_type, wait_event_type, wait_event, query
FROM pg_stat_activity
WHERE datname = current_database() AND xact_start IS NOT NULL
ORDER BY xact_start;
-- Table-level vacuum/analyze history
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum, vacuum_count, autovacuum_count, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('deployment','work_queue','automation_event_follower','automation_bucket');
-- Any per-table autovacuum overrides already set?
SELECT c.relname, c.reloptions
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog','information_schema')
AND c.relname IN ('deployment','work_queue','automation_event_follower','automation_bucket');
-- Check XID age (sanity check)
SELECT c.relname, age(c.relfrozenxid) AS xid_age
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname NOT IN ('pg_catalog','information_schema')
AND c.relname IN ('deployment','work_queue','automation_event_follower','automation_bucket');
If you see “idle in transaction” sessions hanging around for minutes/hours, consider:
- Temporarily setting idle_in_transaction_session_timeout
(e.g., 5m)
- Killing offending sessions before running an aggressive vacuum
Step 2 — Run an aggressive manual VACUUM and inspect output
This will tell us if old snapshots are preventing cleanup.
VACUUM (VERBOSE, FREEZE, ANALYZE) public.deployment;
VACUUM (VERBOSE, FREEZE, ANALYZE) public.work_queue;
VACUUM (VERBOSE, FREEZE, ANALYZE) public.automation_event_follower;
VACUUM (VERBOSE, FREEZE, ANALYZE) public.automation_bucket;
If the VERBOSE output mentions it could not remove tuples due to old snapshots, go back to Step 1 and clear blockers.
Step 3 — Make autovacuum far more aggressive for these hot, tiny tables
For very small, update-heavy tables, make vacuum/analyze trigger on small absolute changes rather than scale factors:
```
ALTER TABLE public.deployment SET (
autovacuum_enabled = true,
autovacuum_vacuum_scale_factor = 0.0,
autovacuum_vacuum_threshold = 20,
autovacuum_analyze_scale_factor = 0.0,
autovacuum_analyze_threshold = 10,
autovacuum_vacuum_cost_limit = 2000,
autovacuum_vacuum_cost_delay = 10
);
ALTER TABLE public.work_queue SET (
autovacuum_enabled = true,
autovacuum_vacuum_scale_factor = 0.0,
autovacuum_vacuum_threshold = 20,
autovacuum_analyze_scale_factor = 0.0,
autovacuum_analyze_threshold = 10,
autovacuum_vacuum_cost_limit = 2000,
autovacuum_vacuum_cost_delay = 10
);
ALTER TABLE public.automation_event_follower SET (
autovacuum_enabled = true,
autovacuum_vacuum_scale_factor = 0.0,
autovacuum_vacuum_threshold = 20,
autovacuum_analyze_scale_factor = 0.0,
autovacuum_analyze_threshold = 10,
autovacuum_vacuum_cost_limit = 2000,
autovacuum_vacuum_cost_delay = 10
);Marvin
09/26/2025, 2:49 PMNotes:
- Setting scale factors to 0 forces autovacuum/analyze to fire based on the absolute thresholds (good for small tables).
- You can tune thresholds (e.g., 10–50) based on observed churn.
- Consider temporarily setting `log_autovacuum_min_duration = 0` to log what autovacuum is doing and confirm it’s running frequently enough; revert after tuning.
Step 4 — Reindex concurrently if indexes are bloated
Even when you reclaim heap tuples, index bloat can linger.
REINDEX TABLE CONCURRENTLY public.deployment;
REINDEX TABLE CONCURRENTLY public.work_queue;
REINDEX TABLE CONCURRENTLY public.automation_event_follower;
REINDEX TABLE CONCURRENTLY public.automation_bucket;
Step 5 — Optional hardening for update-heavy tables: FILLFACTOR for HOT updates
For tiny tables that get updated constantly (and don’t change indexed columns), lowering fillfactor leaves room on each page to keep updates HOT, reducing index churn:
ALTER TABLE public.work_queue SET (fillfactor = 80);
ALTER TABLE public.deployment SET (fillfactor = 80);
ALTER TABLE public.automation_event_follower SET (fillfactor = 80);
ALTER TABLE public.automation_bucket SET (fillfactor = 80);
```
Note: fillfactor applies to newly written pages. To fully realize the benefit you need a table rewrite (pg_repack or VACUUM FULL or CLUSTER). If you want to avoid locks, schedule pg_repack --no-order --wait-timeout=...
during a quiet window.
When to use pg_repack vs VACUUM FULL
- If you confirm no old snapshots and aggressive VACUUM + REINDEX CONCURRENTLY isn’t enough, pg_repack
is the online, lower-impact way to rewrite and shrink these tables.
- VACUUM FULL
will also work but will take an exclusive lock; since these are small tables the lock may be short but it will block writers.
Prefect-specific notes
- Prefect 3.4.7 has no known bug that would uniquely cause this bloat pattern; these tables are just update-heavy by design. If feasible, consider upgrading to the latest 3.x for general fixes.
- For broader DB guidance and retention strategies, see:
- Prefect docs: Database maintenance
- Blog: More memory, more problems
- Blog: Partitioning Postgres without downtime
Next steps
- Let me know:
- Results of the queries in Step 1 and the VACUUM VERBOSE output
- Whether you have constraints on downtime/locks (helps decide between pg_repack and VACUUM FULL)
- If you’re open to setting table-level autovacuum overrides and timeouts (e.g., idle_in_transaction_session_timeout
)
- I can help you pick exact thresholds and finalize a maintenance runbook once we see the blocker/age stats.