# ask-marvin
Hi @Marvin My Prefect database contains a `task_run_state` table (355 GB, 520,956,207 records) and a `task_run` table (290 GB, 173,330,799 records). I plan to stop the Prefect server, take a backup of 1 week of data from both tables into new backup tables, truncate the original tables, and then restore the data from the backups. Will this approach cause any problems? Note: I cannot delete records directly with a query such as `DELETE FROM task_run WHERE created < current_date - 7;` since this would take hours to complete. Can you suggest another way to achieve this?
thought for 337.7 seconds
Short answer: your plan can work, but there are important gotchas to avoid breaking referential integrity and to keep it performant.

Key considerations before you proceed
- There is no built-in Prefect "prune/retention" command for task/flow runs. The CLI has reset/upgrade/downgrade, but no prune; you can verify with `prefect server database --help` (no prune subcommand).
- Prefect's run/state tables are related. In most deployments:
  - task_run_state rows reference task_run
  - task_run rows reference flow_run
  - flow_run_state rows reference flow_run
  - other tables may reference runs (e.g., logs, artifacts, task_run_input/result, etc.)
- Doing CTAS/"swap tables" breaks FKs from other tables because FKs point to object IDs, not names. TRUNCATE + reinsert into the original tables avoids that.
- A single massive DELETE is slow and causes bloat; your "copy subset → TRUNCATE → restore subset" approach is a known faster pattern, but you must include all dependent rows and restore in the correct order.

A safer runbook for your approach (downtime window)

Assumptions: PostgreSQL, a `created` timestamp column, and the ability to stop all writers (Prefect server and workers/agents).

1) Stop all writers
- Stop Prefect Server and workers/agents to prevent new rows during maintenance.

2) Inspect constraints so you know the dependency order
- Find FKs that reference these tables so you don't violate integrity on restore:
```sql
-- who references task_run?
SELECT
  tc.constraint_name, tc.table_name AS child_table, kcu.column_name AS child_column,
  ccu.table_name AS parent_table, ccu.column_name AS parent_column
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage AS ccu ON ccu.constraint_name = tc.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY' AND ccu.table_name IN ('task_run','task_run_state','flow_run','flow_run_state')
ORDER BY parent_table, child_table;
```
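Optionally, confirm the on-disk size of each table first so you can verify there is enough free space for the backup copies (a quick sketch using the Postgres system catalogs):

```sql
-- Total size (heap + indexes + TOAST) for the affected tables
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relname IN ('task_run', 'task_run_state', 'flow_run', 'flow_run_state')
ORDER BY pg_total_relation_size(oid) DESC;
```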
3) Create backup tables with exactly the rows you'll keep
- Build the backup subset starting from the task runs you want to keep, then pull in their parent flow runs and both state histories, so the restore can simply reinsert parents first. If you only keep task runs for the last 7 days:
```sql
-- Tune this cutoff to your need
CREATE TEMP TABLE cutoff AS SELECT NOW() - INTERVAL '7 days' AS ts;

-- Keep task runs created in the last 7 days
CREATE UNLOGGED TABLE task_run_bak AS
SELECT tr.*
FROM task_run tr, cutoff c
WHERE tr.created >= c.ts;

-- Keep the associated flow runs
CREATE UNLOGGED TABLE flow_run_bak AS
SELECT DISTINCT fr.*
FROM flow_run fr
JOIN task_run_bak tr ON tr.flow_run_id = fr.id;
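-- Note: this keeps only flow runs that have at least one task run in the window;
-- recent flow runs with no task runs are dropped. Add a second query on fr.created
-- if you also need those.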

-- Keep flow run state history for those flow runs
CREATE UNLOGGED TABLE flow_run_state_bak AS
SELECT frs.*
FROM flow_run_state frs
JOIN flow_run_bak fr ON frs.flow_run_id = fr.id;

-- Keep task run state history for those task runs
CREATE UNLOGGED TABLE task_run_state_bak AS
SELECT trs.*
FROM task_run_state trs
JOIN task_run_bak tr ON trs.task_run_id = tr.id;

-- Optional: If you have logs/artifacts referencing runs, back up the last week for those ids as well
-- Example (adjust to your schema):
-- CREATE UNLOGGED TABLE log_bak AS
-- SELECT l.* FROM log l
-- LEFT JOIN task_run_bak tr ON tr.id = l.task_run_id
-- LEFT JOIN flow_run_bak fr ON fr.id = l.flow_run_id
-- WHERE tr.id IS NOT NULL OR fr.id IS NOT NULL;
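
-- Optional sanity check before truncating: row counts of what you are keeping
SELECT 'task_run_bak' AS tbl, count(*) AS n FROM task_run_bak
UNION ALL SELECT 'flow_run_bak', count(*) FROM flow_run_bak
UNION ALL SELECT 'flow_run_state_bak', count(*) FROM flow_run_state_bak
UNION ALL SELECT 'task_run_state_bak', count(*) FROM task_run_state_bak;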
```
Notes:
- UNLOGGED speeds up writes and reduces WAL during the backup step. These are throwaway tables for this maintenance window.

4) Truncate the original tables
- Postgres will not TRUNCATE a table that other tables reference via FK unless those tables are truncated in the same command, so list all four together:
```sql
-- FK-linked tables must be truncated in one statement
TRUNCATE TABLE task_run_state, flow_run_state, task_run, flow_run;
-- Add any other dependent tables you backed up (e.g., logs) to the same statement
```
If FK constraints still block the truncate (e.g., tables you did not list also reference these), add them to the same TRUNCATE or temporarily disable FK triggers during the window (requires superuser):
```sql
ALTER TABLE task_run_state DISABLE TRIGGER ALL;
ALTER TABLE flow_run_state DISABLE TRIGGER ALL;
ALTER TABLE task_run DISABLE TRIGGER ALL;
ALTER TABLE flow_run DISABLE TRIGGER ALL;
-- do your TRUNCATE/INSERTs, then re-enable
```
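Caution: while triggers are disabled, FK checks are skipped entirely, and re-enabling them does not re-validate rows inserted in the meantime, so the restore order below still matters.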
5) Restore in dependency order (parents first)
```sql
INSERT INTO flow_run SELECT * FROM flow_run_bak;
INSERT INTO flow_run_state SELECT * FROM flow_run_state_bak;
INSERT INTO task_run SELECT * FROM task_run_bak;
INSERT INTO task_run_state SELECT * FROM task_run_state_bak;
-- Restore any other dependent tables (e.g., logs) last
```
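Note: `INSERT INTO ... SELECT *` relies on the backup tables having the same column order as the originals, which CREATE TABLE AS guarantees here; if anything about the schema changed in between, list the columns explicitly.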
6) Re-enable triggers (if disabled), analyze, and reclaim space
```sql
ALTER TABLE flow_run ENABLE TRIGGER ALL;
ALTER TABLE flow_run_state ENABLE TRIGGER ALL;
ALTER TABLE task_run ENABLE TRIGGER ALL;
ALTER TABLE task_run_state ENABLE TRIGGER ALL;

ANALYZE flow_run;
ANALYZE flow_run_state;
ANALYZE task_run;
ANALYZE task_run_state;
```
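If blocking is acceptable inside the same maintenance window, a plain VACUUM FULL reclaims the space immediately (a minimal sketch for the four tables above):

```sql
-- VACUUM FULL rewrites each table and returns disk to the OS,
-- but holds an ACCESS EXCLUSIVE lock for the duration
VACUUM FULL task_run_state;
VACUUM FULL task_run;
VACUUM FULL flow_run_state;
VACUUM FULL flow_run;
```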
To reclaim space:
- Without blocking: consider pg_repack after the operation (see the pg_repack docs).
- If you can afford blocking, VACUUM FULL also reclaims space but locks each table (see the sketch above and the VACUUM docs).

What could go wrong with your initial plan?
- Only backing up task_run and task_run_state may violate FKs from/to flow_run, flow_run_state, logs, artifacts, etc. Include those parents/children as shown.
- CTAS and swapping tables will break existing FKs from other tables; prefer TRUNCATE + reinsert into the same tables.
- Long single-transaction operations can bloat the WAL; using UNLOGGED backup tables and doing the restore in one pass helps. Ensure you have enough free disk for the temporary backup copies.
- Make sure there's an index on `created` for fast filtering when building your backups.

Online alternative (no full downtime): batched deletes
- If you cannot take long downtime, do batched deletes that target child tables first.
- Ensure useful indexes first: task_run.created and task_run_state.task_run_id (see the sketch after the example below).
- Example pattern (repeat until 0 rows deleted):
```sql
-- Delete task_run_state in batches for old task runs
WITH to_del AS (
  SELECT s.ctid
  FROM task_run_state s
  JOIN task_run t ON t.id = s.task_run_id
  WHERE t.created < NOW() - INTERVAL '7 days'
  LIMIT 20000
)
DELETE FROM task_run_state s
USING to_del
WHERE s.ctid = to_del.ctid;

-- Then delete task_run in batches
WITH to_del AS (
  SELECT tr.ctid
  FROM task_run tr
  WHERE tr.created < NOW() - INTERVAL '7 days'
  LIMIT 10000
)
DELETE FROM task_run tr
USING to_del
WHERE tr.ctid = to_del.ctid;
```
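If either supporting index is missing, it can be built without long locks before starting the batches (a sketch; the index names here are made up, and Prefect's schema may already include equivalents):

```sql
-- CONCURRENTLY avoids blocking writers but cannot run inside a transaction block
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_task_run_created
    ON task_run (created);
CREATE INDEX CONCURRENTLY IF NOT EXISTS ix_task_run_state_task_run_id
    ON task_run_state (task_run_id);
```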
- Run autovacuum aggressively or use pg_repack afterward to reclaim space.

Long-term fix: native table partitioning by created
- Partition task_run and task_run_state (and optionally flow_run/flow_run_state) by date (daily/weekly). Then "delete" becomes dropping a partition, which is instant and safe; see the sketch at the end of this message.
- Postgres partitioning docs: PostgreSQL partitioning
- Converting existing large tables to partitioned usually requires a maintenance window to create a new partitioned table, backfill, and swap; plan carefully due to the FKs noted above.

Confirm a few details?
- Which Postgres version are you on?
- Can you include flow_run/flow_run_state (and logs/artifacts if they have FKs) in the same pruning window?
- How much downtime is acceptable? If minimal, I can share a batched-delete script sized to your workload and indexes.

I'm happy to tailor the exact SQL to your schema once you confirm the FK relationships and downtime tolerance.
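For reference, a minimal partitioning sketch (the table name, column list, and boundaries are illustrative, not Prefect's actual DDL):

```sql
-- Range-partitioned shadow table; real DDL would copy all task_run columns
CREATE TABLE task_run_partitioned (
    id uuid NOT NULL,
    created timestamptz NOT NULL
) PARTITION BY RANGE (created);

-- One partition per week
CREATE TABLE task_run_2024_w01 PARTITION OF task_run_partitioned
    FOR VALUES FROM ('2024-01-01') TO ('2024-01-08');

-- Retention becomes detach + drop, which is near-instant
ALTER TABLE task_run_partitioned DETACH PARTITION task_run_2024_w01;
DROP TABLE task_run_2024_w01;
```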