# ask-marvin
@Marvin What would you recommend as the best way to implement backfilling into our deployments?
Great question. In Prefect 3.x, the most reliable way to “backfill” a deployment is to explicitly create the historical flow runs you need, while controlling concurrency so you don’t overwhelm your infra. Here are the recommended patterns (pick what fits your workflow and volume):

1) Programmatic backfill driver (recommended for most cases)
- Use the Python API to submit one run per partition (e.g., per day/hour) with parameters, tags, and idempotency keys.
- Gate concurrency with an asyncio semaphore or via Prefect concurrency features (see links below; a sketch of the concurrency-limit variant follows the example).
- This gives you full control and is easy to repeat/monitor.

Example:
```python
import asyncio
import pendulum
from prefect.deployments.flow_runs import run_deployment

DEPLOYMENT = "my-flow/prod"  # <FLOW_NAME>/<DEPLOYMENT_NAME>
START = pendulum.datetime(2024, 1, 1, tz="UTC")
END = pendulum.datetime(2024, 1, 31, tz="UTC")
MAX_PARALLEL = 5

async def submit_one(ds):
    # run_deployment waits for the flow run to finish by default, so the
    # semaphore below caps how many backfill runs execute at once.
    # Pass timeout=0 if you only want to create the runs without waiting.
    return await run_deployment(
        name=DEPLOYMENT,
        parameters={"ds": ds.to_date_string()},  # adjust to your flow’s param(s)
        flow_run_name=f"backfill-{ds.to_date_string()}",
        scheduled_time=pendulum.now("UTC"),      # run ASAP
        tags=["backfill"],
        idempotency_key=f"backfill:{DEPLOYMENT}:{ds.to_date_string()}",
    )

async def main():
    sem = asyncio.Semaphore(MAX_PARALLEL)
    async def with_sem(ds):
        async with sem:
            return await submit_one(ds)

    days = (END - START).days + 1  # inclusive date range
    dates = [START.add(days=i) for i in range(days)]
    await asyncio.gather(*(with_sem(d) for d in dates))

if __name__ == "__main__":
    asyncio.run(main())
```
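If you would rather have Prefect gate the load than a local semaphore, you can wrap the submission in a global concurrency limit instead. A minimal sketch, assuming a global concurrency limit named "backfill" has already been created (via the UI or CLI) and that `with_limit` (a hypothetical name) replaces `with_sem` inside `main()` above:

```python
from prefect.concurrency.asyncio import concurrency

async def with_limit(ds):
    # Occupies one slot of the pre-created "backfill" global concurrency limit
    # for the duration of the blocking run_deployment call, so the limit's
    # value caps how many backfill runs execute at once.
    async with concurrency("backfill", occupy=1):
        return await submit_one(ds)
```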
2) CLI for small backfills or ad-hoc runs
- Creates a single run. Loop in your shell for short ranges (a sketch for generating longer date ranges follows the examples).
- You can schedule start times, pass parameters, and watch.

Examples:
```bash
# Single run now with parameters
prefect deployment run my-flow/prod -p ds=2024-01-01

# Single run that starts at a specific time
prefect deployment run my-flow/prod -p ds=2024-01-01 --start-at "2024-01-01T00:05:00Z"

# Simple loop (bash) for a handful of days
for d in 2024-01-01 2024-01-02 2024-01-03; do
  prefect deployment run my-flow/prod -p ds=$d
done
```
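For a longer contiguous range, you can generate the dates rather than listing them by hand. A rough sketch of the same loop, assuming GNU date (on macOS, gdate from coreutils):

```bash
d="2024-01-01"
until [ "$d" \> "2024-01-31" ]; do
  prefect deployment run my-flow/prod -p ds=$d
  d=$(date -I -d "$d + 1 day")   # GNU date: next calendar day in ISO format
done
```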
We verified the CLI structure:
- prefect deployment run: create a run
- prefect deployment schedule create: manage schedules
- prefect concurrency-limit ... and prefect work-pool ...: concurrency controls

3) Temporary schedule for a controlled set of timestamps
- If you want Prefect to generate the occurrences for you, you can create a temporary schedule (e.g., an RRule for a specific date range), let the scheduler create runs, then delete the schedule afterward (a sketch of the cleanup commands follows the example).
- This is neat for precise, enumerated points-in-time, but programmatic submission (Pattern 1) tends to be more explicit and repeatable for backfills.

Example:
```bash
prefect deployment schedule create my-flow/prod \
  --rrule "FREQ=DAILY;DTSTART=20240101T000000Z;UNTIL=20240131T000000Z"
```
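To clean up afterward (or to pause a regular schedule, per the tips further down), the prefect deployment schedule command group also provides ls, pause, resume, and delete subcommands. A rough sketch, assuming recent Prefect 3.x CLI behavior (argument shapes may vary slightly by version; <schedule-id> comes from the ls output):

```bash
# Find the ID of the temporary backfill schedule
prefect deployment schedule ls my-flow/prod

# Remove it once the backfill runs have been created
prefect deployment schedule delete my-flow/prod <schedule-id>
```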
Note: Schedules focus on upcoming occurrences. For past-heavy backfills with lots of points, the programmatic approach is typically simpler and more predictable.

Controlling concurrency and load
- Work pool concurrency: throttle total concurrent flow runs executed by a pool
```bash
prefect work-pool set-concurrency-limit <your-pool> 5
```
- Task-level concurrency limits: apply a tag to heavy tasks (e.g., db writes) and set a limit so backfills don’t overwhelm systems
```bash
prefect concurrency-limit create warehouse_writes 3
```
Then tag the relevant tasks in your flow with `tags=["warehouse_writes"]` (a minimal sketch follows).
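For example, a minimal sketch of a task carrying the tag so it counts against the warehouse_writes limit (the task and flow names are hypothetical):

```python
from prefect import flow, task

@task(tags=["warehouse_writes"])  # task runs with this tag count against the limit
def load_partition(ds: str):
    ...  # heavy warehouse write for one partition

@flow
def my_flow(ds: str):
    load_partition(ds)
```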
- You can also submit runs with a “backfill” tag (as in the Python example) to filter/monitor in the UI.

Operational tips
- Use idempotency keys to avoid duplicate submissions for the same partition.
- Name runs with the partition (e.g., backfill-YYYY-MM-DD) for traceability.
- Consider pausing the “regular” schedule during big backfills if you want to isolate the load.
- For very large ranges, submit in batches to avoid overloading your API or infra (a rough sketch follows this list).
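A rough sketch of batched submission, reusing the `submit_one` helper from the driver above (the helper name, batch size, and pause are illustrative):

```python
import asyncio

async def backfill_in_batches(dates, submit, max_parallel=5, batch_size=50, pause_seconds=60):
    """Create runs for `dates` in chunks, calling `submit(ds)` for each partition."""
    sem = asyncio.Semaphore(max_parallel)

    async def gated(ds):
        async with sem:
            return await submit(ds)

    for i in range(0, len(dates), batch_size):
        batch = dates[i : i + batch_size]
        await asyncio.gather(*(gated(d) for d in batch))
        if i + batch_size < len(dates):
            await asyncio.sleep(pause_seconds)  # give the API and infra a breather
```

From `main()` in the driver above, you would call `await backfill_in_batches(dates, submit_one)` instead of the single `gather`.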
Docs and references
- Schedules overview: Prefect 3 Schedules
- Concurrency limits (CLI): CLI: concurrency-limit
- Global concurrency/how-to: Global concurrency limits
- Python API used above: `prefect.deployments.flow_runs.run_deployment` (we verified availability and signature)

If you can share a bit more about your backfill shape (e.g., date-partitioned daily/hourly? expected number of runs? deployment/work pool type?), I can tailor a concrete plan and code snippet for your environment.