Josh Greenhalgh


    1 year ago
    Is there any movement towards a more Airflow-like backfilling feature? It's literally the one thing I miss. I would really love to be able to set a start date for a schedule in the past and have the flows catch up to the present. Currently I think the only way to achieve this is to use params to differentiate past runs from future ones and to manually start all the backfill flows. Is there perhaps a better way currently?
    Maybe there is a better way to think about the idea of "backfilling" that isn't so wedded to Airflow and that might clarify things for me?
    Jeremiah


    1 year ago
    Hi @Josh Greenhalgh, in Airflow, runs can’t be parameterized and are inseparably tied to an execution_date, so there’s a need for “backfills” that effectively hack around that to simulate runs that took place in the past. In Prefect, we make no assumptions about the relationship between your workflow and time. Instead, if your workflow does depend on a specific time, best practice is to introduce that dependency as a parameter or via context. We think of workflows as functions, so in this case your execution_date is simply an argument to that function. For most runs you might use a default value like the current date or the run’s scheduled_start_time, but you could always pass a historical date for backfill semantics. Here’s a more complete explanation from Stack Overflow with some sample code: https://stackoverflow.com/questions/64029629/is-there-a-way-to-backfill-historical-data-once-for-a-new-flow-in-prefect
    By the way, if you need to generate the list of dates for your backfill, one easy way is to use your flow’s schedule (if it has one).
    flow.schedule.next(n=50, after=pendulum.datetime(2021, 1, 1))
    This will generate the next 50 scheduled start times for this flow’s schedule, beginning after January 1, 2021.
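The "workflow as a function" idea above can be sketched in plain Python. This is only an illustration under stated assumptions, not Prefect's API: `daily_report` and `daily_dates` are hypothetical names, and `daily_dates` assumes a simple daily cadence that includes the start date, rather than reading a real flow schedule.

```python
from datetime import date, timedelta
from typing import List, Optional


def daily_report(as_of: Optional[date] = None) -> str:
    """Hypothetical workflow: the date it reports on is just an argument."""
    # With no explicit date, fall back to today, as a normal scheduled run might.
    if as_of is None:
        as_of = date.today()
    return f"report for {as_of.isoformat()}"


def daily_dates(start: date, n: int) -> List[date]:
    """Hypothetical helper: n consecutive daily dates, starting at `start`."""
    return [start + timedelta(days=i) for i in range(n)]


# A "backfill" is the same function called with historical dates.
backfill = [daily_report(as_of=d) for d in daily_dates(date(2021, 1, 1), 3)]
```

Calling `daily_report()` with no argument behaves like a regular run for the current date, while the list comprehension plays the role of the backfill.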
    Josh Greenhalgh


    1 year ago
    Thanks! That doesn't play very seamlessly with the existing idea of schedules though, does it? If I need to submit past flow runs with params but can schedule future runs, there's an uncomfortable disconnect. If I'm running my own process for the past, why would I not (I won't...) run my own process to trigger all the future schedules as well?
    Is better usage of flows triggering other flows perhaps what I am missing? For example, I could envision a flow that does the backfill for me, which I then trigger on an ad hoc basis.
    Jeremiah


    1 year ago
    In Prefect, the Flow is the first-class object and is designed to be run at any time and for any reason. Schedules are just convenient ways to tell Prefect that you want it to kick off new runs automatically. But there’s no requirement for a flow to have a schedule - indeed, many people never use them because they kick off new flow runs ad-hoc whenever they need. The reason this is possible is because Prefect flows are parameterized, which is different from Airflow DAGs, which can only take time inputs from their schedule. Therefore, if your flow’s behavior changes depending on what date or time it is, it’s best practice in Prefect to parameterize time as an input to the flow. That way, you can run your flow whenever you want - either on or off schedule - and still produce the expected behavior for a given “as-of date”. For example, you could kick off a flow run right now and produce the results you would have expected at the end of the year (if you provided 12/31/2020 as the as-of date).
    So when we talk about “backfills”, that’s really all it is from Prefect’s point of view: you kick off a bunch of flow runs now that happen to accept past dates as part of their parameterized input. It’s just the same as kicking off any off-schedule flow run, except that you’re providing inputs that happen to correspond to known dates.
    And finally back to schedules: Prefect can’t automatically run flows that have parameters unless the parameter has a default value, because it has no way of knowing what value to use! But if your flow does have default parameters, you can have the best of both worlds: Prefect can kick off new runs automatically, and you can override those runs at any time to backfill results.
    By the way, if you want to run your own process to schedule future runs, you absolutely can! But the beauty of this system is that you don’t have to, without sacrificing the ability to run ad-hoc backfills (or any other type of run) whenever you want.
    Also, running another flow to generate backfills is great! Absolutely nothing wrong with that. In addition, we might look at exposing a Prefect API for generating the parameterized runs automatically… don’t want to make any promises about that, but we’ll take it under consideration. Because Prefect parameters are so arbitrary, it might be hard to create a one-size-fits-all feature.
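A rough sketch of the "flow that generates backfills" idea, again in plain Python rather than Prefect's API. Everything here is hypothetical: `load_sales` stands in for a flow whose parameters have defaults, and `run_backfill` stands in for the driver that starts one run per as-of date with overridden parameters.

```python
from datetime import date
from typing import Any, Callable, Dict, Iterable, List, Optional


def load_sales(as_of: Optional[date] = None, region: str = "all") -> Dict[str, Any]:
    """Hypothetical flow: every parameter has a default, so it can run unattended."""
    # A scheduled run passes nothing and gets today's date.
    if as_of is None:
        as_of = date.today()
    return {"as_of": as_of, "region": region}


def run_backfill(
    flow_fn: Callable[..., Dict[str, Any]],
    as_of_dates: Iterable[date],
    **fixed_params: Any,
) -> List[Dict[str, Any]]:
    """Hypothetical driver: one run per date, overriding the default as-of date."""
    return [flow_fn(as_of=d, **fixed_params) for d in as_of_dates]


# Off-schedule runs for two past dates, with one extra parameter overridden.
results = run_backfill(
    load_sales,
    [date(2020, 12, 30), date(2020, 12, 31)],
    region="emea",
)
```

Because the defaults stay in place, the same `load_sales` function serves both the unattended case and the backfill case; only the inputs differ.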