Hey all, I'm working on starting what's essentially a data warehouse at my company, and thinking of using Prefect to schedule and orchestrate all the ETL. We're going to have a number of 3rd party data sources to pull data from on various schedules, and then we'll likely want to schedule some transformations after certain combinations of tables are finished loading each day. I'm trying to get the project architecture off on the right foot and have been trying out Prefect for a couple days, and I'm wondering how I should think about organizing my Flows.

Should I have one flow per data source, or one flow for the whole pipeline? My dilemma is that each data source is going to have its own schedule, which leads me toward one Flow per source, but if I want to trigger transformations based on the completion of table loads, it feels like the flows are going to have dependencies on each other's completions and would be better off as one flow.

Thoughts? Any examples out there of similar projects?
2 years ago
Hey @Daniel Veenstra — great question! This is a really common use of Prefect. In fact, we’ve been setting up Prefect (Inc’s) own data warehouse this week and we’ll be publishing exactly how we did it soon (for the moment, a Postgres → BigQuery ETL).
Both approaches can work, so I think the choice will depend on which gives you the most introspection and control. Because each data source will run on its own schedule, a flow-per-source feels natural to me; each flow will act as a single “function” that you call to process a single source.
We don’t have first-class support for flow-to-flow triggers (yet! it’s coming), so you would have to decide on the appropriate way to set up the transformation triggers once upstream tables are loaded. One common approach (going back to Airflow) is to have a flow that periodically pings the upstream tables and, once all of them are ready, runs its transformations.
However, if your transformations are strict 1:1 triggers, you may in fact prefer a single flow, and you could explore Prefect’s caching mechanisms to ensure that each data source only runs on the appropriate cadence. I have a small concern that this will introduce greater complexity than it solves.
We spend a lot of time thinking about this use case with various partners and I’m happy to share insights — just DM me if there’s anything you’d prefer to keep private.
2 years ago
OK, awesome, I'll keep an eye out for that. That makes sense.

Cool to hear that flow-to-flow triggers are coming, as a flow-of-flows was one of the ways that occurred to me to solve this: a way to abstract each individual flow as a task inside of a larger flow, though I don't understand enough about the internals to know if that even makes sense architecturally.

Periodically pinging the upstream seems like a good approach for now, so thanks for that idea as well.
2 years ago
Flow-of-flows could definitely work, but will inherit the same difficulties with multiple schedules. You’d basically set up each task to adopt the final state of the flow it runs, and then Prefect triggers could handle the dependency logic.
2 years ago
Many of the transformations will likely be 1:1, but some will probably be joins between data sources that load on different schedules. How would Prefect's caching mechanisms enforce the data load cadence? I'm not sure I'm following you there.
Interesting that flow-of-flows could already be implemented, I hadn't realized a task could run a flow as you described. But I see now how that still inherits the multiple schedules issue. Sounds like triggering may be the missing piece I'm looking for.
2 years ago
After just finishing writing up my first flow: one approach to take is writing one task at a time into one flow, with the knowledge that it will later be broken down. In writing my big flow, it's becoming more apparent which tasks can get abstracted out and made into other flows.
Yea that makes sense. I ended up with a similar process working on my first flow, breaking out one big task into a small handful. Now to figure out how to parameterize my database credentials for dev vs. prod and get this bad boy deployed.
Having done something like hundreds of tasks in one flow, it feels more manageable to have a handful of tasks per flow and for successful flow runs to trigger other flows.
Our team maintained a large manifest file where each job had all of its dependencies listed. It ended up being a huge file. It's not hard to follow, and you only have to update it for the jobs you're adding, but it seems easier to have multiple localized manifest files and have the system collate them on behalf of the user.
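The collation step could be as simple as merging per-directory manifest dicts; the `{job: [deps]}` shape here is just an assumption about the format:

```python
def collate_manifests(manifests):
    """Merge per-directory manifest dicts ({job: [deps]}) into one
    dependency graph, refusing duplicate job definitions."""
    merged = {}
    for manifest in manifests:
        for job, deps in manifest.items():
            if job in merged:
                raise ValueError(f"job {job!r} defined in more than one manifest")
            merged[job] = list(deps)
    return merged
```

Each subdirectory keeps a small manifest next to its jobs, and the loader walks the tree and merges them at startup.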
It also plays well with how DBT projects are structured, which I am planning on using with Prefect