# ask-community
Hilary Roberts:
What is the best practice for splitting a large workflow with many upstream and downstream datasets into smaller flows? We are starting to set up our first couple of flows to manage the pipelines for our data warehouse. Like most data infrastructures, we will probably have several upstream datasets that feed into the pipelines for datasets further downstream, which in turn feed into more datasets even further downstream, and so on.

My question is how best to organise a large workflow like this. You could create one massive flow, but that wouldn't be very nice to manage. So what is the best way to split this up in Prefect? Some ideas:

• Import a task from the upstream flow into the downstream flow. I don't think this has the intended effect: you end up re-registering the upstream flow every time you do the import, and I think it just duplicates the imported task rather than making the downstream flow actually wait for the upstream flow.
• Create a waiter task that succeeds once the upstream data has landed (a rough sketch of this idea follows below).
• Send some kind of event from the upstream flow that triggers the downstream flow. (Not completely sure how I'd do that.)
• Create a parent flow that triggers a bunch of sub-flows.

Is there a "correct" way to do this? What are people's experiences? Sorry if I missed an obvious piece of documentation or discussion somewhere.
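For illustration, here is a minimal sketch of the waiter-task idea, assuming Prefect 1.x and a hypothetical `data_has_landed()` check; none of these names come from the thread, so swap in whatever marker your warehouse actually exposes:

```python
import time

from prefect import Flow, task


def data_has_landed() -> bool:
    # Placeholder: replace with a real check, e.g. looking for a _SUCCESS file
    # or querying the warehouse for the latest partition.
    return False


@task(timeout=3600)  # fail after an hour instead of polling forever
def wait_for_upstream(poll_seconds: int = 60) -> None:
    # Block until the upstream dataset is available.
    while not data_has_landed():
        time.sleep(poll_seconds)


@task
def build_downstream_dataset() -> None:
    ...  # the actual downstream transformation


with Flow("downstream-flow") as flow:
    landed = wait_for_upstream()
    build_downstream_dataset(upstream_tasks=[landed])
```

The trade-off is that the flow run just sits there polling while it waits, which is what the reply below is referring to.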
Hey @Hilary Roberts, there is no one way to do this. I think the third and fourth ideas are the way to go here. I have seen some people use the waiter task, where they just keep polling in an infinite loop until the upstream dependencies are complete. If you want to use event triggers, the upstream flow would call `Client.create_flow_run` or use the `StartFlowRun` task upon completion to trigger the downstream flow.
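A minimal sketch of that event-trigger pattern, assuming Prefect 1.x and that the downstream flow is registered as "downstream-flow" in a project called "warehouse" (both names are placeholders):

```python
from prefect import Flow, task
from prefect.tasks.prefect import StartFlowRun


@task
def load_upstream_dataset() -> None:
    ...  # produce the upstream data


# StartFlowRun is itself a task; with wait=False it just fires off the
# downstream flow run and finishes immediately.
trigger_downstream = StartFlowRun(
    flow_name="downstream-flow",
    project_name="warehouse",
    wait=False,
)

with Flow("upstream-flow") as upstream_flow:
    loaded = load_upstream_dataset()
    trigger_downstream(upstream_tasks=[loaded])
```

The `Client.create_flow_run` route is similar, except you call the client yourself from inside a task and pass the downstream flow's id rather than its name.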
The parent flow is also a good option, especially if the sub-flows need heterogeneous hardware: each one can have a different Executor depending on its needs. Everything can be controlled with `StartFlowRun` calls, and I think you can manage more complicated dependencies than with the previous setups. Just pass `wait=True` into the `StartFlowRun` call.
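A minimal sketch of that parent-flow ("flow of flows") setup, again assuming Prefect 1.x and three registered child flows whose names and project are placeholders:

```python
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# wait=True makes each StartFlowRun task block until its child flow run
# finishes, so ordinary task dependencies express the ordering of sub-flows.
extract = StartFlowRun(flow_name="extract-flow", project_name="warehouse", wait=True)
transform = StartFlowRun(flow_name="transform-flow", project_name="warehouse", wait=True)
load = StartFlowRun(flow_name="load-flow", project_name="warehouse", wait=True)

with Flow("parent-flow") as parent:
    e = extract()
    t = transform(upstream_tasks=[e])
    load(upstream_tasks=[t])
```

Each child flow keeps its own executor and run configuration, so the parent only handles orchestration.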