# ask-community
Hilary Roberts:
What is the best practice for splitting a large workflow with many upstream and downstream datasets into smaller flows? We are starting to set up our first couple of flows to manage the pipelines for our data warehouse. Like most data infrastructures, we will probably have several upstream datasets that feed into the pipelines for datasets further downstream, which in turn feed into more datasets even further downstream, and so on.

My question is how best to organise a large workflow like this. You could create one massive flow, but that wouldn't be very nice to manage. So what is the best way to split this up in Prefect? Some ideas:

• Import a task from the upstream flow into the downstream flow. I don't think this has the intended effect: you end up re-registering the upstream flow every time you do the import, and I think it just duplicates the imported task rather than making the downstream flow actually wait for the upstream flow.
• Create a waiter task that succeeds once the upstream data has landed (a rough sketch of this idea follows below).
• Send some kind of event from the upstream flow that triggers the downstream flow. (Not completely sure how I'd do that.)
• Create a parent flow that triggers a bunch of sub-flows.

Is there a "correct" way to do this? What are people's experiences? Sorry if I missed an obvious piece of documentation or discussion somewhere.
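For illustration, here is a minimal sketch of the waiter-task idea, assuming Prefect 1.x and a hypothetical `data_has_landed()` check; none of these names come from the thread, so swap in whatever marker your warehouse actually exposes:

```python
import time

from prefect import Flow, task


def data_has_landed() -> bool:
    # Placeholder: replace with a real check, e.g. looking for a _SUCCESS file
    # or querying the warehouse for the latest partition.
    return False


@task(timeout=3600)  # fail after an hour instead of polling forever
def wait_for_upstream(poll_seconds: int = 60) -> None:
    # Block until the upstream dataset is available.
    while not data_has_landed():
        time.sleep(poll_seconds)


@task
def build_downstream_dataset() -> None:
    ...  # the actual downstream transformation


with Flow("downstream-flow") as flow:
    landed = wait_for_upstream()
    build_downstream_dataset(upstream_tasks=[landed])
```

The trade-off is that the flow run just sits there polling while it waits, which is what the reply below is referring to.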
Hey @Hilary Roberts, there is no one way to do this. I think the third and fourth ideas are the way to go here. I have seen some people use the waiter task, where they just keep polling in an infinite loop until the upstream dependencies are complete. If you want to use event triggers, the upstream flow would call `Client.create_flow_run` or use the `StartFlowRun` task upon completion to trigger the downstream flow.
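A minimal sketch of that event-trigger pattern, assuming Prefect 1.x and that the downstream flow is registered as "downstream-flow" in a project called "warehouse" (both names are placeholders):

```python
from prefect import Flow, task
from prefect.tasks.prefect import StartFlowRun


@task
def load_upstream_dataset() -> None:
    ...  # produce the upstream data


# StartFlowRun is itself a task; with wait=False it just fires off the
# downstream flow run and finishes immediately.
trigger_downstream = StartFlowRun(
    flow_name="downstream-flow",
    project_name="warehouse",
    wait=False,
)

with Flow("upstream-flow") as upstream_flow:
    loaded = load_upstream_dataset()
    trigger_downstream(upstream_tasks=[loaded])
```

The `Client.create_flow_run` route is similar, except you call the client yourself from inside a task and pass the downstream flow's id rather than its name.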
The parent flow is also a good option, especially if the sub-flows need heterogeneous hardware: each one can have a different Executor depending on its needs. Everything can be controlled with `StartFlowRun` calls, and I think you can manage more complicated dependencies than with the previous setups. Just pass `wait=True` into the `StartFlowRun` call.
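A minimal sketch of that parent-flow ("flow of flows") setup, again assuming Prefect 1.x and three registered child flows whose names and project are placeholders:

```python
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# wait=True makes each StartFlowRun task block until its child flow run
# finishes, so ordinary task dependencies express the ordering of sub-flows.
extract = StartFlowRun(flow_name="extract-flow", project_name="warehouse", wait=True)
transform = StartFlowRun(flow_name="transform-flow", project_name="warehouse", wait=True)
load = StartFlowRun(flow_name="load-flow", project_name="warehouse", wait=True)

with Flow("parent-flow") as parent:
    e = extract()
    t = transform(upstream_tasks=[e])
    load(upstream_tasks=[t])
```

Each child flow keeps its own executor and run configuration, so the parent only handles orchestration.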