Maryam Veisi

02/27/2023, 7:29 AM
I have 3 flows for a data workflow (flow1 extracts the data, flow2 makes the data ready, flow3 creates the training set). For the next step, we want to run this workflow (all 3 flows) for different parameters, 1000+ times in parallel. Our concerns are: 1. memory 2. how to set up the pipeline to run 1000+ flows in parallel while running the subflows sequentially. Any thoughts?

Austin Weisgrau

02/27/2023, 4:34 PM
Is memory a concern? You could set up a cloud function or cloud container (AWS Lambda or ECS) to take parameters and run the flows, and then iterate through the parameters locally, triggering all the cloud jobs.
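A minimal sketch of that fan-out pattern, assuming a hypothetical `trigger_cloud_job` that submits one parameter set to your cloud runner (the stand-in below just echoes its input rather than calling AWS):

```python
from concurrent.futures import ThreadPoolExecutor

def trigger_cloud_job(params):
    # Hypothetical stand-in: in practice this would invoke an AWS Lambda
    # function or start an ECS task with `params`, e.g. via boto3.
    return {"submitted": params}

parameter_sets = [{"run_id": i} for i in range(1000)]

# Threads are appropriate here: each call just fires off a request and
# waits, while the heavy lifting happens in the cloud.
with ThreadPoolExecutor(max_workers=50) as pool:
    results = list(pool.map(trigger_cloud_job, parameter_sets))
```

Because the local side only submits work, memory use stays flat no matter how many runs you trigger.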

Maryam Veisi

02/27/2023, 4:43 PM
Thanks @Austin Weisgrau, I am very new to Prefect. Our concerns are: 1. memory 2. how to set up the pipeline to run 1000+ flows in parallel while running the subflows sequentially.

Austin Weisgrau

02/27/2023, 7:04 PM
So Prefect itself doesn't actually execute the flows by itself. You can either:
• Execute the flows like you would run normal Python code, e.g. from a Python or bash script.
• Create a deployment based on the flows, trigger all the deployment runs with parameters using the deployment API, and have a bunch of Prefect agents running in some local or cloud infrastructure you set up to run the flows (more complicated than it sounds like you want or need).
If you want to run 1000+ flows in parallel, you're probably not going to be able to do that on one machine, so you'd probably need to set up some kind of cloud infrastructure to execute the flows on a bunch of different machines.
To set up the subflows to run in sequence, you can set up a flow that looks like:
from prefect import flow

@flow
def run_pipeline(parameters):
    extracted_data = flow_1(parameters)
    prepared_data = flow_2(extracted_data)
    training_set = flow_3(prepared_data)
    return training_set
To execute this flow one time, you'd run it like normal Python code:
from run_pipeline import run_pipeline  # assuming the flow lives in run_pipeline.py

my_parameters = {'something': something}

run_pipeline(my_parameters)

Maryam Veisi

02/27/2023, 7:46 PM
Thanks @Austin Weisgrau, I think we will go with the second option you suggested and create a deployment based on the flows. Do you have any good examples or resources for learning about Prefect deployments?

Austin Weisgrau

02/27/2023, 7:47 PM
Here's the docs on deployments: https://docs.prefect.io/concepts/deployments/

Maryam Veisi

03/01/2023, 4:06 PM
@Austin Weisgrau I have been thinking about your second solution and still don't know where in the deployment the parallel runs of flows are defined.
Is this the process we need to follow?
• we use asyncio when writing the flow
• we set up infrastructure to be able to run the parallel flows
• we define agents in a way to be able to run the parallel flows

Austin Weisgrau

03/01/2023, 4:31 PM
You don't need to use asyncio. If you wanted to run the flows on the same machine, you could use a bash script to execute the flows in separate parallel processes:
#!/bin/bash
python /path/to/flow.py &
python /path/to/flow.py &
python /path/to/flow.py &
...
If you want to trigger the flows using the API and have Prefect agents carry out the execution, then you need Prefect agents running somewhere. You could have a bunch of Prefect agents running on the same machine, or set up cloud infrastructure with one or more Prefect agents on each virtual machine. Running 1000 flows at the same time is likely more than one machine can handle, so you would need cloud infrastructure to spread the load.
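The same parallel launch can also be done from Python with `subprocess`; a sketch where a trivial inline script stands in for `/path/to/flow.py`:

```python
import subprocess
import sys

# Same idea as the bash script above: launch each flow run as its own
# OS process. `sys.executable -c ...` stands in for /path/to/flow.py.
commands = [
    [sys.executable, "-c", f"print('flow run {i}')"] for i in range(3)
]
procs = [
    subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for cmd in commands
]

# Wait for all runs to finish and collect their output and exit codes.
outputs = [p.communicate()[0].strip() for p in procs]
exit_codes = [p.returncode for p in procs]
```

Unlike `&` in bash, this gives you the exit code of each run, so you can retry or report failures per parameter set.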