# prefect-cloud
d
Hey all, I have a pipeline that takes a list of .tsv files as input, does some processing on the data, then writes the results to databases. The whole pipeline with 61 files as input takes around 20 mins to run locally. However, when I run it in Prefect Cloud through a Cloud Run job (4GB memory and 1 CPU) it takes 4+ hours to run on all the files. What is the best way to increase performance when running a pipeline in Prefect Cloud? My cloud analytics say CPU utilisation peaks around 60% and memory utilisation never goes above 30%, indicating that there aren't CPU or memory bottlenecks. I've done some research around task runners: https://docs.prefect.io/latest/concepts/task-runners/ . Would this be my best option for increasing performance? Any help would be much appreciated, thank you in advance.
n
hi @Daniel - running against Cloud shouldn't introduce that much latency unless you're getting rate limited with retries and exponential backoff, or something like that. Can you share the structure of your pipeline? i.e. how are you making use of concurrency, etc.?
d
The general structure of the pipeline is 1 flow with 8 tasks, plus a subflow with 1 task in it. I don't believe I am making use of concurrency in any way at the moment and was not aware of how to make use of it in Prefect.
n
do your tasks depend on each other? i.e.
```python
foo_task_result = foo_task()
bar_task(foo_task_result)
```
d
Yes, all 8 tasks in the main flow depend on each other.
And I think I’m projected to have 12000+ task runs
n
I see - concurrency via `map` or something like that won't work here then.
> And I think I'm projected to have 12000+ task runs
is that because you're running it 12000 // 8 times?
perhaps it's possible to write this so that you can run subflows (each containing the 8 sequential tasks) concurrently?
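something like this, roughly (just a sketch, assuming Prefect 2.x, which the task-runner docs you linked are for; `process_file` and the `step_*` names are made-up stand-ins for your real tasks):
```python
from prefect import flow, task

@task
async def step_one(path: str):
    ...  # stand-in for your real processing

@task
async def step_two(data):
    ...  # stand-in

# one subflow per input file, holding the 8 sequential, dependent tasks;
# defined async so the subflows can be kicked off concurrently later
@flow
async def process_file(path: str):
    data = await step_one(path)
    data = await step_two(data)
    # ... the remaining 6 tasks, each consuming the previous result
    return data
```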
d
It's roughly 12,000 runs because it takes approx. 212 task runs to run the pipeline on 1 file, and since I'm inputting 61 files it's around 212 × 61 ≈ 12,900, although it varies as the files are all slightly different sizes.
Would writing more subflows increase speed?
n
hmm, I'm probably missing some context, but I don't quite understand the setup
> it takes approx. 212 task runs to run the pipeline on 1 file
> Would writing more subflows increase speed?
subflows are just functions (like tasks) that group tasks (or other normal Python code). I'm saying that if you can run these functions (each calling your 8 dependent tasks) concurrently, then yeah, that would be faster.
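e.g. something like this (again just a sketch, building on the hypothetical `process_file` subflow above; gathering async subflows from an async parent flow is a standard Prefect 2 pattern for concurrent subflow runs):
```python
import asyncio
from prefect import flow

@flow
async def main(paths: list[str]):
    # one process_file subflow per input file; the 8 tasks inside each
    # stay sequential, but the 61 files are processed concurrently
    await asyncio.gather(*(process_file(path) for path in paths))
```
one caveat: with a single CPU, concurrency mainly helps the I/O-bound parts (reading files, DB writes); if the processing itself is CPU-bound you may also want a bigger machine, or one of the Dask/Ray task runners from the docs you linked.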