# prefect-cloud
d
Hey all, I have a pipeline that takes a list of .tsv files as input, does some processing on the data, then writes the results to databases. The whole pipeline with 61 files as input takes around 20 mins to run locally. However, when I run it in Prefect Cloud through a Cloud Run job (4GB memory and 1 CPU) it takes 4+ hours to run on all the files. What is the best way to increase performance when running a pipeline in Prefect Cloud? My cloud analytics say CPU utilisation peaks around 60% and memory utilisation never goes above 30%, indicating that there aren't CPU or memory bottlenecks. I've done some research around task runners: https://docs.prefect.io/latest/concepts/task-runners/ . Would this be my best option for increasing performance? Any help would be much appreciated, thank you in advance.
n
hi @Daniel - running against Cloud shouldn't introduce that much latency unless you're getting rate limited with retries and exponential backoff, or something like that. Can you share the structure of your pipeline? i.e. how are you making use of concurrency, etc.?
d
The general structure of the pipeline is 1 flow with 8 tasks, plus a subflow with 1 task in it. I don't believe I am making use of concurrency in any way at the moment and was not aware of how to make use of it in Prefect.
n
do your tasks depend on each other? i.e.
```python
foo_task_result = foo_task()
bar_task(foo_task_result)
```
d
Yes, all 8 tasks in the main flow depend on each other.
And I think I’m projected to have 12000+ task runs
n
I see - concurrency via `map` or something like that won't work here then.
> And I think I'm projected to have 12000+ task runs
is that because you're running it 12000 // 8 times?
perhaps it's possible to write this so that you can run subflows (each containing the 8 sequential tasks) concurrently?
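something like this, roughly (just a sketch, assuming Prefect 2.x, which the task-runner docs you linked are for; `process_file` and the `step_*` names are made-up stand-ins for your real tasks):
```python
from prefect import flow, task

@task
async def step_one(path: str):
    ...  # stand-in for your real processing

@task
async def step_two(data):
    ...  # stand-in

# one subflow per input file, holding the 8 sequential, dependent tasks;
# defined async so the subflows can be kicked off concurrently later
@flow
async def process_file(path: str):
    data = await step_one(path)
    data = await step_two(data)
    # ... the remaining 6 tasks, each consuming the previous result
    return data
```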
d
It's roughly 12,000 runs because it takes approx. 212 task runs to run the pipeline on 1 file, and since I'm inputting 61 files it's around 212 × 61 ≈ 12,900, although it varies as the files are all slightly different sizes.
Would writing more subflows increase speed?
n
hmm, I'm probably missing some context, but I don't quite understand the setup
> it takes approx. 212 task runs to run the pipeline on 1 file
> Would writing more subflows increase speed?
subflows are just functions (like tasks) that group tasks (or other normal Python code). I'm saying that if you can run these functions (each calling your 8 dependent tasks) concurrently, then yeah, that would be faster.
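e.g. something like this (again just a sketch, building on the hypothetical `process_file` subflow above; gathering async subflows from an async parent flow is a standard Prefect 2 pattern for concurrent subflow runs):
```python
import asyncio
from prefect import flow

@flow
async def main(paths: list[str]):
    # one process_file subflow per input file; the 8 tasks inside each
    # stay sequential, but the 61 files are processed concurrently
    await asyncio.gather(*(process_file(path) for path in paths))
```
one caveat: with a single CPU, concurrency mainly helps the I/O-bound parts (reading files, DB writes); if the processing itself is CPU-bound you may also want a bigger machine, or one of the Dask/Ray task runners from the docs you linked.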