Meng Si

04/06/2022, 4:52 PM
Hi there, I have a large dataset that I want to process, and it’s taking a long time. I want to break the dataset into smaller batches and run each batch in parallel. Is there a good way to do this with a Prefect workflow? Thank you!

Kevin Kho

04/06/2022, 4:52 PM
What format is your dataset in?
What is taking long? Loading or processing?

Meng Si

04/06/2022, 4:56 PM
It’s a raster file. I need to convert the raster to vectors (Point format) and upload them to BigQuery. The transformation from raster pixel (row, col) to Point(lon, lat) is taking forever.
Though the original file is a raster, I read it with Rasterio into an array of values, which I then turned into a list. So I’m basically dealing with a list of over 80 million elements.
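
For context, the slow step described here might look something like the following minimal Rasterio sketch; the file path and the per-pixel loop are assumptions, not the actual code:

```python
import rasterio
from rasterio.transform import xy

# Open the raster and read the first band into a 2D array.
with rasterio.open("input.tif") as src:  # path is a placeholder
    band = src.read(1)
    rows, cols = band.shape

    # Convert each (row, col) pixel index to an (x, y) coordinate.
    # Doing this one pixel at a time over ~80M elements is the slow part.
    points = [
        xy(src.transform, r, c)  # returns (x, y) for that pixel
        for r in range(rows)
        for c in range(cols)
    ]
```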

Kevin Kho

04/06/2022, 5:05 PM
Ah ok, are you using task mapping already?
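
For reference, Prefect 1.x task mapping fans a task out over an iterable, creating one task run per element. A minimal sketch, with the task bodies and batch sizes as placeholders:

```python
from prefect import Flow, task

@task
def make_batches():
    # Placeholder: split the full pixel list into smaller chunks.
    return [list(range(i, i + 10)) for i in range(0, 100, 10)]

@task
def convert_batch(batch):
    # Placeholder for the pixel -> Point conversion on one batch.
    return [x * 2 for x in batch]

with Flow("raster-to-points") as flow:
    batches = make_batches()
    # .map() creates one task run per batch; with a parallel
    # executor attached, those task runs execute concurrently.
    results = convert_batch.map(batches)
```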

Meng Si

04/06/2022, 5:12 PM
No, I’m not. I’m using the Parallel function in Joblib.
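
That setup presumably resembles this minimal joblib sketch; the function and data names are assumptions:

```python
from joblib import Parallel, delayed

def convert_pixel(pixel):
    # Placeholder for the (row, col) -> (lon, lat) conversion.
    return pixel * 2

pixels = range(1_000)

# n_jobs=-1 uses all available cores on the local machine.
results = Parallel(n_jobs=-1)(delayed(convert_pixel)(p) for p in pixels)
```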

Kevin Kho

04/06/2022, 5:16 PM
That sounds like it should add parallelism already, right? I’m a bit confused about what more you are thinking of adding.

Meng Si

04/06/2022, 5:28 PM
It does, but not enough. We had our Prefect 30-day onboarding meeting yesterday, and one suggestion we got from George was to build a flow of flows: one main flow that breaks down the dataset and passes each batch to a subflow. Theoretically, the subflow can be called multiple times and run simultaneously. I just don’t know how to do it.
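
In Prefect 1.x, a flow of flows is typically built with the create_flow_run task, which can itself be mapped. A minimal sketch, assuming a child flow named "process-batch" is registered in a project and accepts a "batch" parameter (all names here are placeholders):

```python
from prefect import Flow, unmapped
from prefect.tasks.prefect import create_flow_run

# One parameters dict per child flow run.
batches = [{"batch": i} for i in range(10)]

with Flow("main-flow") as flow:
    # Kick off one child flow run per batch; the runs can
    # proceed concurrently, each on its own infrastructure.
    create_flow_run.map(
        parameters=batches,
        flow_name=unmapped("process-batch"),
        project_name=unmapped("my-project"),
    )
```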

Kevin Kho

04/06/2022, 5:34 PM
Have you seen this page? But I don’t understand, because even if you create a flow of flows, are you providing more hardware somehow? Is the flow of flows happening on Kubernetes or ECS or something like that? The flow you have now already uses the cores of your machine, and a flow of flows would still use the same hardware, right? I think you should be looking at scaling with the DaskExecutor.
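
For reference, attaching a DaskExecutor to a Prefect 1.x flow looks roughly like this; whether it spins up a temporary local cluster or connects to an existing external one is a deployment choice, and the address shown is a placeholder:

```python
from prefect import Flow
from prefect.executors import DaskExecutor

with Flow("raster-to-points") as flow:
    ...  # mapped tasks as in the sketch above

# Run mapped task runs in parallel on a temporary local Dask cluster.
flow.executor = DaskExecutor()
# To scale beyond one machine, point at an existing Dask scheduler:
# flow.executor = DaskExecutor(address="tcp://scheduler:8786")
```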

Meng Si

04/06/2022, 5:49 PM
Ah okay, thanks. They also mentioned Dask; I’ll look into it.