Meng Si

04/06/2022, 4:52 PM
Hi there, I have a large dataset that I want to process, and it’s taking a long time. I want to break the dataset into smaller batches and run each batch in parallel. Is there a good way to do this with a Prefect workflow? Thank you!

Kevin Kho

04/06/2022, 4:52 PM
What format is your dataset in?
What is taking long? Loading or processing?

Meng Si

04/06/2022, 4:56 PM
It’s a raster file. I need to convert the raster to vectors (Point format) and upload them to BigQuery. The transformation from raster pixel (row, col) to Point(lon, lat) is taking forever.
Though the original file is a raster, I read it with Rasterio into an array of values, which I then turned into a list. So I’m basically dealing with a list of over 80 million elements.
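
For context, the slow step described here might look something like the following minimal Rasterio sketch; the file path and the per-pixel loop are assumptions, not the actual code:

```python
import rasterio
from rasterio.transform import xy

# Open the raster and read the first band into a 2D array.
with rasterio.open("input.tif") as src:  # path is a placeholder
    band = src.read(1)
    rows, cols = band.shape

    # Convert each (row, col) pixel index to an (x, y) coordinate.
    # Doing this one pixel at a time over ~80M elements is the slow part.
    points = [
        xy(src.transform, r, c)  # returns (x, y) for that pixel
        for r in range(rows)
        for c in range(cols)
    ]
```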

Kevin Kho

04/06/2022, 5:05 PM
Ah ok, are you using task mapping already?
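
For reference, Prefect 1.x task mapping fans a task out over an iterable, creating one task run per element. A minimal sketch, with the task bodies and batch sizes as placeholders:

```python
from prefect import Flow, task

@task
def make_batches():
    # Placeholder: split the full pixel list into smaller chunks.
    return [list(range(i, i + 10)) for i in range(0, 100, 10)]

@task
def convert_batch(batch):
    # Placeholder for the pixel -> Point conversion on one batch.
    return [x * 2 for x in batch]

with Flow("raster-to-points") as flow:
    batches = make_batches()
    # .map() creates one task run per batch; with a parallel
    # executor attached, those task runs execute concurrently.
    results = convert_batch.map(batches)
```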

Meng Si

04/06/2022, 5:12 PM
No, I’m not. I’m using the Parallel function in Joblib.
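
That setup presumably resembles this minimal joblib sketch; the function and data names are assumptions:

```python
from joblib import Parallel, delayed

def convert_pixel(pixel):
    # Placeholder for the (row, col) -> (lon, lat) conversion.
    return pixel * 2

pixels = range(1_000)

# n_jobs=-1 uses all available cores on the local machine.
results = Parallel(n_jobs=-1)(delayed(convert_pixel)(p) for p in pixels)
```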

Kevin Kho

04/06/2022, 5:16 PM
That sounds like it should add parallelism already, right? I’m a bit confused about what more you are thinking of adding.

Meng Si

04/06/2022, 5:28 PM
It does, but not enough. We had our Prefect 30-day onboarding meeting yesterday, and one suggestion we got from George was to build a flow of flows: one main flow that breaks down the dataset and passes each batch to a subflow. Theoretically, the subflow can be called multiple times and run simultaneously. I just don’t know how to do it.
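
In Prefect 1.x, a flow of flows is typically built with the create_flow_run task, which can itself be mapped. A minimal sketch, assuming a child flow named "process-batch" is registered in a project and accepts a "batch" parameter (all names here are placeholders):

```python
from prefect import Flow, unmapped
from prefect.tasks.prefect import create_flow_run

# One parameters dict per child flow run.
batches = [{"batch": i} for i in range(10)]

with Flow("main-flow") as flow:
    # Kick off one child flow run per batch; the runs can
    # proceed concurrently, each on its own infrastructure.
    create_flow_run.map(
        parameters=batches,
        flow_name=unmapped("process-batch"),
        project_name=unmapped("my-project"),
    )
```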

Kevin Kho

04/06/2022, 5:34 PM
Have you seen this page? But I don’t understand, because even if you create a flow of flows, are you providing more hardware somehow? Is the flow of flows happening on Kubernetes or ECS or something like that? The flow you have now already uses the cores of your machine, and a flow of flows would still use the same hardware, right? I think you should be looking at scaling with the DaskExecutor.
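
For reference, attaching a DaskExecutor to a Prefect 1.x flow looks roughly like this; whether it spins up a temporary local cluster or connects to an existing external one is a deployment choice, and the address shown is a placeholder:

```python
from prefect import Flow
from prefect.executors import DaskExecutor

with Flow("raster-to-points") as flow:
    ...  # mapped tasks as in the sketch above

# Run mapped task runs in parallel on a temporary local Dask cluster.
flow.executor = DaskExecutor()
# To scale beyond one machine, point at an existing Dask scheduler:
# flow.executor = DaskExecutor(address="tcp://scheduler:8786")
```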

Meng Si

04/06/2022, 5:49 PM
Ah okay, thanks. They also mentioned Dask; I’ll look into it.