Hi there, I have a large dataset to process, and it’s taking a long time. I’d like to break my dataset into smaller batches and run each batch in parallel. Is there a good way to do this with a Prefect workflow? Thank you!
04/06/2022, 4:52 PM
What format is your dataset in?
What is taking long? Loading or processing?
04/06/2022, 4:56 PM
It’s a raster file. I need to convert the raster file to vectors (Point format) and upload them to BigQuery. The transformation from raster pixel (row, col) to Point(lon, lat) is taking forever
Though the original file is a raster, I read it with Rasterio, which gives me an array of values, and I turned that into a list. So I’m basically dealing with a list of over 80 million elements
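[Editor’s note: the pixel-to-coordinate step described above is just an affine transform; Rasterio exposes it as `dataset.transform` and applies it via `rasterio.transform.xy`. A minimal pure-Python sketch of that math, with hypothetical transform coefficients and no Rasterio dependency:]

```python
# Affine transform mapping pixel (row, col) -> (lon, lat).
# Coefficients are hypothetical; with Rasterio they come from
# dataset.transform as (a, b, c, d, e, f), where
#   x = a*col + b*row + c   and   y = d*col + e*row + f.
A = (0.1, 0.0, -120.0,   # a, b, c
     0.0, -0.1, 45.0)    # d, e, f

def pixel_to_lonlat(row, col, t=A):
    a, b, c, d, e, f = t
    lon = a * col + b * row + c
    lat = d * col + e * row + f
    return lon, lat

def convert_batch(pixels, t=A):
    # Convert a whole batch of (row, col) pairs at once.
    return [pixel_to_lonlat(r, c, t) for r, c in pixels]
```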
No, I’m not. I’m using the Parallel function in Joblib
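[Editor’s note: one common pitfall with `joblib.Parallel` over 80 million tiny tasks is that per-task dispatch overhead dominates; parallelizing over large chunks instead usually helps. A sketch, where `convert_batch` is a hypothetical stand-in for the per-pixel conversion:]

```python
from joblib import Parallel, delayed

def chunks(seq, size):
    """Yield successive slices of `seq` of length `size`."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def convert_batch(batch):
    # Hypothetical per-pixel (row, col) -> (lon, lat) conversion.
    return [(c * 0.1 - 120.0, 45.0 - r * 0.1) for r, c in batch]

pixels = [(r, c) for r in range(100) for c in range(100)]  # stand-in data

# One joblib task per 1000-pixel chunk instead of one per pixel.
results = Parallel(n_jobs=2)(
    delayed(convert_batch)(b) for b in chunks(pixels, 1000)
)
points = [p for batch in results for p in batch]
```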
04/06/2022, 5:16 PM
That sounds like it should add parallelism already, though, right? I’m a bit confused about what more you’re thinking of adding?
04/06/2022, 5:28 PM
It does, but not enough. We had our Prefect 30-day onboarding meeting yesterday, and one suggestion we got from George is to build a flow of flows: one main flow that breaks down the dataset and passes each batch to a subflow. Theoretically, the subflow can be called multiple times and the runs can execute simultaneously. I just don’t know how to do it.
04/06/2022, 5:34 PM
Have you seen this page? But I don’t understand: even if you create a flow of flows, are you providing more hardware somehow? Is the flow of flows running on Kubernetes or ECS or something like that? The flow you have now already uses the cores of your machine, and a flow of flows can still end up on the same hardware, right? I think you should be looking at scaling with the DaskExecutor
04/06/2022, 5:49 PM
Ah okay, thanks. They also mentioned Dask, I’ll look into it