https://prefect.io logo
Title
w

wonsun

09/07/2022, 9:38 AM
Hi all~ I made a flow in prefect 1.0 that reads more than 1 million datas from the database and processes those data. When testing to see if the flow was runned well, I didn't test it in such a large amount, so it worked well, but when I run the flow with such a large amount of data, a broken pipeline error occurs. This may be due to the max_allowed_packet setting of RDS being used. And I registered that flow to the cloud and tried to run it, but it was also because there was too much data, so I couldn't get past the first task. Anyway, what I'd like to ask for help is to make the flow run only once for processing a large amount of data. I'm wondering if it's possible to set the number of data the task will be executed on. Like determining the batch size in deep learning. Or do i fix the amount of data to be executed in one flow, and when that amount of data is processed, the flow will end and restarting the same flow over and over to complete all the data? For example, if the total amount of data is 1 million, i fixed the amount of data processed every time is 10,000 and run the flow 10 times. Any idea to handle such heavy data, i completly welcome with open arms.🤗
1
c

Christopher Boyd

09/07/2022, 12:56 PM
HI @wonsun - I believe typical practice when processing large datasets is pagination and splitting that data up into chunks. Without knowing too much more of what you are doing., or how that data is retrieved, I think it would be a good idea to split the 1 million data points out into some sort of data structure / dataframe, and process them one at a time (per frame, not per data point).
If the ordering of the processing doesn’t really matter, this also could be a good scenario for dask usage and distributing that processing
w

wonsun

09/07/2022, 10:07 PM
Thanks Christoper! So you mean, define the amount of dataframe that a single flow processes in the tasks? And i'll see dask executor 🙂
c

Christopher Boyd

09/09/2022, 8:26 PM
That's right , I'd basically load only so much data to be processed over a number of tasks me not one big task