Harrison Kim

09/15/2022, 7:35 PM
Hello! I have a question about how resources are used in Prefect. I am using the DaskExecutor with a LocalCluster (2 workers, 4 threads per worker) and am testing by processing 800 files, which I split evenly into 8 batches. Originally, each Prefect task received a batch and processed its files in a loop. When I changed this to explicitly run threads within the tasks (still mapping to them) using the existing client, as Kevin suggested in this post, I saw significant time savings (25-50% faster). My question is: are there drawbacks to batch processing this way? Is this the right way to do it, or were we not using Prefect correctly before?

Christopher Boyd

09/15/2022, 8:55 PM
Hi Harrison, I’m not really an expert on Dask or the processing here, but is there a specific concern you have with running these operations? My instinct is that as long as things are working well, you should be good. There are many ways to tackle a problem, and some are definitely more efficient than others.

Harrison Kim

09/15/2022, 9:13 PM
Thanks Chris! There do not appear to be any issues when I look at the scheduler, but there was not much documentation around doing it this way. I wasn't sure whether there is any concern with explicitly calling the client this way to leverage threads. Does it compete with how Prefect uses the scheduler behind the scenes, or would it cause problems down the line? The post I linked says "using dask without mapping", but I am using Dask with Prefect's mapping, and I don't want to set myself up for failure as we scale up, since I do not completely understand how Prefect uses the Dask executor behind the scenes.
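
For readers following along, here is a rough sketch of the two batch-processing styles being compared. It is not the code from the thread: it swaps in the standard library's ThreadPoolExecutor in place of the existing Dask client, leaves out Prefect's mapping entirely, and the names `split_into_batches`, `process_file`, `process_batch_loop`, and `process_batch_threaded` are hypothetical. The 800 files, 8 batches, and 4 threads mirror the numbers above.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, n_batches):
    """Split items into n_batches contiguous batches of near-equal size."""
    size, extra = divmod(len(items), n_batches)
    batches, start = [], 0
    for i in range(n_batches):
        # The first `extra` batches take one additional item.
        end = start + size + (1 if i < extra else 0)
        batches.append(items[start:end])
        start = end
    return batches


def process_file(path):
    # Hypothetical placeholder for the real per-file work.
    return path.upper()


def process_batch_loop(batch):
    # Original style: one task handles its batch sequentially.
    return [process_file(path) for path in batch]


def process_batch_threaded(batch, max_workers=4):
    # Second style: fan the files in a batch out to threads.
    # Assumption: a stdlib thread pool stands in for the Dask client here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_file, batch))


files = [f"file_{i:03d}.csv" for i in range(800)]
batches = split_into_batches(files, 8)  # 8 batches of 100 files each
results = [process_batch_threaded(batch) for batch in batches]
```

Both styles return the same results in the same order. Any speedup from the threaded version would depend on the per-file work being I/O-bound or otherwise releasing the GIL, which is an assumption about the workload rather than something stated in the thread.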