Hi All - is there a best practice for parallelization within tasks (I'm using a local Dask cluster for reference)? My initial flow used task mapping, but there were tens of thousands of mapped items, which quickly burned through the free-tier limit. My current thought is to test basic parallelization (e.g., Python's multiprocessing) within a task, but I worry that it will interfere with Dask's use of resources. Thanks in advance for any suggestions!
Kevin Kho
07/19/2022, 9:33 PM
You can do that, but remove the DaskExecutor, because Dask will not allow the two-stage parallelism that would result. For the most part, you can just use the LocalExecutor and then use Dask or multiprocessing inside the task.
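A minimal sketch of that pattern, assuming the per-item work is a pure, picklable function; `process_item` and `heavy_task` are illustrative placeholders, not Prefect APIs. The flow would run on the LocalExecutor and call something like `heavy_task` from a single task body, so the fan-out happens inside the task rather than via task mapping:

```python
from multiprocessing import Pool


def process_item(x):
    # Stand-in for the real per-item computation; must be defined at
    # module top level so multiprocessing can pickle it.
    return x * x


def heavy_task(items, workers=4):
    # Called from inside one Prefect task: the pool fans the work out
    # across local processes while the flow itself stays sequential.
    with Pool(processes=workers) as pool:
        return pool.map(process_item, items)


if __name__ == "__main__":
    # Guard is required on spawn-based platforms (macOS, Windows).
    print(heavy_task(range(8)))
```

This keeps the orchestrator's view simple (one task, one result) while the expensive loop still uses all local cores.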
Seth Goodman
07/19/2022, 9:38 PM
Thanks for the quick response. So any use of parallelization within a task would ultimately mean giving up task-level parallelization?
Kevin Kho
07/19/2022, 9:47 PM
You should not have both, because two-stage parallelization can cause resource contention.