Matt Wong-Kemp07/24/2020, 8:58 PM
is it safe to share the dask cluster between the execution of the tasks and direct use of dask? The use case here is that I have a large number of small tasks I'd like to run in parallel, followed by some large joined data analysis. I'd like to get the concurrency of running on a dask cluster in my flows, but at the same time I want to perform some data analysis using the dask dataframe class and distribute that across the cluster as well. If I provision the cluster myself, is it safe to share the scheduler between the flow and the dataframe library? Or should I expect to need to provision my own dask cluster inside a task to run my dataframe code on?
Jim Crist-Harif07/24/2020, 9:01 PM
Matt Wong-Kemp07/24/2020, 9:08 PM
dask is obviously built on coroutines, and a very quick glance at the executor base class tells me it's mostly working in terms of future-shaped things - this would be very useful in my scenario, where I'd like to shell out to 10k kubernetes jobs (which effectively act like coroutines rather than tasks) and then perform dataframe analysis on the set I've got. I'm confident that this will work at the moment, but I'd like to make my cluster adaptive, so that I can use the concurrency needed for launching the tasks at the start and the parallelism needed for the dataframe at the end.
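For the adaptive part, dask's deploy API makes this a one-liner (a sketch, again using `LocalCluster` as a stand-in for the real cluster class; the worker counts are illustrative):

```python
from dask.distributed import LocalCluster  # stand-in for the real cluster class

cluster = LocalCluster(n_workers=1, processes=False)
# Let the scheduler scale between 1 worker (while mostly waiting on the
# launched jobs) and 10 workers (for the dataframe crunch at the end).
adaptive = cluster.adapt(minimum=1, maximum=10)
cluster.close()
```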
Jim Crist-Harif07/24/2020, 9:13 PM
). You could then use the results downstream in prefect tasks as needed (mapping, etc...).
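One way to read "use the results downstream" on a shared cluster is dask's `worker_client` pattern: a function running as a task on a dask worker reconnects to its own scheduler and fans work out across the same workers. A sketch (the function body and names are illustrative, not from the thread):

```python
from distributed import worker_client

def summarize(partitions):
    # When this runs as a task on a dask worker, worker_client() hands back
    # a Client connected to the scheduler that is already running it, so the
    # fan-out below shares the cluster with the surrounding work.
    with worker_client() as client:
        futures = client.map(len, partitions)
        return sum(client.gather(futures))
```

Inside a Prefect flow this would be the body of a task; the same pattern covers dask.dataframe calls, since any collection computed under the `worker_client` context uses the shared scheduler.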
Matt Wong-Kemp07/24/2020, 9:26 PM
Jim Crist-Harif07/24/2020, 9:29 PM
Matt Wong-Kemp07/24/2020, 9:29 PM
Jim Crist-Harif07/24/2020, 9:30 PM
Matt Wong-Kemp07/24/2020, 9:34 PM