Giovanni Giacco
05/05/2021, 9:55 AM
Is there a way to get the Dask client, when running a flow with the `DaskExecutor`, inside a task in order to execute any workload on that Dask cluster? In our tasks we have to deal with huge pandas DataFrames and we’d like to use the same Dask cluster to parallelize our computation. Any tips?

Avi A
05/05/2021, 10:30 AM
You can use `distributed.worker_client` to fetch a client for the cluster your task is running on. Here’s a redacted piece of code from my task. The important part is passing the client to `dask_df.compute`:
```python
with worker_client() as client:
    dask_df = dd.from_pandas(df, chunksize=DEFAULT_CHUNKSIZE)
    logger.debug('Universe size: %d', len(df))
    dask_df['new_col'] = dask_df.apply(some_func, axis=1)
    return dask_df.compute(scheduler=client)
```
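To make the pattern concrete, here is a self-contained sketch assuming Prefect 1.x (current at the time of this thread); the flow name, task names, column names, and chunk size are illustrative, not from the thread:

```python
import pandas as pd
import dask.dataframe as dd
from distributed import worker_client
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def load() -> pd.DataFrame:
    # Illustrative stand-in for the huge pandas DataFrame in the question.
    return pd.DataFrame({"a": [1.0] * 100_000, "b": [2.0] * 100_000})

@task
def enrich(df: pd.DataFrame) -> pd.DataFrame:
    # worker_client() connects to the same cluster the DaskExecutor
    # scheduled this task on, so the heavy work stays on that cluster.
    with worker_client() as client:
        ddf = dd.from_pandas(df, chunksize=10_000)
        ddf["new_col"] = ddf.apply(
            lambda row: row["a"] + row["b"], axis=1, meta=("new_col", "f8")
        )
        # Passing the client as the scheduler routes compute() to the
        # cluster instead of a local thread pool.
        return ddf.compute(scheduler=client)

with Flow("dask-inside-task") as flow:
    enrich(load())

if __name__ == "__main__":
    flow.run(executor=DaskExecutor())
```

An active distributed Client is usually picked up as Dask’s default scheduler anyway, but passing it explicitly as above makes the routing unambiguous.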
Giovanni Giacco
05/05/2021, 10:38 AM
Where does `worker_client()` come from?

Avi A
05/05/2021, 10:46 AM
`worker_client` is a Dask function:

```python
from distributed import worker_client
```
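For reference, launching tasks from within tasks is the documented use case for `worker_client`; the classic recursive example from the Dask docs looks roughly like this:

```python
from distributed import Client, worker_client

def fib(n: int) -> int:
    if n < 2:
        return n
    # worker_client() secedes from the worker's thread pool while waiting,
    # so the nested futures don't deadlock the worker.
    with worker_client() as client:
        a = client.submit(fib, n - 1)
        b = client.submit(fib, n - 2)
        return a.result() + b.result()

if __name__ == "__main__":
    client = Client()  # local cluster, for demonstration only
    print(client.submit(fib, 10).result())  # 55
```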
Kevin Kho
You could also create a subflow that runs with its own `DaskExecutor` and have that subflow run. You might also find this Doc helpful.
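A rough sketch of that suggestion, assuming Prefect 1.x with a backend (Cloud or Server) where the subflow is registered; the flow names, project name, and scheduler address are all illustrative:

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor
from prefect.tasks.prefect import StartFlowRun

@task
def process_chunk(i: int) -> int:
    # Stand-in for the heavy per-chunk work.
    return i * i

# Subflow: mapped tasks fan out in parallel on its own Dask cluster.
with Flow(
    "heavy-subflow",
    executor=DaskExecutor(address="tcp://dask-scheduler:8786"),  # illustrative address
) as subflow:
    process_chunk.map(list(range(100)))

# Parent flow: kicks off the registered subflow and waits for it to finish.
start_heavy = StartFlowRun(flow_name="heavy-subflow", project_name="demo", wait=True)

with Flow("parent-flow") as parent:
    start_heavy()
```

Because the subflow carries its own executor, its mapped tasks parallelize on the Dask cluster while the parent flow simply blocks on the subflow’s final state.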