
Jan Therhaag

09/04/2019, 3:10 PM
Basically I wonder what happens when I create a Dask dataframe within a task - can I still take advantage of parallelization over partitions of the collection when transforming it?

Chris White

09/04/2019, 3:13 PM
Yea sure! If you aren't actually connecting to a Dask cluster and are just utilizing Dask's local parallelism, you could run Prefect with a LocalExecutor with no issues, and let your individual tasks deal with Dask. Alternatively, if you do want to submit your Prefect tasks to a Dask cluster, you can still interact with that cluster from within your tasks using a Dask worker_client.
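
A minimal sketch of the first option (LocalExecutor, with Dask's local scheduler doing the work inside the task), assuming the Prefect 0.6-era API that was current at the time; the parquet path and column name are placeholders:

```python
# Prefect runs the flow with a LocalExecutor; the task itself builds a
# Dask dataframe and lets Dask's local (threaded) scheduler parallelize
# over partitions.
import dask.dataframe as dd
from prefect import task, Flow
from prefect.engine.executors import LocalExecutor

@task
def summarize(path):
    ddf = dd.read_parquet(path)                   # lazy, partitioned dataframe
    return ddf.groupby("key").size().compute()    # computed on local threads

with Flow("local-dask") as flow:
    summarize("/data/input/*.parquet")

flow.run(executor=LocalExecutor())
```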

Jan Therhaag

09/04/2019, 3:24 PM
Thank you for the quick reply - I'm actually after the latter case. Would it be sufficient then to use worker_client as a context manager, so that operations inside it like dd.read_parquet() automatically take advantage of the cluster by submitting more tasks?

Joe Schmid

09/04/2019, 3:24 PM
@Jan Therhaag we have been doing the steps in your flow (read parquet files to Dask DataFrames, merge them, transform, write out to other parquet files) successfully using Prefect and a long-running Dask cluster. We're not using Kartothek, but the idea should be the same. Dask parallelism works as usual; it's just that a Prefect task running on a worker will submit task(s) to the Dask scheduler (e.g. ddf = dd.read_parquet()), which will get distributed to other Dask workers.
🚀 2
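
A hypothetical sketch of the kind of flow Joe describes (read parquet, merge, transform, write parquet), with the Prefect tasks submitted to the cluster via a DaskExecutor and the Dask work fanned out from inside the task through worker_client. The scheduler address, paths, and column names are made up, and the Prefect import path is the 0.6-era one:

```python
# Prefect tasks run on Dask workers (DaskExecutor); inside the task,
# worker_client gives access to the same cluster, so dask.dataframe
# operations are distributed across the other workers.
import dask.dataframe as dd
from distributed import worker_client
from prefect import task, Flow
from prefect.engine.executors import DaskExecutor

@task
def merge_and_transform(left_path, right_path, out_path):
    with worker_client() as client:
        left = dd.read_parquet(left_path)          # lazy, partitioned reads
        right = dd.read_parquet(right_path)
        merged = left.merge(right, on="id")
        transformed = merged.assign(total=merged["a"] + merged["b"])
        write = transformed.to_parquet(out_path, compute=False)
        client.compute(write, sync=True)           # graph runs on the cluster's workers
    return out_path

with Flow("parquet-pipeline") as flow:
    merge_and_transform("/data/left/", "/data/right/", "/data/out/")

flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```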

Jan Therhaag

09/04/2019, 3:28 PM
Great, will give it a try!