# prefect-community
Basically I wonder what happens when I create a Dask dataframe within a task - can I still take advantage of parallelization over partitions of the collection when transforming it?
Yea sure! If you aren't actually connecting to a Dask cluster and are just using Dask's local parallelism, you can run Prefect with a local executor with no issues and let your individual tasks deal with Dask. Alternatively, if you do want to submit your Prefect tasks to a Dask cluster, you can still interact with that cluster from within your tasks using a Dask `worker_client`.
Thank you for the quick reply - I'm actually after the latter case. Would it be sufficient then to use `worker_client` as a context manager, so that operations inside it like `dd.read_parquet()` automatically take advantage of the cluster by submitting more tasks?
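A minimal sketch of the `worker_client` pattern being asked about, not taken from the thread: the function below plays the role of a Prefect task body (in a real flow it would be wrapped with `@task`), and plain `distributed` is used so the example is self-contained. The function name, cluster sizing, and the toy workload are all illustrative assumptions.

```python
# Sketch: a task body that submits more work to the cluster it runs on.
# In Prefect this function would be decorated with @task; here it is
# submitted directly so the example runs on its own.
from distributed import Client, LocalCluster, worker_client

def double_and_sum(xs):
    # This runs on a Dask worker. worker_client lets the task hand
    # additional work to the shared scheduler instead of blocking
    # the worker it occupies.
    with worker_client() as client:
        futures = client.map(lambda x: x * 2, xs)
        return sum(client.gather(futures))

if __name__ == "__main__":
    # Threaded local cluster so the sketch runs without extra setup.
    cluster = LocalCluster(n_workers=2, threads_per_worker=1, processes=False)
    with Client(cluster) as client:
        print(client.submit(double_and_sum, [1, 2, 3]).result())  # 12
    cluster.close()
```

The same shape applies to collection operations: anything built or computed inside the `with worker_client() as client:` block is scheduled on the shared cluster.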
@Jan Therhaag we have been doing the steps in your flow (read parquet files to Dask DataFrames, merge them, transform, write out to other parquet files) successfully using Prefect and a long-running Dask cluster. We're not using Kartothek, but the idea should be the same. Dask parallelism works as usual; it's just that a Prefect task running on a worker will submit task(s) to the Dask scheduler (e.g. `ddf = dd.read_parquet()`), which will get distributed to other Dask workers.
Great, will give it a try!