
Jan Therhaag

09/04/2019, 3:10 PM
Basically I wonder what happens when I create a Dask dataframe within a task - can I still take advantage of parallelization over partitions of the collection when transforming it?

Chris White

09/04/2019, 3:13 PM
Yea sure! If you aren't actually connecting to a Dask cluster and are just utilizing Dask's local parallelism, you could run Prefect with a LocalExecutor with no issues, and let your individual tasks deal with Dask. Alternatively, if you do want to submit your Prefect tasks to a Dask cluster, you can still interact with that cluster from within your tasks using a Dask worker_client.
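
A minimal sketch of the first option (LocalExecutor, with Dask's local scheduler doing the work inside the task), assuming the Prefect 0.6-era API that was current at the time; the parquet path and column name are placeholders:

```python
# Prefect runs the flow with a LocalExecutor; the task itself builds a
# Dask dataframe and lets Dask's local (threaded) scheduler parallelize
# over partitions.
import dask.dataframe as dd
from prefect import task, Flow
from prefect.engine.executors import LocalExecutor

@task
def summarize(path):
    ddf = dd.read_parquet(path)                   # lazy, partitioned dataframe
    return ddf.groupby("key").size().compute()    # computed on local threads

with Flow("local-dask") as flow:
    summarize("/data/input/*.parquet")

flow.run(executor=LocalExecutor())
```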

Jan Therhaag

09/04/2019, 3:24 PM
Thank you for the quick reply - I'm actually after the latter case. Would it be sufficient then to use worker_client as a context manager, so that operations inside it like dd.read_parquet() automatically take advantage of the cluster by submitting more tasks?

Joe Schmid

09/04/2019, 3:24 PM
@Jan Therhaag we have been doing the steps in your flow (read parquet files to Dask DataFrames, merge them, transform, write out to other parquet files) successfully using Prefect and a long-running Dask cluster. We're not using Kartothek, but the idea should be the same. Dask parallelism works as usual; it's just that a Prefect task running on a worker will submit task(s) to the Dask scheduler (e.g. ddf = dd.read_parquet()), which will get distributed to other Dask workers.
🚀 2
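
A hypothetical sketch of the kind of flow Joe describes (read parquet, merge, transform, write parquet), with the Prefect tasks submitted to the cluster via a DaskExecutor and the Dask work fanned out from inside the task through worker_client. The scheduler address, paths, and column names are made up, and the Prefect import path is the 0.6-era one:

```python
# Prefect tasks run on Dask workers (DaskExecutor); inside the task,
# worker_client gives access to the same cluster, so dask.dataframe
# operations are distributed across the other workers.
import dask.dataframe as dd
from distributed import worker_client
from prefect import task, Flow
from prefect.engine.executors import DaskExecutor

@task
def merge_and_transform(left_path, right_path, out_path):
    with worker_client() as client:
        left = dd.read_parquet(left_path)          # lazy, partitioned reads
        right = dd.read_parquet(right_path)
        merged = left.merge(right, on="id")
        transformed = merged.assign(total=merged["a"] + merged["b"])
        write = transformed.to_parquet(out_path, compute=False)
        client.compute(write, sync=True)           # graph runs on the cluster's workers
    return out_path

with Flow("parquet-pipeline") as flow:
    merge_and_transform("/data/left/", "/data/right/", "/data/out/")

flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```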

Jan Therhaag

09/04/2019, 3:28 PM
Great, will give it a try!