wiretrack

01/04/2021, 2:37 PM
Hey all, I have a question about Dask. In an ETL that leverages Pandas quite a lot, should I stick with the Dask Executor alone, or should I use both the executor and the Dask DataFrame? I'm having a bit of trouble understanding when to just use the scheduler/executor, and when to use both the executor and the specific dataframe APIs.

Kyle Moon-Wright

01/04/2021, 6:27 PM
Hey @wiretrack, the Dask Executor is useful for parallelizing your Prefect tasks regardless of the type of dataframe you're using, so feel free to use Pandas as much as you like! For simple tasks Pandas is perfectly viable, but Dask DataFrames can be useful for large amounts of data. I found Dask's dataframe tutorial pretty useful for getting a feel for this.
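A minimal sketch of that first pattern, assuming the Prefect 0.x API that was current at the time (the CSV paths are hypothetical): plain Pandas inside each task, with the DaskExecutor parallelizing the tasks themselves.
```
# Sketch (Prefect 0.x): Pandas inside the tasks, DaskExecutor
# parallelizing the tasks themselves. File names are hypothetical.
import pandas as pd
from prefect import task, Flow
from prefect.executors import DaskExecutor

@task
def extract(path: str) -> pd.DataFrame:
    # plain Pandas is fine while each chunk fits in one worker's memory;
    # dask.dataframe would be the swap-in for data that doesn't
    return pd.read_csv(path)

@task
def transform(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

with Flow("pandas-etl") as flow:
    # mapped tasks become parallel children on the Dask workers
    frames = extract.map(["a.csv", "b.csv", "c.csv"])
    cleaned = transform.map(frames)

# with no arguments, the executor spins up a temporary local Dask cluster
flow.run(executor=DaskExecutor())
```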

wiretrack

01/04/2021, 7:02 PM
Hey @Kyle Moon-Wright, thanks a lot! I've been reading a few articles, and from what I understood, the Executor is more of a distributed engine (somewhat similar to what Spark does, in the sense that it distributes computation across a cluster). So when I run dask-scheduler in one terminal and then dask-worker in another, I'm simulating multiple nodes running. The Dask DataFrame, on the other hand, would be a way of parallelizing the dataframe computation within the code, so nothing to do with distributed computation; in that sense it would be similar to Modin, organizing threads and processes (but not clusters).
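Concretely, I picture wiring a flow to that hand-started cluster like this (just a sketch; tcp://127.0.0.1:8786 is the default address dask-scheduler listens on, adjust to whatever yours prints):
```
# Sketch: point the executor at the cluster started by hand above.
#   terminal 1: dask-scheduler
#   terminal 2: dask-worker tcp://127.0.0.1:8786
from prefect.executors import DaskExecutor

# connect to the existing scheduler instead of spawning a temporary cluster
executor = DaskExecutor(address="tcp://127.0.0.1:8786")
flow.run(executor=executor)  # `flow` defined as in the earlier sketch
```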
So I could use a Modin dataframe (parallelizing the dataframe within a Python process) inside a task, and then run my Flow with the Dask Executor (distributing computation across a cluster).
Does that make sense?

Kyle Moon-Wright

01/04/2021, 7:21 PM
Yeah, that sounds awesome.

wiretrack

01/04/2021, 7:23 PM
Great, thanks a lot!
I wonder how that would work when using PySpark: would the Dask Executor distribute the already-distributed workload of Spark? 🤔

Kyle Moon-Wright

01/04/2021, 7:36 PM
Nothing ventured, nothing gained! It probably depends on how you ultimately set this up, but I think most Prefect users are content with just using Dask for their needs, without the unneeded complexity of multiple processing engines. I'd say just be careful with nested parallel computing engines in production, as I'd imagine you could inflate your resource usage rather quickly depending on the job. 💥
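If you do experiment with it locally, one guardrail is capping the temporary cluster the executor spins up (a sketch, assuming the default local-cluster backing; the worker counts are arbitrary examples):
```
# Sketch: cap the executor's temporary local cluster so nested engines
# can't multiply processes unchecked. Worker counts are arbitrary examples.
from prefect.executors import DaskExecutor

executor = DaskExecutor(
    cluster_kwargs={"n_workers": 2, "threads_per_worker": 1},
)
flow.run(executor=executor)
```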

wiretrack

01/04/2021, 7:44 PM
Yep, I can imagine! Awesome, thanks a lot. Can't wait to blow up my SSD, CPU, and fans with Prefect, PySpark, and the Dask Executor locally, never to use that combo again, lol!