wiretrack

01/04/2021, 2:37 PM
Hey all, I have a question about Dask. In an ETL that leverages Pandas quite a lot, should I stick with the Dask Executor alone, or should I use both the executor and the Dask DataFrame? I'm having a bit of trouble understanding when to just use the scheduler/executor, and when to use both the executor and the specific dataframe APIs.

Kyle Moon-Wright

01/04/2021, 6:27 PM
Hey @wiretrack, the Dask Executor is useful for parallelizing your Prefect tasks regardless of the type of dataframe you're using, so feel free to use Pandas as much as you like! For simple tasks Pandas is perfectly viable, but Dask DataFrames can be useful for large amounts of data. I found Dask's dataframe tutorial pretty useful for getting a feel for this.
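A minimal sketch of that first pattern, assuming the Prefect 0.x API that was current at the time (the CSV paths are hypothetical): plain Pandas inside each task, with the DaskExecutor parallelizing the tasks themselves.
```
# Sketch (Prefect 0.x): Pandas inside the tasks, DaskExecutor
# parallelizing the tasks themselves. File names are hypothetical.
import pandas as pd
from prefect import task, Flow
from prefect.executors import DaskExecutor

@task
def extract(path: str) -> pd.DataFrame:
    # plain Pandas is fine while each chunk fits in one worker's memory;
    # dask.dataframe would be the swap-in for data that doesn't
    return pd.read_csv(path)

@task
def transform(df: pd.DataFrame) -> pd.DataFrame:
    return df.dropna()

with Flow("pandas-etl") as flow:
    # mapped tasks become parallel children on the Dask workers
    frames = extract.map(["a.csv", "b.csv", "c.csv"])
    cleaned = transform.map(frames)

# with no arguments, the executor spins up a temporary local Dask cluster
flow.run(executor=DaskExecutor())
```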

wiretrack

01/04/2021, 7:02 PM
Hey @Kyle Moon-Wright, thanks a lot! I've been reading a few articles, and from what I understood, the Executor is more of a distributed engine (somewhat similar to what Spark does, in the sense that it distributes computation across a cluster). So when I run dask-scheduler in one terminal and then dask-worker in another, I'm simulating multiple nodes running. The Dask DataFrame, on the other hand, would be a way of parallelizing the dataframe computation within the code, so nothing to do with distributed computation; in that sense it would be similar to Modin, organizing threads and processes (but not clusters).
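Concretely, I picture wiring a flow to that hand-started cluster like this (just a sketch; tcp://127.0.0.1:8786 is the default address dask-scheduler listens on, adjust to whatever yours prints):
```
# Sketch: point the executor at the cluster started by hand above.
#   terminal 1: dask-scheduler
#   terminal 2: dask-worker tcp://127.0.0.1:8786
from prefect.executors import DaskExecutor

# connect to the existing scheduler instead of spawning a temporary cluster
executor = DaskExecutor(address="tcp://127.0.0.1:8786")
flow.run(executor=executor)  # `flow` defined as in the earlier sketch
```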
So I could use a Modin dataframe (parallelizing the dataframe within a Python process) inside a task, and then run my Flow with the Dask Executor (distributing computation across a cluster).
Does that make sense?

Kyle Moon-Wright

01/04/2021, 7:21 PM
Yeah, that sounds awesome.

wiretrack

01/04/2021, 7:23 PM
Great, thanks a lot!
I wonder how that would work when using PySpark: would the Dask Executor distribute the already-distributed workload of Spark? 🤔

Kyle Moon-Wright

01/04/2021, 7:36 PM
Nothing ventured, nothing gained! It probably depends on how you ultimately set this up, but I think most Prefect users are content with just using Dask for their needs, without the unneeded complexity of multiple processing engines. I'd say just be careful with nested parallel computing engines in production, as I'd imagine you could inflate your resource usage rather quickly depending on the job. 💥
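If you do experiment with it locally, one guardrail is capping the temporary cluster the executor spins up (a sketch, assuming the default local-cluster backing; the worker counts are arbitrary examples):
```
# Sketch: cap the executor's temporary local cluster so nested engines
# can't multiply processes unchecked. Worker counts are arbitrary examples.
from prefect.executors import DaskExecutor

executor = DaskExecutor(
    cluster_kwargs={"n_workers": 2, "threads_per_worker": 1},
)
flow.run(executor=executor)
```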

wiretrack

01/04/2021, 7:44 PM
Yep, I can imagine! Awesome, thanks a lot. Can't wait to blow up my SSD, CPU, and fans with Prefect, PySpark, and the Dask Executor locally, never to use that combo again, lol!