# ask-community
Omar Sultan
Hi guys, wondering if anyone has worked with Spark on Prefect. Is there a way to pass Spark DataFrames between two tasks? I'm trying to run Spark in client mode and pass the DataFrames between tasks.
Zach
I don't believe you can pickle a Spark DataFrame, IIRC. Prefect uses cloudpickle to pass data between tasks.
k
Hey @Omar Sultan, you would need to persist it somewhere (e.g., as a Parquet file) and then load it in downstream tasks.
☝️ 1
z
You probably don't want to do that, to be honest. You usually don't want to pull data back to the driver, which is what serializing the DataFrame to pass it between tasks would require.
It all depends on how large your dataset is, of course.
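To make Zach's point concrete, here's a minimal PySpark sketch (mine, not from the thread): `collect()` is what pulling data back to the driver looks like, while a Parquet write stays distributed. The app name and output path are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("driver-vs-distributed").getOrCreate()
df = spark.range(1_000_000)

# collect() materializes every partition in the driver's memory;
# serializing the DataFrame between tasks would imply this kind of pull.
rows = df.collect()

# A distributed write keeps the work on the executors: each partition
# is written in parallel, and only metadata comes back to the driver.
df.write.mode("overwrite").parquet("/tmp/driver_vs_distributed.parquet")
```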
k
I think when you use Spark in client mode, you're sending instructions to the remote cluster, and saving as Parquet is performed by the partitions, so it won't force a collect. And on the Prefect side, as long as you return the location of the file rather than the data, it's just the location string being serialized.
👍 1
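Here's a minimal sketch of the persist-and-pass-the-path pattern described above, assuming Prefect 1.x with the default local executor and a client-mode SparkSession; the task names and paths are illustrative, not from the thread.

```python
from prefect import task, Flow
from pyspark.sql import SparkSession

# Client-mode session; with Prefect's default local executor the tasks
# run in-process and can share it.
spark = SparkSession.builder.appName("prefect-parquet-handoff").getOrCreate()

@task
def build_and_persist(output_path: str) -> str:
    # The executors write the Parquet files partition by partition;
    # nothing is collected back to the driver.
    df = spark.range(1_000).withColumnRenamed("id", "value")
    df.write.mode("overwrite").parquet(output_path)
    # Return only the location string; that is all Prefect serializes.
    return output_path

@task
def load_and_count(input_path: str) -> int:
    # The downstream task reloads the data lazily from Parquet.
    return spark.read.parquet(input_path).count()

with Flow("spark-parquet-handoff") as flow:
    path = build_and_persist("/tmp/stage_one.parquet")
    load_and_count(path)

if __name__ == "__main__":
    flow.run()
```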
Omar Sultan
Hi guys, thanks a lot for your replies.
I've been reading through them and looking back at my design, and I think this approach isn't really the way to go: Spark DataFrames can't be pickled, and I would eventually need to save and reload the DataFrames between each step, which is counterproductive for my use case.
Thank you so much. I'll stick to firing spark-submit jobs from Prefect as part of the flow.
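For reference, a hedged sketch of that final approach: shelling out to spark-submit from a Prefect task via the standard library's subprocess, assuming Prefect 1.x. The master URL, deploy mode, and script path are placeholders.

```python
import subprocess

from prefect import task, Flow

@task
def submit_spark_job(script: str) -> None:
    # Shell out to spark-submit; check=True fails the Prefect task
    # if the Spark job exits non-zero. All DataFrame logic stays
    # inside the submitted script, so nothing Spark-specific needs
    # to pass between Prefect tasks.
    subprocess.run(
        ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", script],
        check=True,
    )

with Flow("spark-submit-flow") as flow:
    submit_spark_job("jobs/etl_step.py")

if __name__ == "__main__":
    flow.run()
```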
Zach
That’s what we do!
Omar Sultan
Thanks a lot, Zach!