# prefect-community
c
What's the recommended way to pass pandas DFs between tasks? With big DFs (1GB+) serialization is taking a lot of time.
👀 1
a
Wondering the same
j
Kind of new to Prefect, but from my experience working with orchestration tools (like Prefect, Airflow), they should be used more to orchestrate and less to transform data (especially when data starts to get big). You could leverage Prefect + Spark, for example, as an alternative. For your use case specifically, what you can also try is to persist the data in an intermediate storage layer (like s3, gcs) using parquet, and instead of passing the whole dataframe between tasks you pass the file path to the next task
a
@Joao Moniz What would you recommend for the case where the data clearly fits in memory, but Prefect is starting to give some problems? Spark feels like an unnecessary complication that will probably make the code run slower
j
Hi Andreas. Not sure; if it's something related to Prefect, it might be better to start a new thread with the logs, since someone from their team might be able to provide technical help. I was speaking more about general architecture, but you are right: if the data is small (fits in memory), it doesn't make sense to add Spark's complexity to it.
c
I am going to try caching data via parquet and not rely on Prefect
My DFs do fit in memory (big servers), but moving data between workers in Dask might be my bottleneck
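A quick way to see why this helps: compare the serialized size of the DataFrame itself (what gets shipped when a task returns the frame) against the serialized size of a path string. The path below is hypothetical and the frame is synthetic; this just quantifies the gap.

```python
import pickle

import numpy as np
import pandas as pd

# A modest synthetic frame; a real 1GB+ frame only widens the gap.
df = pd.DataFrame(np.random.rand(100_000, 10))

# What crosses the wire if the task returns the DataFrame itself.
frame_payload = pickle.dumps(df)

# What crosses the wire if the task returns a path (hypothetical location).
path_payload = pickle.dumps("/tmp/results/df.parquet")

print(f"frame: {len(frame_payload):,} bytes, path: {len(path_payload)} bytes")
```

The frame payload is megabytes while the path payload is tens of bytes, which is why caching to parquet and passing paths sidesteps the serialization and worker-to-worker transfer cost.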
👍 1
j
This tracking PR on Results encompasses a good number of PRs to improve results. The improvements there might be helpful. @Zanie might have more insight.
c
Nice, will follow. Thanks!
👍 1