Danny Vilela
02/24/2021, 12:57 AM
`HdfsDataFrameResult` for checkpointing PySpark DataFrames. Imagine a flow with just two tasks A -> B, where A produces a PySpark DataFrame and B uses that DataFrame. I’d like to make use of caching in case B fails, so that it can quickly read a check-pointed A from disk (here, HDFS).
To make use of caching, am I correct that I’d need to:
1. Set the environment variable `PREFECT__FLOWS__CHECKPOINTING=true`.
2. Implement an `HdfsDataFrameResult` (with interface `read`, `write`, `exists`).
3. Have Task A’s `run` explicitly return a PySpark DataFrame.
4. Initialize Task A within the `Flow` context manager as `TaskA(checkpoint=True, result=HdfsDataFrameResult(...))`.
Is that it? I guess I’m not 100% understanding the separation between the serializer and the result, and whether I need an `HdfsDataFrameSerializer`. It seems like the serializer is too low-level for PySpark DataFrames, but I’m happy to be proven wrong 🙂