Danny Vilela
02/24/2021, 12:57 AM
`HdfsDataFrameResult` for checkpointing PySpark DataFrames. Imagine a flow with just two tasks A -> B, where A produces a PySpark DataFrame and B uses that DataFrame. I’d like to make use of caching in case B fails, so that it can quickly read a check-pointed A from disk (here, HDFS).
To make use of caching, am I correct that I’d need to:
1. Set the environment variable `PREFECT__FLOWS__CHECKPOINTING=true`.
2. Implement an `HdfsDataFrameResult` (with interface `read`, `write`, `exists`).
3. Have Task A’s `run` explicitly return a PySpark DataFrame.
4. Initialize Task A within the `Flow` context manager as `TaskA(checkpoint=True, result=HdfsDataFrameResult(...))`.
Is that it? I guess I’m not 100% understanding the separation between the serializer and the result, and whether I need an `HdfsDataFrameSerializer`. It seems like the serializer is too low-level for PySpark DataFrames, but I’m happy to be proven wrong 🙂