# ask-community
Hi all, I’m trying to scope out the number of steps required to build an `HdfsDataFrameResult` for checkpointing PySpark DataFrames. Imagine a flow with just two tasks A -> B, where A produces a PySpark DataFrame and B uses that DataFrame. I’d like to make use of caching in case B fails, so that B can quickly reload the checkpointed output of A from disk (here, HDFS). To make use of caching, am I correct that I’d need to:

1. Set the environment variable `PREFECT__FLOWS__CHECKPOINTING=true`.
2. Implement an `HdfsDataFrameResult` (with the interface `read`, `write`, `exists`).
3. Have Task A’s `run` explicitly return a PySpark `DataFrame`.
4. Initialize Task A within the `Flow` context manager as `TaskA(checkpoint=True, result=HdfsDataFrameResult(...))`.

Is that it? (Rough sketches of what I have in mind are below.) I guess I’m not 100% understanding the separation between the serializer and the result, and whether I need an `HdfsDataFrameSerializer`. It seems like the serializer is too low-level for PySpark DataFrames, but I’m happy to be proven wrong 🙂
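To make step 2 concrete, here’s a rough sketch of the kind of `Result` subclass I’m imagining, assuming Prefect 1.x’s `Result` interface (`read`/`write`/`exists`) and plain parquet files on HDFS. `HdfsDataFrameResult`, the `base_path` argument, and the HDFS paths are all my own placeholders, not anything from the docs:

```python
from prefect.engine.result import Result
from pyspark.sql import SparkSession


class HdfsDataFrameResult(Result):
    """Sketch: checkpoint a PySpark DataFrame to HDFS as parquet."""

    def __init__(self, base_path: str, **kwargs):
        self.base_path = base_path.rstrip("/")  # e.g. "hdfs:///prefect-checkpoints" (placeholder)
        super().__init__(**kwargs)

    def _full_path(self, location: str) -> str:
        return f"{self.base_path}/{location}"

    def write(self, value_, **kwargs) -> "HdfsDataFrameResult":
        # format() renders the location template (e.g. "{flow_name}/{task_name}")
        # from the Prefect context values passed in via **kwargs
        new = self.format(**kwargs)
        new.value = value_
        value_.write.mode("overwrite").parquet(self._full_path(new.location))
        return new

    def read(self, location: str) -> "HdfsDataFrameResult":
        new = self.copy()
        new.location = location
        spark = SparkSession.builder.getOrCreate()
        new.value = spark.read.parquet(self._full_path(location))
        return new

    def exists(self, location: str, **kwargs) -> bool:
        # One (admittedly hacky) way to ask HDFS whether the checkpoint exists:
        # go through Spark's JVM gateway to the Hadoop FileSystem API.
        spark = SparkSession.builder.getOrCreate()
        jvm = spark._jvm
        conf = spark.sparkContext._jsc.hadoopConfiguration()
        fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)
        path = jvm.org.apache.hadoop.fs.Path(self._full_path(location.format(**kwargs)))
        return bool(fs.exists(path))
```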
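And here’s roughly how I’d wire up steps 3–4 (using the `@task` decorator form rather than a `Task` subclass for brevity, with a made-up location template and base path):

```python
from prefect import Flow, task
from pyspark.sql import SparkSession

# Step 1: PREFECT__FLOWS__CHECKPOINTING=true must be set in the run environment.


@task(
    checkpoint=True,
    result=HdfsDataFrameResult(
        base_path="hdfs:///prefect-checkpoints",  # placeholder
        location="{flow_name}/{task_name}",       # templated per task run
    ),
)
def task_a():
    # Step 3: run() explicitly returns the DataFrame.
    spark = SparkSession.builder.getOrCreate()
    return spark.range(100)


@task
def task_b(df):
    return df.count()


with Flow("pyspark-checkpointing") as flow:
    task_b(task_a())
```

(My working assumption is that because the Result itself does the parquet I/O, no custom serializer is involved, which is exactly the part I’m unsure about.)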