# ask-community
d
Hi all! I’m trying (again…) to write an `HdfsResult` class (a la `luigi.contrib.hdfs.target.HdfsTarget`; docs) but I’m stuck on the patterns the `prefect` codebase uses for the `location` attribute. It seems like the `Result` base class implements an interface that allows `location` at both initialization time and when calling `exists(...)` or `read(...)`. I guess my question is: why? Is it not enough to restrict the user to only pass `location` at initialization time and use that value throughout the `exists` and `read` methods?

Edit: follow-up question: why does the `prefect` codebase follow the pattern of creating a new `Result` instance during `Result.read(…)`, as opposed to updating the value (`self.value`) on the current instance?
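For concreteness, here’s a rough sketch of the kind of class I’m trying to write, following the `read`/`write`/`exists` signatures from the `Result` base class (the HDFS calls via `pyarrow.fs.HadoopFileSystem` are my own assumption, untested):

```python
from typing import Any

import pyarrow.fs
from prefect.engine.result import Result


class HdfsResult(Result):
    """Hypothetical sketch; not a working implementation."""

    def __init__(self, host: str = "default", port: int = 8020, **kwargs: Any) -> None:
        self.fs = pyarrow.fs.HadoopFileSystem(host, port)
        super().__init__(**kwargs)

    def exists(self, location: str, **kwargs: Any) -> bool:
        # `location` may be a template, so render it before checking
        info = self.fs.get_file_info(location.format(**kwargs))
        return info.type != pyarrow.fs.FileType.NotFound

    def read(self, location: str) -> "Result":
        # following the codebase pattern: return a new Result, don't mutate self
        new = self.copy()
        new.location = location
        with self.fs.open_input_stream(location) as f:
            new.value = new.serializer.deserialize(f.read())
        return new

    def write(self, value_: Any, **kwargs: Any) -> "Result":
        new = self.format(**kwargs)  # copies self and renders the location template
        new.value = value_
        with self.fs.open_output_stream(new.location) as f:
            f.write(new.serializer.serialize(new.value))
        return new
```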
k
My best guess is to support mapping. If you map with a result and use `self`, it will not be thread safe, and with multiprocessing (like on a `LocalDaskExecutor`) you may find that some files are overwritten and others are not written out because they ran very close to each other. Taking in a value gives you the flexibility to update it and be thread safe.

If you use templating, you need to replace the value of the location.
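To illustrate with a toy stand-in (not the actual Prefect source, just the shape of the pattern):

```python
from typing import Any


class TemplatedResult:
    # Toy stand-in for Result, just to show why copying beats mutating self
    def __init__(self, location: str, value: Any = None) -> None:
        self.location = location
        self.value = value

    def format(self, **kwargs: Any) -> "TemplatedResult":
        # copy first, then render: each mapped task run gets its own instance,
        # so concurrent runs never clobber a shared self.location
        return TemplatedResult(self.location.format(**kwargs), self.value)


template = TemplatedResult("results/{task_name}/{map_index}.bin")
a = template.format(task_name="etl", map_index=0)
b = template.format(task_name="etl", map_index=1)
assert template.location == "results/{task_name}/{map_index}.bin"  # template intact
assert a.location != b.location  # each mapped child writes to its own file
```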
d
Thanks @Kevin Kho! Can I ask about the role of the result serializer? For example, if we want to work with PySpark `DataFrame` objects, I’d want to set `new.value = spark.read.parquet(…)`. But from the other `Result.read(…)` examples, it seems like `prefect` actually wants us to serialize that data. Is that correct? If so, why? I know there’s a `PandasSerializer` (docs), but I’m not sure of the equivalent `SparkSerializer` 🤔
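To make that concrete, here’s roughly what I’m picturing for a Spark-flavored `read` (hypothetical and untested; `SparkHdfsResult` is just an illustrative name):

```python
from pyspark.sql import SparkSession

from prefect.engine.result import Result


class SparkHdfsResult(Result):
    """Hypothetical: let Spark do the I/O, so no byte-level serializer at all."""

    def read(self, location: str) -> "Result":
        spark = SparkSession.builder.getOrCreate()
        new = self.copy()
        new.location = location
        new.value = spark.read.parquet(location)
        return new
```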
From the top-level `Results` page (link) it says:

> In addition, you can specify a `Serializer` that transforms Python objects into bytes prior to being written to storage by a `Result`. The same `Serializer` will be used to recover the object from bytes later.

This makes the serializer seem optional; is that correct? I’m not sure it’s sensible to try pickling a Spark DataFrame, given the distributed memory model?
k
It is optional. You can use a serializer like this. I think it’s just that the default might be `PickleSerializer` or `JSONSerializer`. Yes, I don’t think you can serialize Spark DataFrames. But I think Prefect’s decision to serialize by default is for efficiency, to speed up saving/loading, which it does by default if checkpointing is on.
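For example, attaching a non-default serializer looks something like this (from memory, so double-check against the docs):

```python
from prefect import task
from prefect.engine.results import LocalResult
from prefect.engine.serializers import PandasSerializer


@task(
    checkpoint=True,
    result=LocalResult(
        dir="/tmp/results",
        location="df-{task_run_id}.csv",  # templated location
        serializer=PandasSerializer("csv"),  # instead of the pickle default
    ),
)
def make_df():
    import pandas as pd

    return pd.DataFrame({"a": [1, 2, 3]})
```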
If I were writing a result and didn’t want a serializer, I would maybe push that `NoOpSerializer` and make it the default serializer of my `Result` class.
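Something like this is the kind of no-op I mean (illustrative sketch):

```python
from typing import Any

from prefect.engine.serializers import Serializer


class NoOpSerializer(Serializer):
    """Passes values through untouched, for Results that handle
    persistence themselves and never go through bytes."""

    def serialize(self, value: Any) -> Any:
        return value

    def deserialize(self, value: Any) -> Any:
        return value
```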
d
So by default it doesn’t look like the base `Result` class enforces a serializer (i.e., `Result.__init__` defaults `self.serializer` to `None` if not provided). Would you use a `NoOpSerializer` just so that there technically “is” a serializer for the result? Is there some other prefect interface that does prefer a result having a serializer (even if it’s a no-op)?
k
Ah ok, my understanding was wrong. The `PrefectResult` is a good sample of attaching a default. I guess I would just add nothing, yep!
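i.e., roughly how `PrefectResult` attaches its default, adapted to your class (a sketch from memory, worth double-checking against the source; `NoOpSerializer` is the one sketched above):

```python
from typing import Any

from prefect.engine.result import Result


class HdfsResult(Result):
    def __init__(self, **kwargs: Any) -> None:
        # attach a default serializer unless the caller passes one explicitly,
        # mirroring how PrefectResult attaches its default serializer
        kwargs.setdefault("serializer", NoOpSerializer())  # from the sketch above
        super().__init__(**kwargs)
```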
d
Ahhh okay, gotcha! I guess explicit is better than implicit, so maybe I’ll borrow that `NoOpSerializer` 🙂 Thanks again @Kevin Kho!
k
Of course!