# ask-community
d
Hi all! I’m trying (again…) to write an `HdfsResult` class (a la `luigi.contrib.hdfs.target.HdfsTarget`; docs) but I’m stuck on the patterns the `prefect` codebase uses for the `location` attribute. It seems like the `Result` base class implements an interface that allows `location` at both initialization time and when calling `exists(...)` or `read(...)`. I guess my question is: why? Is it not enough to restrict the user to only pass `location` at initialization time and use that value throughout the `exists` and `read` methods?

Edit: follow-up question: why does the `prefect` codebase follow the pattern of creating a new `Result` instance during `Result.read(…)`, as opposed to updating the value (`self.value`) on the current instance?
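For concreteness, here’s a rough sketch of the kind of class I’m trying to write, following the `read`/`write`/`exists` signatures from the `Result` base class (the HDFS calls via `pyarrow.fs.HadoopFileSystem` are my own assumption, untested):

```python
from typing import Any

import pyarrow.fs
from prefect.engine.result import Result


class HdfsResult(Result):
    """Hypothetical sketch; not a working implementation."""

    def __init__(self, host: str = "default", port: int = 8020, **kwargs: Any) -> None:
        self.fs = pyarrow.fs.HadoopFileSystem(host, port)
        super().__init__(**kwargs)

    def exists(self, location: str, **kwargs: Any) -> bool:
        # `location` may be a template, so render it before checking
        info = self.fs.get_file_info(location.format(**kwargs))
        return info.type != pyarrow.fs.FileType.NotFound

    def read(self, location: str) -> "Result":
        # following the codebase pattern: return a new Result, don't mutate self
        new = self.copy()
        new.location = location
        with self.fs.open_input_stream(location) as f:
            new.value = new.serializer.deserialize(f.read())
        return new

    def write(self, value_: Any, **kwargs: Any) -> "Result":
        new = self.format(**kwargs)  # copies self and renders the location template
        new.value = value_
        with self.fs.open_output_stream(new.location) as f:
            f.write(new.serializer.serialize(new.value))
        return new
```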
k
My best guess is to support mapping. If you map with a result and use `self`, it will not be thread safe, and with multiprocessing (like on a `LocalDaskExecutor`) you may find that some files are overwritten and others are not written out because they ran very close to each other. Taking in a value gives you the flexibility to update it and be thread safe.

If you use templating, you need to replace the value of the location.
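To illustrate with a toy stand-in (not the actual Prefect source, just the shape of the pattern):

```python
from typing import Any


class TemplatedResult:
    # Toy stand-in for Result, just to show why copying beats mutating self
    def __init__(self, location: str, value: Any = None) -> None:
        self.location = location
        self.value = value

    def format(self, **kwargs: Any) -> "TemplatedResult":
        # copy first, then render: each mapped task run gets its own instance,
        # so concurrent runs never clobber a shared self.location
        return TemplatedResult(self.location.format(**kwargs), self.value)


template = TemplatedResult("results/{task_name}/{map_index}.bin")
a = template.format(task_name="etl", map_index=0)
b = template.format(task_name="etl", map_index=1)
assert template.location == "results/{task_name}/{map_index}.bin"  # template intact
assert a.location != b.location  # each mapped child writes to its own file
```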
d
Thanks @Kevin Kho! Can I ask about the role of the result serializer? For example, if we want to work with PySpark `DataFrame` objects, I’d want to set `new.value = spark.read.parquet(…)`. But from the other `Result.read(…)` examples, it seems like `prefect` actually wants us to serialize that data. Is that correct? If so, why? I know there’s a `PandasSerializer` (docs), but I’m not sure of the equivalent `SparkSerializer` 🤔
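To make that concrete, here’s roughly what I’m picturing for a Spark-flavored `read` (hypothetical and untested; `SparkHdfsResult` is just an illustrative name):

```python
from pyspark.sql import SparkSession

from prefect.engine.result import Result


class SparkHdfsResult(Result):
    """Hypothetical: let Spark do the I/O, so no byte-level serializer at all."""

    def read(self, location: str) -> "Result":
        spark = SparkSession.builder.getOrCreate()
        new = self.copy()
        new.location = location
        new.value = spark.read.parquet(location)
        return new
```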
From the top-level `Results` page (link) it says:

> In addition, you can specify a `Serializer` that transforms Python objects into bytes prior to being written to storage by a `Result`. The same `Serializer` will be used to recover the object from bytes later.

This makes the serializer seem optional; is that correct? I’m not sure it’s sensible to try pickling a Spark DataFrame, given the distributed memory model?
k
It is optional. You can use a serializer like this. I think it’s just that the default might be `PickleSerializer` or `JSONSerializer`. Yes, I don’t think you can serialize Spark DataFrames. But I think Prefect’s decision to serialize by default is for efficiency, to speed up saving/loading, which it does by default if checkpointing is on.
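For example, attaching a non-default serializer looks something like this (from memory, so double-check against the docs):

```python
from prefect import task
from prefect.engine.results import LocalResult
from prefect.engine.serializers import PandasSerializer


@task(
    checkpoint=True,
    result=LocalResult(
        dir="/tmp/results",
        location="df-{task_run_id}.csv",  # templated location
        serializer=PandasSerializer("csv"),  # instead of the pickle default
    ),
)
def make_df():
    import pandas as pd

    return pd.DataFrame({"a": [1, 2, 3]})
```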
If I were writing a result and didn’t want a serializer, I would maybe push that `NoOpSerializer` and make it the default serializer of my `Result` class.
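Something like this is the kind of no-op I mean (illustrative sketch):

```python
from typing import Any

from prefect.engine.serializers import Serializer


class NoOpSerializer(Serializer):
    """Passes values through untouched, for Results that handle
    persistence themselves and never go through bytes."""

    def serialize(self, value: Any) -> Any:
        return value

    def deserialize(self, value: Any) -> Any:
        return value
```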
d
So by default it doesn’t look like the base `Result` class enforces a serializer (i.e., `Result.__init__` defaults `self.serializer` to `None` if not provided). Would you use a `NoOpSerializer` just so that there technically “is” a serializer for the result? Is there some other prefect interface that does prefer a result having a serializer (even if it’s a no-op)?
k
Ah ok, my understanding was wrong. The `PrefectResult` is a good sample of attaching a default. I guess I would just add nothing, yep!
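i.e., roughly how `PrefectResult` attaches its default, adapted to your class (a sketch from memory, worth double-checking against the source; `NoOpSerializer` is the one sketched above):

```python
from typing import Any

from prefect.engine.result import Result


class HdfsResult(Result):
    def __init__(self, **kwargs: Any) -> None:
        # attach a default serializer unless the caller passes one explicitly,
        # mirroring how PrefectResult attaches its default serializer
        kwargs.setdefault("serializer", NoOpSerializer())  # from the sketch above
        super().__init__(**kwargs)
```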
d
Ahhh okay, gotcha! I guess explicit is better than implicit, so maybe I’ll borrow that `NoOpSerializer` 🙂 Thanks again @Kevin Kho!
k
Of course!