# ask-community
Kyle McChesney
Hello, does anyone have a more in-depth description of how Results work, specifically S3Result? What should I expect to find in those files for a given task? Example:
```
from prefect import task
from prefect.engine.results import S3Result

@task(result=S3Result('bucket', location='example.out'))
def example():
    return [1, 2, 3]
```
Is it just a pickle file that, when loaded, recreates the list [1, 2, 3]?
How does it work for more complicated returns, for example a task that returns a tuple or a pandas DataFrame?
Kevin Kho
Hey @Kyle McChesney, results are paired with Serializers, and the default is a JSONSerializer or PickleSerializer (not super sure right now). For a pandas DataFrame you would explicitly define the PandasSerializer with your result, like S3Result(…, serializer=PandasSerializer()). The default is PickleSerializer. The Serializer is used for both reading and writing.
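For concreteness, wiring a PandasSerializer into the result might look roughly like this (a sketch assuming Prefect 0.x import paths; the bucket name and key are placeholders):

```
from prefect import task
from prefect.engine.results import S3Result
from prefect.engine.serializers import PandasSerializer

import pandas


@task(
    result=S3Result(
        'my-bucket',                          # placeholder bucket
        location='example.csv',
        serializer=PandasSerializer('csv'),   # store/read the DataFrame as CSV rather than a pickle
    )
)
def make_frame() -> pandas.DataFrame:
    return pandas.DataFrame({'a': [1, 2, 3]})
```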
Kyle McChesney
I see. So I have a task that returns a tuple of two DataFrames, like so:
```
import os
from typing import Tuple

import pandas
from prefect import task
from prefect.engine.results import S3Result


@task(result=S3Result('bucket', location='example.out'))
def data(output_url) -> Tuple[pandas.DataFrame, pandas.DataFrame]:
    res_path = os.path.join(output_url, 'results.csv')
    res_summary_path = os.path.join(output_url, 'summary.csv')

    res = pandas.read_csv(res_path)
    res_summary = pandas.read_csv(res_summary_path)

    return res, res_summary
```
output_url is actually an S3 "directory" URL like s3://bucket/location/. Would I need a custom serializer to handle this? I ran this (without a serializer specified) and it seemed to produce a file on S3 which unpickles to the second DataFrame.
Kevin Kho
For a task like this with multiple outputs, yes, you would need to implement a custom serializer that takes in a tuple of DataFrames and handles it. I would honestly suggest just using the result inside the task, like:
```
@task()
def data():
    s3_res = S3Result(...)
    s3_res.write(res)
    s3_res.write(res_summary)


@task()
def data2():
    s3_res = S3Result(...)
    # read takes a stored location and returns a Result; .value holds the data
    res = s3_res.read(...).value
    res_summary = s3_res.read(...).value
```
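For reference, a custom serializer along those lines could look roughly like the sketch below (this assumes the Prefect 0.x Serializer interface of serialize(value) -> bytes and deserialize(bytes); the class name and delimiter are made up):

```
import io
from typing import Tuple

import pandas
from prefect.engine.serializers import Serializer


class TwoFrameSerializer(Serializer):
    """Packs a (results, summary) pair of DataFrames into a single payload as two CSV blobs."""

    SEP = b'\n---FRAME-BOUNDARY---\n'  # arbitrary delimiter; must never appear in the data

    def serialize(self, value: Tuple[pandas.DataFrame, pandas.DataFrame]) -> bytes:
        res, res_summary = value
        return res.to_csv(index=False).encode() + self.SEP + res_summary.to_csv(index=False).encode()

    def deserialize(self, value: bytes) -> Tuple[pandas.DataFrame, pandas.DataFrame]:
        raw_res, raw_summary = value.split(self.SEP)
        return pandas.read_csv(io.BytesIO(raw_res)), pandas.read_csv(io.BytesIO(raw_summary))
```

It would then be attached to the task as S3Result(..., serializer=TwoFrameSerializer()).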
But at this point the easiest way to achieve this is using the native df.to_csv + s3fs to write directly to the S3 location.
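That could be as little as the following (a sketch; pandas hands an 's3://' path to s3fs when it is installed, and the function name and paths here are just illustrative):

```
import pandas


def write_outputs(res: pandas.DataFrame, res_summary: pandas.DataFrame, output_url: str) -> None:
    # output_url is an 's3://bucket/prefix/' style URL; writing to it requires s3fs
    res.to_csv(output_url.rstrip('/') + '/results.csv', index=False)
    res_summary.to_csv(output_url.rstrip('/') + '/summary.csv', index=False)
```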
Kyle McChesney
Is there a way to do that which still works within the framework of the Results stuff (for checkpointing, etc.)? Can I say "this is the location of the results", but the task will handle writing the results?
Kevin Kho
Yes, you can return the location, like:
```
@task()
def data():
    s3_res = S3Result(...)
    out = s3_res.write(res)            # write returns a Result whose .location is the final path
    out_summary = s3_res.write(res_summary)
    return out.location
```
Or, in your case, return both locations.
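A downstream task could then take that location as an input and read the data back, e.g. (a sketch, assuming Prefect 0.x behaviour where S3Result.read(location) returns a Result and the deserialized payload is on .value; the bucket name is a placeholder):

```
from prefect import task
from prefect.engine.results import S3Result


@task()
def load(location: str):
    s3_res = S3Result('my-bucket')          # placeholder bucket
    return s3_res.read(location).value     # the deserialized object, e.g. a DataFrame
```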
Kyle McChesney
Excellent. Thanks @Kevin Kho