# ask-community
Kyle McChesney
Hello, does anyone have a more in-depth description of how Results work, specifically S3Result? What should I expect to find in those files for a given task? Example:
```
from prefect import task
from prefect.engine.results import S3Result

@task(result=S3Result('bucket', location='example.out'))
def example():
    return [1, 2, 3]
```
Is it just a pickle file that, when loaded, recreates the list [1, 2, 3]?
How does it work for more complicated returns, for example a task that returns a tuple or a pandas DataFrame?
Kevin Kho
Hey @Kyle McChesney, results are paired with Serializers, and the default is a JSONSerializer or PickleSerializer (not super sure right now). For a pandas DataFrame you would explicitly define the PandasSerializer with your result, like S3Result(…, serializer=PandasSerializer()). The default is PickleSerializer. The Serializer is used for both reading and writing.
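For concreteness, wiring a PandasSerializer into the result might look roughly like this (a sketch assuming Prefect 0.x import paths; the bucket name and key are placeholders):

```
from prefect import task
from prefect.engine.results import S3Result
from prefect.engine.serializers import PandasSerializer

import pandas


@task(
    result=S3Result(
        'my-bucket',                          # placeholder bucket
        location='example.csv',
        serializer=PandasSerializer('csv'),   # store/read the DataFrame as CSV rather than a pickle
    )
)
def make_frame() -> pandas.DataFrame:
    return pandas.DataFrame({'a': [1, 2, 3]})
```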
Kyle McChesney
I see. So I have a task that returns a tuple of two DataFrames, like so:
```
import os
from typing import Tuple

import pandas
from prefect import task
from prefect.engine.results import S3Result


@task(result=S3Result('bucket', location='example.out'))
def data(output_url) -> Tuple[pandas.DataFrame, pandas.DataFrame]:
    res_path = os.path.join(output_url, 'results.csv')
    res_summary_path = os.path.join(output_url, 'summary.csv')

    res = pandas.read_csv(res_path)
    res_summary = pandas.read_csv(res_summary_path)

    return res, res_summary
```
output_url is actually an S3 "directory" URL like s3://bucket/location/. Would I need a custom serializer to handle this? I ran this (without a serializer specified) and it seemed to produce a file on S3 which unpickles to the second DataFrame.
Kevin Kho
For a task like this with multiple outputs, yes, you would need to implement a custom serializer that takes in a tuple of DataFrames and handles it. I would honestly suggest just using the result inside the task, like:
```
@task()
def data():
    s3_res = S3Result(...)
    s3_res.write(res)
    s3_res.write(res_summary)


@task()
def data2():
    s3_res = S3Result(...)
    # read takes a stored location and returns a Result; .value holds the data
    res = s3_res.read(...).value
    res_summary = s3_res.read(...).value
```
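For reference, a custom serializer along those lines could look roughly like the sketch below (this assumes the Prefect 0.x Serializer interface of serialize(value) -> bytes and deserialize(bytes); the class name and delimiter are made up):

```
import io
from typing import Tuple

import pandas
from prefect.engine.serializers import Serializer


class TwoFrameSerializer(Serializer):
    """Packs a (results, summary) pair of DataFrames into a single payload as two CSV blobs."""

    SEP = b'\n---FRAME-BOUNDARY---\n'  # arbitrary delimiter; must never appear in the data

    def serialize(self, value: Tuple[pandas.DataFrame, pandas.DataFrame]) -> bytes:
        res, res_summary = value
        return res.to_csv(index=False).encode() + self.SEP + res_summary.to_csv(index=False).encode()

    def deserialize(self, value: bytes) -> Tuple[pandas.DataFrame, pandas.DataFrame]:
        raw_res, raw_summary = value.split(self.SEP)
        return pandas.read_csv(io.BytesIO(raw_res)), pandas.read_csv(io.BytesIO(raw_summary))
```

It would then be attached to the task as S3Result(..., serializer=TwoFrameSerializer()).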
But at this point the easiest way to achieve this is using the native df.to_csv + s3fs to write directly to the S3 location.
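That could be as little as the following (a sketch; pandas hands an 's3://' path to s3fs when it is installed, and the function name and paths here are just illustrative):

```
import pandas


def write_outputs(res: pandas.DataFrame, res_summary: pandas.DataFrame, output_url: str) -> None:
    # output_url is an 's3://bucket/prefix/' style URL; writing to it requires s3fs
    res.to_csv(output_url.rstrip('/') + '/results.csv', index=False)
    res_summary.to_csv(output_url.rstrip('/') + '/summary.csv', index=False)
```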
Kyle McChesney
Is there a way to do that which still works within the framework of the Results stuff (for checkpointing, etc.)? Can I say "this is the location of the results", but the task will handle writing the results?
Kevin Kho
Yes, you can return the location, like:
```
@task()
def data():
    s3_res = S3Result(...)
    out = s3_res.write(res)            # write returns a Result whose .location is the final path
    out_summary = s3_res.write(res_summary)
    return out.location
```
Or, in your case, return both locations.
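A downstream task could then take that location as an input and read the data back, e.g. (a sketch, assuming Prefect 0.x behaviour where S3Result.read(location) returns a Result and the deserialized payload is on .value; the bucket name is a placeholder):

```
from prefect import task
from prefect.engine.results import S3Result


@task()
def load(location: str):
    s3_res = S3Result('my-bucket')          # placeholder bucket
    return s3_res.read(location).value     # the deserialized object, e.g. a DataFrame
```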
Kyle McChesney
Excellent. Thanks @Kevin Kho