e

Evan Curtin

10/19/2022, 2:57 PM
Does anybody have a best practice for storing large data as a `Result` in 2.6+? I am thinking of passing data between tasks, persisted to remote storage as Parquet (e.g. using Spark).
j

Jean Luciano

10/19/2022, 3:46 PM
You can use any of our available storage blocks to store your results.
e

Evan Curtin

10/19/2022, 8:55 PM
I am a bit confused, since the default serializers imply that I can turn my data into bytes in memory. This is not possible for large data.
z

Zanie

10/20/2022, 1:39 PM
Hey Evan, we don't serialize the result unless it is persisted, because serialization has a lot of overhead and restricts the types of data users can pass between tasks.
You can disable the in-memory cache and load the persisted, serialized data to reduce memory consumption for large data.
e

Evan Curtin

10/20/2022, 1:44 PM
I’m still a bit fuzzy on the details, so I’ll give an example. I want to pass a Spark dataframe between tasks and save the intermediate output to S3. Should I disable the in-memory cache and write a serializer that saves/loads a Spark dataframe to/from S3?
Does it make sense to use this pattern? I was testing with a serializer that saves to a remote location and returns the URI of that location, and a deserializer that reads from the URI.
I understand it’s a bit of a misnomer.
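The pattern described above (the serializer writes the data to a remote location and returns only its URI, and the deserializer reads from that URI) can be sketched roughly as follows. This is a minimal illustration, not Prefect code: `UriReferenceSerializer` is a hypothetical class, a local temporary directory stands in for S3, and JSON stands in for Parquet. The `dumps`/`loads` names are assumptions chosen to resemble a bytes-in, bytes-out serializer interface.

```python
import json
import tempfile
import uuid
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


class UriReferenceSerializer:
    """Hypothetical sketch: persist the payload elsewhere, serialize only its URI."""

    def __init__(self, base_dir):
        # Stand-in for a remote prefix such as an S3 bucket path.
        self.base_dir = Path(base_dir).resolve()

    def dumps(self, rows) -> bytes:
        # Write the payload to its own file (JSON here; Parquet in the real use case).
        path = self.base_dir / f"{uuid.uuid4().hex}.json"
        path.write_text(json.dumps(rows))
        # The "serialized" result is just the location of the data.
        return path.as_uri().encode("utf-8")

    def loads(self, blob: bytes):
        # Decode the URI and read the payload back from that location.
        uri = blob.decode("utf-8")
        path = Path(url2pathname(urlparse(uri).path))
        return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    serializer = UriReferenceSerializer(tmp)
    reference = serializer.dumps([{"id": 1}, {"id": 2}])
    restored = serializer.loads(reference)
```

The bytes handed to result storage hold only the reference. With Spark, the write inside `dumps` would presumably be the dataframe's own Parquet writer, so the full dataset would not need to be turned into bytes in memory.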
z

Zanie

10/20/2022, 2:23 PM
Do you want the spark dataframe to pass between tasks in-memory or do you want the downstream task to read it from s3?
e

Evan Curtin

10/20/2022, 2:25 PM
Reading from S3 is fine, I think,
but it would be nice if it read from S3 for a cache hit but otherwise passed the df in memory.
z

Zanie

10/20/2022, 2:31 PM
If you use a `cache_key_fn` and set `result_storage` to S3, it’ll persist the return value to S3 on completion. If there’s a cache hit, it’ll be pulled from S3. Otherwise, it’ll pass in memory.
If you want to drop it from memory so it always pulls from S3, you can just add `cache_result_in_memory=False`.
The only reason to change the serializer here is to add support for Parquet. I’m going to work on a serializer for that use-case in the near future, I think.
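A rough sketch of the task options mentioned above, for anyone skimming the thread. The keyword names `cache_key_fn`, `result_storage`, and `cache_result_in_memory` come from the message; everything else is an assumption: `params_cache_key` is a hypothetical key function written to match the `(context, parameters)` call shape, and `"s3/my-results"` is a made-up slug for an S3 storage block that would have to exist already. Prefect also ships a ready-made `task_input_hash` that could be used in place of the custom function.

```python
import hashlib
import json


def params_cache_key(context, parameters):
    """Hypothetical cache_key_fn: hash the task's parameters into a stable key.

    `context` is the task run context (unused here); `parameters` is a dict
    mapping parameter names to the values the task was called with.
    """
    payload = json.dumps(parameters, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Options to pass to the task decorator.
TASK_OPTIONS = {
    "cache_key_fn": params_cache_key,
    # Assumed slug of a pre-registered S3 storage block (hypothetical name).
    "result_storage": "s3/my-results",
    # Drop the return value from memory so downstream reads always hit storage.
    "cache_result_in_memory": False,
}

# With Prefect 2.6+ installed, usage would look roughly like:
#
#   from prefect import task
#
#   @task(**TASK_OPTIONS)
#   def build_features(day: str):
#       ...
```

Leaving out `cache_result_in_memory=False` should give the behaviour described in the message: the value is passed in memory on a normal run and pulled from S3 only on a cache hit.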
🙌 1
:thank-you: 1
e

Evan Curtin

10/20/2022, 2:34 PM
That sounds like what I want. I just want to control the persistence: I want to write partitioned Parquet to S3.
BTW, I am totally willing to contribute such functionality, either directly or in a plugin or something.
Just want to make sure I don’t butcher the design 🙂
z

Zanie

10/21/2022, 2:32 PM
That'd be sweet! You could add it in a collection? Our template makes it pretty easy to get started.
e

Evan Curtin

10/21/2022, 2:33 PM
What’s a collection?
I’m going on vacation for 3 weeks, so you won’t hear from me until I’m back 😄
z

Zanie

10/21/2022, 2:57 PM
e

Evan Curtin

10/21/2022, 2:58 PM
tyty
Link is broken for me, did it move? Also, I’m back.
j

Jeff Hale

11/15/2022, 1:27 PM
Yes. https://docs.prefect.io/collections/catalog/ I’ll see if we can get a redirect up. Thanks! Welcome back!
🙌 1