Does anybody have a best practice for storing larg...
# ask-community
e
Does anybody have a best practice for storing large data as a Result in 2.6+? I am thinking of passing data between tasks, persisted to remote storage as Parquet (e.g. using Spark).
j
You can use any of our available storage blocks to store your results.
e
I am a bit confused since the default serializers imply that I can turn my data into bytes in memory. This is not possible for large data
z
Hey Evan, we don't serialize the result unless it is persisted because it has a lot of overhead and restricts the types of data users can pass between tasks.
You can disable the in-memory cache and load the persisted, serialized data instead, which reduces memory consumption for large data.
e
I’m still a bit fuzzy on the details so I’ll give an example. I want to pass a Spark dataframe between tasks and save the intermediate out to S3. Should I disable the in-memory cache and write a serializer that saves/loads a Spark dataframe to S3?
Does it make sense to do this pattern? I was testing it out with a serializer that saves to a remote location and returns the URI of the location, and the deserializer reads from the URI
I understand it’s a bit of a misnomer
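The save-and-return-a-location pattern described above might look roughly like the sketch below. It is an illustration under stated assumptions, not Prefect's serializer interface: `UriPointerSerializer`, its `dumps`/`loads` methods, and `base_dir` are hypothetical names, a local temporary directory stands in for S3, and JSON stands in for a partitioned Parquet write.

```python
import json
import tempfile
import uuid
from pathlib import Path
from typing import Any, Dict, List

Rows = List[Dict[str, Any]]


class UriPointerSerializer:
    """Hypothetical serializer: the 'serialized' bytes are only a pointer to the data."""

    def __init__(self, base_dir: Path) -> None:
        # Assumption: base_dir stands in for a remote prefix such as an S3 path.
        self.base_dir = base_dir

    def dumps(self, rows: Rows) -> bytes:
        # Write the payload to storage under a unique name. With Spark this
        # step would be a Parquet write rather than a JSON dump.
        target = self.base_dir / f"{uuid.uuid4().hex}.json"
        target.write_text(json.dumps(rows))
        # Only the location is returned, so the payload never has to fit
        # into a single in-memory bytes object.
        return str(target).encode("utf-8")

    def loads(self, blob: bytes) -> Rows:
        # Decode the location and read the payload back from storage.
        location = Path(blob.decode("utf-8"))
        return json.loads(location.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        serializer = UriPointerSerializer(Path(tmp))
        pointer = serializer.dumps([{"id": 1}, {"id": 2}])
        print(pointer.decode("utf-8"))
        print(serializer.loads(pointer))
```

The misnomer is visible here: nothing is really serialized, the bytes are just an address. The trade-off is that whatever stores those bytes holds only a pointer, so the data at that location has to outlive it.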
z
Do you want the spark dataframe to pass between tasks in-memory or do you want the downstream task to read it from s3?
e
reading from S3 I think is fine
but it would be nice if it read from S3 on a cache hit but otherwise passed the df in memory
z
If you use a cache_key_fn and set result_storage to S3, it’ll persist the return value to S3 on completion. If there’s a cache hit, it’ll be pulled from S3. Otherwise, it’ll pass in memory.
If you want to drop it from memory so it always pulls from S3, you can just add cache_result_in_memory=False.
The only reason to change the serializer here is to add support for Parquet. I’m going to work on a serializer for that use-case in the near future, I think.
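To make the behaviour described above concrete, here is a toy model in plain Python. It is not Prefect code and not how Prefect is implemented: `toy_task` and `storage_dir` are hypothetical, a local directory stands in for S3, and pickle stands in for the result serializer. Only the parameter names `cache_key_fn` and `cache_result_in_memory` are borrowed from the message above. Each call returns the value together with a label saying where it came from.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Any, Callable, Tuple


def toy_task(
    storage_dir: Path,
    cache_key_fn: Callable[[Tuple[Any, ...]], str],
    cache_result_in_memory: bool = True,
):
    """Hypothetical decorator that mimics persist-on-completion with a cache lookup."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Tuple[Any, str]]:
        def wrapper(*args: Any) -> Tuple[Any, str]:
            path = storage_dir / f"{cache_key_fn(args)}.pkl"
            if path.exists():
                # Cache hit: skip the computation and pull the value from storage.
                return pickle.loads(path.read_bytes()), "storage"
            value = fn(*args)
            # Persist the return value on completion.
            path.write_bytes(pickle.dumps(value))
            if cache_result_in_memory:
                # Cache miss: pass the live object along in memory.
                return value, "memory"
            # In-memory copy not kept: the caller gets a fresh read from storage.
            return pickle.loads(path.read_bytes()), "storage"

        return wrapper

    return decorator


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:

        @toy_task(Path(tmp), cache_key_fn=lambda args: "-".join(map(str, args)))
        def square(x: int) -> int:
            return x * x

        print(square(4))  # first call computes and persists
        print(square(4))  # second call with the same key is a cache hit
```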
🙌 1
🙏 1
e
that sounds like what I want. I just want to control the persistence: I want to write partitioned Parquet to S3
BTW I am totally willing to contribute such functionality, or put it in a plugin or something
Just want to make sure i don’t butcher the design 🙂
z
That'd be sweet! You could add it in a collection? Our template makes it pretty easy to get started.
e
what’s a collection?
I’m going on vacation for 3 weeks so you won’t hear from me until I’m back 😄
z
e
tyty
Link is broken for me, did it move? Also I’m back
j
Yes. https://docs.prefect.io/collections/catalog/ I’ll see if we can get a redirect up. Thanks! Welcome back!
🙌 1