Hello guys,
I have a question about task RAM usage: is it possible to save results to S3 or other storage without keeping all results in RAM?
I need this because I would like to process a large file that is split into x subfiles: load each subfile in a separate task, perform the necessary operations, save the output to a pickle file, and then load that pickle file in another task.
The goal is to keep the tasks organised in the Prefect flow (I would like to track each subfile operation as a Prefect task) and to keep RAM usage as low as possible by not holding all the data in RAM (only 1 subfile should be in RAM at a time).
For now I'm trying to use S3 storage results (https://docs.prefect.io/orchestration/execution/storage_options.html#aws-s3), but it seems RAM is not freed after the result is saved to the pickle file.
Any ideas about this problem?
Amanda Wee
05/03/2021, 10:36 AM
Yes, instead of returning the result itself, return a handle to the result. You could refer to this doc for the section on persisting user-created Results for one approach:
https://docs.prefect.io/core/concepts/results.html
Domantas
05/03/2021, 10:54 AM
Thank you for your response!
Just to double-check that I understood correctly: the solution would be to return the S3Result location value (the pickle path) and pass it to another task when the data needs to be loaded?
Amanda Wee
05/03/2021, 1:02 PM
Yes, that is a solution (but not the only possible one, of course). This way the flow keeps the location in memory (for retries etc.), but not the big chunk of data that you're operating on.
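To illustrate the "return a handle, not the data" idea, here is a minimal sketch using only the standard library. It is not Prefect code: the function names `process_subfile` and `consume_subfile` are hypothetical, a local temp directory stands in for the S3 bucket, and the doubling step is a placeholder for the real operations. In a flow, each function would presumably be a task, and the returned string would be the S3 location.

```python
import pickle
import tempfile
from pathlib import Path


def process_subfile(rows, out_dir, name):
    """Process one subfile, pickle the output, and return only its path."""
    processed = [r * 2 for r in rows]  # placeholder for the real operations
    path = Path(out_dir) / f"{name}.pkl"
    with open(path, "wb") as f:
        pickle.dump(processed, f)
    return str(path)  # a small handle; the data itself is not returned


def consume_subfile(path):
    """Load one pickled subfile from its handle and reduce it to a small value."""
    with open(path, "rb") as f:
        data = pickle.load(f)
    return sum(data)


out_dir = tempfile.mkdtemp()  # stand-in for the S3 bucket (assumption)
# Only short path strings are passed between the two steps.
handles = [process_subfile(range(i, i + 3), out_dir, f"part{i}") for i in range(3)]
totals = [consume_subfile(h) for h in handles]
```

Because the value passed between steps is only a short string, whatever the flow keeps around for retries stays small regardless of the subfile size.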
Domantas
05/03/2021, 1:23 PM
May I ask what are other approaches for solving this problem? I would like to test all options/approaches and pick one that suits the best 🙂
Kevin Kho
05/03/2021, 1:42 PM
Hi @Domantas, you can also try explicit garbage collection in a task after writing the data out. Then, like Amanda said, load it in the next tasks.
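A rough sketch of what explicit garbage collection after the write could look like, again with plain Python rather than Prefect. The function name `write_and_release` and the generated data are hypothetical. Note that in CPython `del` drops the local reference and `gc.collect()` additionally cleans up reference cycles; whether the process actually returns the memory to the operating system depends on the allocator, so this is worth measuring rather than assuming.

```python
import gc
import os
import pickle
import tempfile


def write_and_release(out_path):
    """Write a large object to disk, then drop it before returning the path."""
    data = list(range(100_000))  # stand-in for one processed subfile
    with open(out_path, "wb") as f:
        pickle.dump(data, f)
    del data      # drop the only reference to the large object
    gc.collect()  # also collect any reference cycles left behind
    return out_path  # only the location leaves the function


path = write_and_release(os.path.join(tempfile.mkdtemp(), "part0.pkl"))

# A later step reloads the data from the returned location.
with open(path, "rb") as f:
    reloaded = pickle.load(f)
```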
Domantas
05/03/2021, 4:04 PM
Alright. Thank you very much @Amanda Wee, @Kevin Kho for the help!
Rob Fowler
05/04/2021, 11:59 PM
I store the results in Redis. I was initially sharing the Redis instance with another app stack and I blew it up. 🙂 Just a word of warning.
I have a TTL of 23 hours so daily tasks get cleaned up but can be re-run the next morning (the TTL is a RedisResult parameter, so I can pick and choose).