Ethienne Marcelin

01/24/2023, 3:56 PM
Hi, I'm trying out Prefect and would need your guidance for my project. My project is a series of "tasks" (which would be flows in Prefect language), say A -> B -> C -> D -> ..., which are run sequentially. These tasks take huge files as input; they can't fit in RAM most of the time, so we load them in chunks. They also output big files, which are then passed to the other subflows downstream. How should I manage the storage of these files leveraging Prefect, given that it has to be local (uploading/downloading from buckets would be too slow)? I also want to take into account that I may want to save intermediate results in case of a crash, or do things like this:

A -> B -> C -> D -> ...
|              ^
v______________|

where task D takes as input the output files of A and C.

Christopher Boyd

01/26/2023, 11:40 PM
Prefect isn’t doing anything unique here in this case - you still need to process the input files one way or another; Prefect is just handling the pipelining and execution. If they are large files that don’t fit in the memory of whatever compute you are trying to use, then you’d need to properly process / format them locally on disk so they can be batched in chunks. Dask and Ray are options for distributed work, but they are still predicated upon having enough memory in the first place
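The "batched in chunks" part needs no Prefect machinery at all; a plain generator keeps at most one chunk in RAM. A minimal sketch (the function name and default chunk size are assumptions):

```python
from pathlib import Path
from typing import Iterator

def read_in_chunks(path: Path, chunk_size: int = 64 * 1024 * 1024) -> Iterator[bytes]:
    """Yield successive chunks of a file so only one chunk is in RAM at a time."""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            yield chunk
```

A task body can then iterate `for chunk in read_in_chunks(path): ...` regardless of how large the file on disk is.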
Prefect won’t solve the memory usage of your file size or storage. Regarding persisting results - it depends. If you require the actual output of the processing (e.g. there is artifact data required as input to the next step), then you would need to persist those results yourself (writing to a file, or however you might do that natively). If the action is idempotent (say a file or action is taken to write some piece of data into a database) and you don’t need to repeat the action, only know that the action was already taken successfully, you can use result caching with minimal effort

Ethienne Marcelin

02/01/2023, 3:54 PM
thanks 😍