03/27/2020, 10:24 PM
Hello there 👋, I'm a new user of Prefect, coming from the Airflow world, where I happily used it for 3+ years. 🎂 I want to begin by saying that the project, the quality of the code, and the quality of the documentation are outstanding 🤩 But I need some help finding the right way to use it and the best practices 🤓

I have a standard ELT flow with a big JSON file (1 GB) as input. For my task to run successfully on my medium-sized machine, I combine ijson and iterators to read and write the file to disk chunk by chunk so I don't overload the memory (I can't fit a 1 GB JSON dict in memory). Then I load the file directly into my DB, without passing through Python.

What is the Prefect way of handling a use case like this? 🤔 Prefect encourages passing data from task to task in memory, but here I offload it to disk and only pass the path of the file between tasks. Is there a way to pass an iterator between tasks instead of a single object? One way I'm thinking of doing it in an industrialized way is maybe to share a file cache between tasks. What do you guys think about it?
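The chunk-by-chunk approach described above can be sketched roughly as follows. This is a minimal, stdlib-only illustration: the `records` generator stands in for a real ijson stream (a real flow would use something like `ijson.items(open(path, "rb"), "item")`), and `chunked` is a hypothetical helper built on `itertools.islice`.

```python
import json
import tempfile
from itertools import islice

def chunked(iterator, size):
    """Yield lists of at most `size` items from `iterator`."""
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

# Hypothetical stand-in for an ijson stream over a huge JSON array;
# only one chunk of records is ever held in memory at a time.
records = iter({"id": i} for i in range(10))

# Write each chunk to disk as JSON Lines so memory stays bounded.
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as out:
    for batch in chunked(records, 4):
        for record in batch:
            out.write(json.dumps(record) + "\n")
    path = out.name  # downstream steps receive only this path
```

The resulting JSON Lines file can then be bulk-loaded into the database directly, without materializing the full document in Python.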
:marvin: 1

Chris White

03/27/2020, 11:22 PM
Hi Pierre, while in some situations returning an iterator would work, it's not a supported way of passing data. The reason is that iterators can't be recovered in new Python processes, which matters when you need to recover from failure.

Sharing a file cache between tasks makes sense! While Prefect does support sharing "data", the data doesn't necessarily have to be the underlying data you are processing; it could be a dynamic reference to where that data lives. Maybe others from the community will chime in on how they handle situations like this, but I wouldn't consider that bad practice!
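The pattern Chris describes, where tasks exchange a *reference* (a file path) rather than the data itself, could look something like this. To keep the sketch dependency-free, these are plain functions; in a real flow they would be Prefect tasks, and the function names (`extract`, `load`) are illustrative, not part of any API.

```python
import json
import tempfile

def extract(records):
    """Stream records to disk; return only the path, not the data."""
    with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False) as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
        return f.name

def load(path):
    """Read the file back line by line, e.g. to feed a DB bulk load."""
    count = 0
    with open(path) as f:
        for line in f:
            json.loads(line)  # validate; a real task would load into the DB
            count += 1
    return count

# Only the small path string crosses the task boundary.
path = extract({"id": i} for i in range(5))
loaded = load(path)
```

Because the reference is just a string, it survives process restarts and retries in a way an in-memory iterator cannot, which is exactly why the iterator approach isn't supported.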
👍 1