We are running a typical ETL transform in Prefect on a dataset that does not fit in memory, and we are looking for the best way to apply multiple transformations to the data. Currently we have one task that:
• Queries the database
• Transforms the data one record at a time
• Saves each transformed record to a file
Does Prefect have a good way to split this into multiple tasks without overrunning memory? Something like Task A (read data) -> Task B (first transform) -> Task C (second transform) -> Task D (write data)?
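(For context on the memory concern: if each stage is a separate task that returns its full result, the whole dataset gets materialized between tasks. One way to keep memory bounded is to make the stages streaming, so only one record is in flight at a time. The sketch below shows that idea in plain Python generators, not Prefect's API; `read_records` and `write_records` are hypothetical stand-ins for the database query and the output file.)

```python
from typing import Iterable, Iterator

def read_records() -> Iterator[dict]:
    # Stand-in for a DB cursor that yields rows lazily.
    for i in range(5):
        yield {"id": i, "value": i * 10}

def first_transform(records: Iterable[dict]) -> Iterator[dict]:
    for r in records:
        yield {**r, "value": r["value"] + 1}

def second_transform(records: Iterable[dict]) -> Iterator[dict]:
    for r in records:
        yield {**r, "value": r["value"] * 2}

def write_records(records: Iterable[dict]) -> list:
    # Stand-in for appending each transformed record to a file.
    out = []
    for r in records:
        out.append(r)
    return out

# Task A -> Task B -> Task C -> Task D, one record in flight at a time.
result = write_records(second_transform(first_transform(read_records())))
```

Because each stage is a generator, no stage ever holds more than one record, regardless of how many transforms are chained.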
Samuel Hinton
03/24/2021, 2:37 PM
We have a similar use case. Our Prefect tasks use an external Dask executor so that a fixed number of tasks run in parallel at any one time; each task grabs a portion of the data, processes it, and saves it out. You can see our schematic below, and it seems to work pretty well 🙂 The tasks get the parameters, sanitise them, and then run a collection of getdata/process/save steps.
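That partitioned pattern can be sketched with stdlib primitives, independent of Prefect's or Dask's API: each mapped unit of work fetches one slice of the data, processes it, and saves it end-to-end, and a fixed-size worker pool caps how many partitions are in memory at once. The `get_data`/`process`/`save` functions here are hypothetical stand-ins, not the actual tasks from the schematic.

```python
from concurrent.futures import ThreadPoolExecutor

def get_data(partition: int) -> list:
    # Stand-in for querying one slice of the source table.
    return list(range(partition * 10, partition * 10 + 10))

def process(rows: list) -> list:
    # Stand-in for the per-record transform.
    return [r * 2 for r in rows]

def save(partition: int, rows: list) -> int:
    # A real task would write rows to a file; here we just report a count.
    return len(rows)

def run_partition(partition: int) -> int:
    # One partition handled end-to-end, so its rows are freed when it finishes.
    return save(partition, process(get_data(partition)))

# A fixed number of workers bounds peak memory, much like running the
# mapped tasks on a fixed-size Dask cluster.
with ThreadPoolExecutor(max_workers=4) as pool:
    counts = list(pool.map(run_partition, range(8)))
```

The key design choice is that partitioning happens before the pipeline fans out, so no single task ever sees the whole dataset.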