My data pipelines are starting to get rather complex, and I have an abstract idea of the type of tool I need... but I'm not sure if it exists or what it's called.
I have a process that starts with a SoundCloud artist ID, then follows these steps:
• scrape 2 pages of latest tracks from the artist
• feed the list of tracks to ChatGPT along with a prompt asking it to identify periodic releases (e.g. weekly or monthly radio shows). I ask for the canonical radio show name + a matching regex in the response
• for each identified release, create a new playlist + cron job that continually updates the playlist with the latest releases
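The steps above could be sketched roughly like this. All the function names and data shapes here are hypothetical stand-ins for the real scraper, LLM call, and playlist/cron APIs:

```python
import re

def scrape_latest_tracks(artist_id, pages=2):
    # Stand-in for the real SoundCloud scraper; returns track titles.
    return [f"Show Radio 00{i}" for i in range(1, 5)] + ["One-off single"]

def identify_periodic_releases(tracks):
    # Stand-in for the ChatGPT step: canonical show name + regex.
    return [{"name": "Show Radio", "regex": r"^Show Radio \d+"}]

def matching_tracks(tracks, show):
    return [t for t in tracks if re.match(show["regex"], t)]

def run_pipeline(artist_id):
    tracks = scrape_latest_tracks(artist_id, pages=2)   # step 1
    shows = identify_periodic_releases(tracks)          # step 2
    # Step 3 would create a playlist + cron job per show; here we
    # just return what each playlist would contain.
    return {s["name"]: matching_tracks(tracks, s) for s in shows}
```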
I want to be able to store the data after each step, but each step should also be re-runnable, so I want to keep a history of each data point alongside other factors: e.g. the version of the transform script, so that if I update the code behind a step in future I know which entries need to be re-run, or the exact wording of the prompt used in the second step.
That last point is similar to the nx monorepo tool, which caches every task (e.g. build) alongside the task's variable inputs, so that if any input changes it knows to re-run the task. In my case, though, I wouldn't want to overwrite the output; I'd want to add it to append-only storage.
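To make the nx comparison concrete, here's a minimal sketch of what I mean by an append-only run log keyed on step inputs + code version + prompt. The field names and `RunLog` class are made up for illustration, not from any existing tool:

```python
import hashlib
import json
import time

def fingerprint(obj):
    # Stable hash of any JSON-serialisable input.
    return hashlib.sha256(json.dumps(obj, sort_keys=True).encode()).hexdigest()[:12]

class RunLog:
    def __init__(self):
        self.records = []  # append-only; nothing is ever overwritten

    def record(self, step, inputs, output, code_version, prompt=None):
        # Store the output together with everything that produced it.
        self.records.append({
            "step": step,
            "input_hash": fingerprint(inputs),
            "code_version": code_version,
            "prompt_hash": fingerprint(prompt) if prompt else None,
            "output": output,
            "ts": time.time(),
        })

    def needs_rerun(self, step, inputs, code_version, prompt=None):
        # True if no prior run matches the current inputs + code + prompt,
        # i.e. the nx-style cache key misses.
        key = (step, fingerprint(inputs), code_version,
               fingerprint(prompt) if prompt else None)
        return not any(
            (r["step"], r["input_hash"], r["code_version"], r["prompt_hash"]) == key
            for r in self.records
        )
```

Bumping `code_version` (or editing the prompt) changes the key, so old entries stay in the history but show up as stale.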
The last requirement would be to react to changes in outputs. E.g. the output of step 2 is a list of radio shows; if the same step is re-run and the output is different, I want to say declaratively what should happen, e.g. if a release identified in an earlier run has been dropped, then delete the playlist & cron job.
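One way to picture that last requirement: diff the previous and current outputs and dispatch registered handlers. The `on_added`/`on_removed` callback names are invented for this sketch; a real tool would presumably let you declare these reactions in config:

```python
def diff_outputs(previous, current):
    # Set difference between two runs' outputs (e.g. lists of show names).
    prev, curr = set(previous), set(current)
    return {"added": sorted(curr - prev), "removed": sorted(prev - curr)}

def react(previous, current, on_added, on_removed):
    # Apply the declared reactions to whatever changed between runs.
    changes = diff_outputs(previous, current)
    for show in changes["added"]:
        on_added(show)      # e.g. create playlist + cron job
    for show in changes["removed"]:
        on_removed(show)    # e.g. delete playlist + cron job
    return changes
```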