hi all, i have a question about whether prefect is...
# ask-community
m
hi all, i have a question about whether prefect is the tool I'm looking for. I'll post the description inside the thread - it's on the longer side, looking forward to hearing your thoughts marvin
My data pipelines are starting to get rather complex and I have an abstract idea about the type of tool I need... but I'm not sure if it exists or what it's called.

I have a process that starts with a Soundcloud Artist ID, then follows these steps:
• scrape 2 pages of latest tracks from the artist
• feed the list of tracks to chatgpt along with a prompt asking it to identify periodic releases (e.g. weekly or monthly radio shows). I ask for the canonical radio show name + regex in the response
• for each identified release, create a new playlist + cron job that continually updates the playlist with the latest releases

I want to be able to store the data after each step, but each step should be re-runnable, so I want to keep a history of each data point alongside other factors, e.g. the version of the transform script (so if I update the code behind one of the steps in future I know which entries need to be re-run) or the exact wording of the prompt used in the second step.

That last point is similar to the nx monorepo tool, where they cache all tasks (e.g. build) alongside the variable inputs for the task, so that if any input changes they know to re-run the task. But in my case, I wouldn't want to overwrite the output -- I would want to add it to append-only storage.

The last requirement would be to react to changes in outputs. E.g. the output for step 2 is a list of radio shows; if the same step is re-run and the output is different, I want to declaratively say what should happen, e.g. if a release identified in an earlier run has been dropped, then delete the playlist & cron job.
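To make the caching requirement concrete, here is a rough sketch of the idea in plain Python, not tied to Prefect or nx. Everything in it is hypothetical: `fingerprint`, `AppendOnlyStore`, `run_step`, and `identify_shows` are illustrative names, the store is an in-memory list standing in for a real append-only table, and the ChatGPT call is replaced by a trivial placeholder. The point is only that the cache key covers the step name, the inputs, the script version, and the prompt wording, and that a re-run appends a new record instead of overwriting the old one.

```python
import hashlib
import json
from dataclasses import dataclass, field


def fingerprint(step, inputs, code_version, params):
    """Hash everything that should trigger a re-run when it changes."""
    payload = json.dumps(
        {"step": step, "inputs": inputs, "code_version": code_version, "params": params},
        sort_keys=True,  # stable key order so equal inputs give equal hashes
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


@dataclass
class AppendOnlyStore:
    """Hypothetical in-memory stand-in for an append-only table or JSONL file."""
    records: list = field(default_factory=list)

    def append(self, key, step, output):
        # Records are only ever added, never updated or deleted.
        self.records.append({"key": key, "step": step, "output": output})

    def latest(self, key):
        for record in reversed(self.records):
            if record["key"] == key:
                return record
        return None


def run_step(store, step, fn, inputs, code_version, params):
    """Run fn only if this exact combination has not been stored before."""
    key = fingerprint(step, inputs, code_version, params)
    cached = store.latest(key)
    if cached is not None:
        return cached["output"], False  # nothing changed, reuse stored output
    output = fn(inputs, params)
    store.append(key, step, output)
    return output, True


def identify_shows(inputs, params):
    # Placeholder for the ChatGPT call: treat the text before " #" as the show name.
    return sorted({title.split(" #")[0] for title in inputs["tracks"]})


store = AppendOnlyStore()
tracks = {"tracks": ["Night Shift #12", "Night Shift #13", "Sunday Mix #4"]}
prompt_v1 = {"prompt": "Find periodic releases"}
prompt_v2 = {"prompt": "Find weekly or monthly shows"}

first, ran_first = run_step(store, "identify_shows", identify_shows, tracks, "v1", prompt_v1)
again, ran_again = run_step(store, "identify_shows", identify_shows, tracks, "v1", prompt_v1)
reworded, ran_reworded = run_step(store, "identify_shows", identify_shows, tracks, "v1", prompt_v2)
```

In this sketch the second call is served from the store, while rewording the prompt produces a second record next to the first, so the history of both runs is kept.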
j
By and large, yeah, I think Prefect can get you there
The hardest bit would be switching logic upon seeing differences
but that's probably just a matter of dispatching to different logic on the input side and possibly doing a bit of introspection on the DB
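One way the "dispatch on differences" part could look, as a minimal sketch in plain Python. This is not a Prefect feature; `diff_outputs`, `reconcile`, and the two handlers are hypothetical names, and the handlers only return a description of what they would do instead of touching real playlists or cron jobs.

```python
def diff_outputs(previous, current):
    """Compare two runs of the same step and report what changed."""
    previous, current = set(previous), set(current)
    return {
        "added": sorted(current - previous),
        "removed": sorted(previous - current),
    }


def reconcile(previous, current, handlers):
    """Call the handler registered for each kind of change."""
    changes = diff_outputs(previous, current)
    actions = []
    for kind in ("added", "removed"):
        for name in changes[kind]:
            actions.append(handlers[kind](name))
    return actions


def create_playlist_and_cron(show):
    # Hypothetical side effect, represented as a string for illustration.
    return f"create playlist + cron job for {show}"


def delete_playlist_and_cron(show):
    # Hypothetical side effect, represented as a string for illustration.
    return f"delete playlist + cron job for {show}"


# The "declarative" part: a mapping from change type to what should happen.
HANDLERS = {
    "added": create_playlist_and_cron,
    "removed": delete_playlist_and_cron,
}

earlier_run = ["Night Shift", "Sunday Mix"]
latest_run = ["Night Shift", "Friday Live"]
actions = reconcile(earlier_run, latest_run, HANDLERS)
```

The previous output would come from whatever history the pipeline keeps (the DB introspection mentioned above); the mapping is the only place that says what each kind of change should trigger.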
e
Hi @Matt Fysh 👋 Prefect definitely sounds like a tool that fits your use case. I like to think of Prefect as “power tools for Python”… the most important thing we provide is observability around your code so you can tell what’s run, what’s failed, and what to do about it. Prefect also has a lot of features that you’ve already seen — like caching and scheduling — that it sounds like you need.
You might also be interested in our LLM framework, marvin, which has a lot of useful features for working with chatgpt
For example:
j
@Emil Christensen - I feel like this could figure more prominently in Prefect marketing tbh, it's a USP compared to a lot of other tooling in this space. My team has inherited a lot of legacy pipelines scattered around multiple repos we weren't really keen to port fully. We just slapped the `@flow`s on them and called it a day, so the scheduling is still managed by the legacy Cron solution, but we got full observability of runs and failures, which is really handy. The same failure that took 40 mins to spot now takes 4 mins at worst
🙌 1
💯 1
e
flex tape
😁 1
@Jan Malek Love to hear that! That’s one of the great benefits of v2… we think of orchestration and observability as a spectrum and hope that Prefect serves both ends (and everything in between). More good stuff on that here.