# ask-community
a
hello! beginner question: in DVC’s pipeline feature, a stage will only be re-executed if something in the stage definition or its dependencies changes (see https://dvc.org/doc/user-guide/project-structure/pipelines-files#stages). this is not default behaviour in prefect, right? how do i do this with prefect?
k
Hey @Alexander Seifert, retries are done at the task level in Prefect. Is this what you are looking for? https://docs.prefect.io/core/concepts/tasks.html#retries
If upstream dependencies fail in Prefect, the task won’t run by default
a
sorry, before editing, the initial post was missing an important word: if something in the stage definition or its dependencies changes.
so no, i’m not looking for retries, more along the lines of caching i guess. if i call the exact same pipeline twice, and the definition + dependencies stayed the same, then nothing should have to be recalculated.
k
Ah ok so for caching, Prefect has two mechanisms. The first is file-based and is called `target`. If a file exists at the target, the task won’t run. The second is caching and consists of `cache_for` and `cache_validator`: you specify the duration for which the cache is valid, and then the cache validator determines whether something has to be re-run (based on inputs, parameters, etc.)
a
alright, thanks! so e.g. for a data preprocessing flow i would maybe have a file-based target that encodes the md5 hash of my raw data, so when the data changes the md5 changes and then there’s no target so the flow runs again
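A minimal sketch of that hash-in-the-target idea, using only the Python standard library rather than Prefect itself. The names `md5_of_file` and `preprocess_if_needed` and the upper-casing "preprocessing" step are hypothetical stand-ins; the point is only that a target path derived from the input's md5 stops existing as soon as the raw data changes.

```python
import hashlib
from pathlib import Path
from typing import Tuple


def md5_of_file(path: Path) -> str:
    """Return the hex MD5 digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def preprocess_if_needed(raw: Path, out_dir: Path) -> Tuple[Path, bool]:
    """Run the step only if no target exists for the current input hash.

    Returns (target_path, ran), where ran is False when the step was skipped.
    """
    # The target name encodes the hash of the raw data (hypothetical naming scheme).
    target = out_dir / f"preprocessed-{md5_of_file(raw)}.txt"
    if target.exists():
        return target, False
    # Stand-in for real preprocessing: upper-case the raw text.
    target.write_text(raw.read_text().upper())
    return target, True
```

Calling `preprocess_if_needed` twice on unchanged data skips the second run; editing the raw file yields a new hash, so no matching target is found and the step runs again.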
k
not a bad idea, seems like that might be used in conjunction with our KV Store to keep track of the hash
a
or a cache_validator that calculates those md5 sums and checks against previously encountered values
k
It just needs to fit under 10KB for the KV Store but you can just persist and update that
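The "check against previously encountered values" idea could look roughly like the sketch below. It is plain Python, not Prefect's `cache_validator` signature or the KV Store API: a small JSON file stands in for the key-value store, and `is_cache_valid` is a hypothetical helper name.

```python
import hashlib
import json
from pathlib import Path


def is_cache_valid(store_path: Path, key: str, data: bytes) -> bool:
    """Compare the MD5 of `data` with the hash last recorded under `key`.

    Returns True when the hashes match (nothing to recompute). Otherwise
    records the new hash and returns False.
    """
    # Load previously seen hashes; an absent file means nothing was seen yet.
    store = json.loads(store_path.read_text()) if store_path.exists() else {}
    current = hashlib.md5(data).hexdigest()
    if store.get(key) == current:
        return True
    # Persist the updated hash so the next identical run is treated as cached.
    store[key] = current
    store_path.write_text(json.dumps(store))
    return False
```

Only one short hex digest is stored per key, so the persisted state stays small.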
a
alright, thanks. but it seems that what i want to do is something that needs to be pieced together manually rather than just working out of the box. just wanted to check that i’m not missing something!
k
Yes we don’t have listeners. Instead, event-based flows are normally triggered by hitting our API from the event.
a
alright, thanks!
b
@Kevin Kho @Alexander Seifert I’d be interested in a feature like this and willing to contribute. @Alexander Seifert this could be implemented as a dask graph optimisation (I’ve done this in the past).
k
Hey Brad, I’m glad you’re interested in contributing. Could you detail your thoughts in a GitHub issue so the core team can see it and discuss? I’d like to learn more about the dask graph optimization
b
Sure thing - I’ll try and whip up a motivating example
Actually - on thinking about this a little more, ignore my suggestion of the dask graph optimisation - that’s too executor-specific. I think this could potentially be accomplished via a FlowRunner subclass. I’m going to have a play around and see if I can make something work
k
Ah that’s true
b
Hey @Kevin Kho @Alexander Seifert I opened https://github.com/PrefectHQ/prefect/discussions/4935 to discuss
And a potential implementation (very WIP) here https://github.com/limx0/caching_flow_runner
k
Can you create a new message on Slack cuz I wanna tag 3 community members interested in this?
b
ya