# prefect-community
j
Hi - I'm a data scientist for the US Geological Survey. I run expensive deep learning training tasks in pipelines. I've been using Snakemake for a while now, but Prefect has caught my attention as a potentially more cloud-friendly alternative. I have a few questions about Prefect, but maybe my biggest is about dependency tracking. If an upstream dependency changes in a Prefect Flow, will a downstream cached task rerun? (I especially need the caching/persisting of task results because of how expensive the training steps are in my pipelines.) For example, Snakemake reruns downstream tasks if the timestamp on an upstream file is more recent than the downstream file (Snakemake tracks everything as files on disk).
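The timestamp rule described in this message can be pictured with a short sketch. It only illustrates the comparison as described here, not Snakemake's actual implementation; the function name `needs_rerun` and the file names are hypothetical.

```python
import os
import tempfile


def needs_rerun(upstream_path: str, downstream_path: str) -> bool:
    """Return True if the downstream file is missing or older than the upstream file."""
    if not os.path.exists(downstream_path):
        return True
    return os.path.getmtime(upstream_path) > os.path.getmtime(downstream_path)


# Demo with two temporary files and explicitly set modification times.
demo_dir = tempfile.mkdtemp()
raw_file = os.path.join(demo_dir, "raw.csv")      # hypothetical upstream file
model_file = os.path.join(demo_dir, "model.bin")  # hypothetical downstream file
open(raw_file, "w").close()
open(model_file, "w").close()

os.utime(raw_file, (100, 100))    # upstream last modified at t=100
os.utime(model_file, (200, 200))  # downstream produced later, at t=200
up_to_date = needs_rerun(raw_file, model_file)

os.utime(raw_file, (300, 300))    # upstream changes after the downstream was built
stale = needs_rerun(raw_file, model_file)
```

Here `up_to_date` is `False` and `stale` is `True`: the downstream step is redone only once the upstream file becomes newer.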
i
Hi @jeff sadler, I am not familiar with Snakemake, but Prefect uses targets. See this: https://docs.prefect.io/core/concepts/persistence.html#output-caching-based-on-a-file-target
c
Hi @jeff sadler and welcome! All of Prefect caching is highly configurable and based on one of the following mechanisms:
- duration since the last run
- inputs / parameter checks (same inputs => same output)
- presence of a (possibly cloud-based) file. The filename in this scenario can be templated based on many things such as task name / timestamp / task slug.

You can pick and choose which of these to apply on a per-task basis. Prefect doesn’t natively track code changes on your behalf and doesn’t currently perform any reasoning based on file timestamps, but it’s an interesting idea.
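The third mechanism (skip the work when a templated file already exists) can be sketched in plain Python. This is a standard-library illustration of the idea under stated assumptions, not Prefect's API: `run_with_target`, the JSON storage format, and the context keys `task_name` and `today` are all hypothetical choices made for the example.

```python
import json
import os
import tempfile
from datetime import date


def run_with_target(fn, target_template, context, directory):
    """Run fn() unless a file at the templated target already exists.

    Returns a (result, was_cached) tuple.
    """
    target = os.path.join(directory, target_template.format(**context))
    if os.path.exists(target):
        # Target file is present: return what is in it instead of recomputing.
        with open(target) as f:
            return json.load(f), True
    result = fn()
    with open(target, "w") as f:
        json.dump(result, f)
    return result, False


calls = []


def train():
    """Stand-in for an expensive training step."""
    calls.append(1)
    return {"loss": 0.25}


workdir = tempfile.mkdtemp()
context = {"task_name": "train", "today": date.today().isoformat()}
first = run_with_target(train, "{task_name}-{today}.json", context, workdir)
second = run_with_target(train, "{task_name}-{today}.json", context, workdir)
```

The second call finds the file written by the first and never calls `train` again. Note that only the file's existence is checked, which matches the point made later in the thread that file contents and timestamps are not inspected.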
j
Okay. Thanks, @Chris White. That's helpful to know. For the record, that kind of tracking would be very helpful for my use cases. I have a bit of a hard time picturing how it would work in a Prefect Flow, though, since dependencies aren't explicitly indicated as they are in Snakemake or other tools like DVC or Remake.
c
When you say “dependencies aren’t explicitly indicated”, do you mean specifically cache dependencies or something else?
j
I'm thinking specifically about my case. At the top of my pipeline are files on disk (or cloud) that have raw data. Those files get read and processed and are passed as Python objects that Prefect can track. If I understand Prefect correctly, it currently doesn't keep track of the raw data files and therefore it can't know to rerun the pipeline if those change.
That's what I meant by "dependencies aren't explicitly indicated." I can't tell Prefect to keep track of and detect changes in those most upstream raw files.
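One way to picture what tracking the raw files could look like: compute a content hash of each raw file and pass it along as an ordinary input, so that an input-based cache sees a different input whenever the file changes. The sketch below uses only the standard library and is not a Prefect feature; `file_fingerprint`, `cached_by_inputs`, and `preprocess` are hypothetical names, and the in-memory dictionary stands in for whatever cache a real pipeline would use.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


_cache = {}


def cached_by_inputs(fn):
    """Toy input-based cache: same positional inputs => stored output is reused."""
    def wrapper(*args):
        key = (fn.__name__, args)
        if key not in _cache:
            _cache[key] = fn(*args)
        return _cache[key]
    return wrapper


runs = []


@cached_by_inputs
def preprocess(path, fingerprint):
    """Stand-in for reading and processing a raw data file."""
    runs.append(fingerprint)
    with open(path) as f:
        return f.read().upper()


raw = os.path.join(tempfile.mkdtemp(), "raw.txt")
with open(raw, "w") as f:
    f.write("v1")
a = preprocess(raw, file_fingerprint(raw))  # first run, computed
b = preprocess(raw, file_fingerprint(raw))  # same contents, served from cache

with open(raw, "w") as f:
    f.write("v2")
c = preprocess(raw, file_fingerprint(raw))  # contents changed, recomputed
```

Hashing contents rather than comparing timestamps means a file that is rewritten with identical data does not trigger a rerun, at the cost of reading every raw file on each run.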
c
Gotcha, I see; yeah, Prefect templating can go a long way, but Prefect doesn’t inspect the contents of the files other than checking for their existence and returning what’s in them. If you’re interested in opening a feature request for what you’re looking for, we can definitely look into it!