# prefect-community
n
Hopefully someone can give me a light thump on the head to show me the error of my thinking, or teach me the magic Prefect incantation to simplify how I'm trying to solve the data science workflow. I have a pipeline consisting of Query, Prepare, Train, and Evaluate tasks. Each of them is time-consuming and depends on the output from the previous task. I can turn on checkpointing and caching with a LocalResultHandler.

The developer process we used previously kept a dependency tree of which Python file and source file(s) were used to generate the next file. If you deleted a data file or updated a source file (think Task) and ran the pipeline, it would figure out which tasks needed to be run. Let's say you updated the code in the Train task: the cached output of Prepare would be used with the new Train task code to generate a model that was then evaluated by the Evaluate task. I think of it as doing a reverse lookup on the DAG to find the earliest upstream task that needs to be run. Everything downstream of any task that runs is also rerun, and its cached results aren't used. As I'm currently using Prefect, if the model result is missing the Train task will be rerun, but the Evaluate task will find existing cached inputs and blindly use them.

1. Is there some slick way to handle this pattern in Prefect that I just haven't found yet?
2. Currently I have a CacheValidatorFactory that builds the cache function with a list of source files and destination files. The validator checks whether any source file is newer than any destination file, or whether any destination file is missing, to determine if the cache is valid. If a source file is newer or a destination file is missing, the cache is invalid and the task runs.

This may be related to https://github.com/PrefectHQ/prefect/issues/2104

I've also noticed that if the output of a task is cached and used, it still reads all the cached inputs.

1. Is there a way to have a different cache_validator for inputs and results?
2. If the results cache will be used, can we avoid loading the cached inputs?
3. What happens if Prepare runs, Train doesn't, and Evaluate should? Not sure my example makes total sense, but how do we trace back through the DAG to see which upstream results have changed?
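For reference, the make-like staleness check described above can be sketched in plain Python. This is a minimal, hypothetical sketch, not Nate's actual CacheValidatorFactory: `is_stale` and `make_cache_validator` are invented names, and the `(state, inputs, parameters)` validator signature follows the convention used by Prefect core's built-in cache validators.

```python
import os

def is_stale(sources, targets):
    """Make-like check: a cached output is stale if any destination
    file is missing, or any source file is newer than the oldest
    destination file."""
    if not all(os.path.exists(t) for t in targets):
        return True
    oldest_target = min(os.path.getmtime(t) for t in targets)
    newest_source = max(os.path.getmtime(s) for s in sources)
    return newest_source > oldest_target

def make_cache_validator(sources, targets):
    # Hypothetical factory in the spirit of the CacheValidatorFactory
    # above: the cached state is valid only when nothing is stale.
    def validator(state, inputs, parameters):
        return not is_stale(sources, targets)
    return validator
```

Note that this only covers question 2 (per-task invalidation); it doesn't by itself force downstream tasks to rerun, which is the reverse-DAG-lookup part of the question.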
c
Hi Nate, I believe the pattern you are hoping for will be made possible with the release of PIN 16 (https://docs.prefect.io/core/PINs/PIN-16-Results-and-Targets.html), which we are hoping to have out this week 🤞
❤️ 1
👏 1
a
Wow! I was looking for exactly this functionality and ended up building something sloppy of my own
n
Thanks for the update, Chris. Not sure why I hadn't put 2 and 2 together to mentally alias what we are doing with "make"-like semantics. Looking forward to the new functionality.
👍 1