CA Lee
09/21/2020, 1:39 AM
Laura Lorenz
09/21/2020, 2:53 PM
CA Lee
09/22/2020, 1:10 AM
Laura Lorenz
09/22/2020, 1:12 AM
CA Lee
09/22/2020, 1:13 AM
import datetime
from prefect import task

@task(cache_for=datetime.timedelta(days=1))
def get_complaint_data():
    ...  # do something

raw = get_complaint_data()
parsed = parse_complaint_data(raw)
populated_table = store_complaints(parsed)
My question: say some data-fetching step (e.g. a web scraping script) runs on an hourly interval.
Caching would prevent the fetch from running again, but how would I then skip the parsing and populating steps based on the cached state of the fetch step? (It wouldn't make sense to clean or store the same cached data again.)
Laura Lorenz
09/22/2020, 1:32 AM
cache_keys to mark that all of those tasks share the same cache, so they can consider themselves cached as long as that cache key is not invalidated. See https://github.com/PrefectHQ/prefect/blob/master/src/prefect/core/task.py#L156 and the last bullet in https://docs.prefect.io/core/concepts/persistence.html#output-caching (I know the API docs say it is deprecated there, but I'm pretty sure it isn't actually deprecated until https://github.com/PrefectHQ/prefect/issues/2619 is done, at which point you would move that configuration onto the result).
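To make the shared-cache idea concrete, here is a plain-Python toy sketch. It is not Prefect code and does not reflect how Prefect implements caching: `SharedCache`, its `run` method, `run_flow`, and the bodies of the three task functions are all hypothetical names made up for illustration. It only shows the behaviour described above: every task that shares the cache is skipped while the cache is still valid.

```python
import datetime
from typing import Any, Callable, Dict, List, Optional


class SharedCache:
    """Hypothetical stand-in for one cache shared by several tasks."""

    def __init__(self, cache_for: datetime.timedelta) -> None:
        self.cache_for = cache_for
        self.expires_at: Optional[datetime.datetime] = None
        self.results: Dict[str, Any] = {}

    def is_valid(self, now: datetime.datetime) -> bool:
        return self.expires_at is not None and now < self.expires_at

    def run(self, name: str, fn: Callable[..., Any], *args: Any,
            now: datetime.datetime) -> Any:
        if not self.is_valid(now):
            # Expired or never filled: drop old results and restart the clock.
            self.results.clear()
            self.expires_at = now + self.cache_for
        if name in self.results:
            # Cache still valid and this task already ran: skip the work.
            return self.results[name]
        self.results[name] = fn(*args)
        return self.results[name]


calls: List[str] = []  # records which task bodies actually executed


def get_complaint_data() -> List[str]:
    calls.append("fetch")
    return ["raw complaint"]


def parse_complaint_data(raw: List[str]) -> List[str]:
    calls.append("parse")
    return [item.upper() for item in raw]


def store_complaints(parsed: List[str]) -> int:
    calls.append("store")
    return len(parsed)


def run_flow(cache: SharedCache, now: datetime.datetime) -> int:
    raw = cache.run("fetch", get_complaint_data, now=now)
    parsed = cache.run("parse", parse_complaint_data, raw, now=now)
    return cache.run("store", store_complaints, parsed, now=now)
```

Calling `run_flow` hourly with the same `SharedCache(datetime.timedelta(days=1))` would run all three task bodies on the first call and none of them again until the day has elapsed. In Prefect itself the equivalent configuration would go on the task decorators, not in a helper class like this one.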
You could also use a custom trigger (https://docs.prefect.io/api/latest/triggers.html#triggers), since all triggers receive their upstream dependencies' edges and states (https://github.com/PrefectHQ/prefect/blob/b9914890dfec52610a42cd694427badafab8c8ba/src/prefect/triggers.py#L174). Depending on how many other dependencies those tasks have, though, it could get quite tricky, and afaik we don't have a published example that inspects specific upstream tasks to decide a trigger. It should be possible; we just don't have any examples, so you'd have to reverse engineer it a bit 🙂
CA Lee
09/22/2020, 12:53 PM