# prefect-community
j
What should be the expected behaviour if I do the following:
1. Run a flow with caching enabled on the tasks
2. Delete the cache files
3. Rerun the flow

This would be using the built-in `task_input_hash` as the `cache_key_fn`. On step 3 I’m seeing the flow fail because it can’t find the cache file…
c
I could be mistaken here, but I believe the input hash is stored with the state of the task in the cloud/DB as well, so deleting anything local has no real bearing
but based on this, what is the goal? is it to cache the task, or not cache the task?
j
I definitely want to cache the task, but this implies that I have to keep my cache files around indefinitely.
c
there is a cache expiration; how long do they need to be kept for?
> When each task run requested to enter a Running state, it provided its cache key computed from the `cache_key_fn`. The Orion backend identified that there was a COMPLETED state associated with this key and instructed the run to immediately enter the same COMPLETED state, including the same return values.
>
> Task results are cached in memory during a flow run and persisted to the location specified by the `PREFECT_LOCAL_STORAGE_PATH` setting. As a result, task caching between flow runs is currently limited to flow runs with access to that local storage path.
what cache files are you deleting?
j
This is just me testing locally. In Prefect 1 I would store results in an S3 bucket with a short lifecycle (~2 days), since they weren’t really relevant past that. Now that remote storage for results is available I was hoping to get back to that. I guess to replicate that I have to set a cache_expiration as well?
I guess my expectation would be that it would fall back to re-running the task if for whatever reason the cache was unavailable.
c
results aren’t quite the same as caching
persisting the output of a function would be results, while caching the task just says “I already did the thing, I don’t want to redo the thing”
j
Not sure I fully understand the nuance in that distinction, since caching relies on results being persisted.
c
The distinction is that caching a task without results means just the state is returned: “I already downloaded that file, and the inputs have not changed, so the state was cached. We will skip the task because it has already been completed and the state shows cached”
which is separate from “I need to run some computationally expensive operation and save the output” - in this case, there would be a cache on the inputs, and IF the inputs have not changed, the persisted results would be re-used (the actual value / output of the task)
I’d treat the cache as just a returned state that avoids re-running an operation, which in turn is useful for results
j
Got it. I guess back to the original question though, it seems like I have to set an explicit cache expiry on the task if I want to be able to delete the results files at some point and re-run the flow/task with the same parameters?
c
yep
Basically cache invalidation, which is certainly a difficult thing to do successfully in general
j
Follow-up question that I’ve been mulling over: in this scheme, is there any way to manually invalidate the cache short of doing a deployment? Take the following scenario:
• Two flows A and B that are linked via an orchestrator flow (B depends on the results from A) and run hourly
• All the tasks in A and B are cached with a 1-week expiry
• We find out that a bug in A has been producing the wrong results; the bug gets fixed and all the data in A is corrected
• We need to re-run B, since it’s been producing bad data this entire time, ignoring any cache that may exist
c
I’ll have to check with the team. I would expect that if the inputs have changed as a result of the bad data, you aren’t using the same inputs and consequently the cache wouldn’t be hit, but I can check whether there is a way to manually invalidate the cache
j
In this case the inputs are disconnected, since the flows are not directly linked except through the orchestrator. Imagine the inputs being keys in S3, where the keys have not changed, but the underlying data in those buckets has, since it was produced by flow A
c
Could you set the cache key (in place of `task_input_hash`) to the hash of the output data in the S3 bucket?
as the data itself changes (since that’s what you need to not be cached), your input hash would be different, invalidating the previous cache entry
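One hedged sketch of that idea: a custom `cache_key_fn` (which in Prefect 2 receives the task run context and a dict of parameters) that hashes the object’s *contents* rather than its key. `read_object_bytes` is a hypothetical stand-in for however you actually fetch the object (e.g. via boto3, or by comparing S3 ETags); here it returns fixed local bytes so the example is self-contained:

```python
import hashlib

def read_object_bytes(bucket: str, key: str) -> bytes:
    # Hypothetical stand-in: in practice this would fetch the object
    # from S3 (e.g. boto3 get_object). Fixed bytes keep the sketch runnable.
    return b"example object contents"

def content_cache_key(context, parameters) -> str:
    # Hash the data *behind* the S3 key, not the key string itself,
    # so a change in the underlying object invalidates the cache.
    data = read_object_bytes(parameters["bucket"], parameters["key"])
    return hashlib.sha256(data).hexdigest()

# Usage (assumes Prefect 2):
# @task(cache_key_fn=content_cache_key)
# def process(bucket: str, key: str): ...
```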
j
Hmm. I hadn’t considered that but I think it could work. Let me play around with it for a bit.