# prefect-community
j
What should be the expected behaviour if I do the following:
1. Run a flow with caching enabled on the tasks
2. Delete the cache files
3. Rerun the flow

This would be using the built-in `task_input_hash` as the `cache_key_fn`. On step 3 I’m seeing the flow fail because it can’t find the cache file…
c
I could be mistaken here, but I believe the input hash is stored with the state of the task in the cloud/DB as well, so deleting anything local has no real bearing
but based on this, what is the goal? is it to cache the task, or not cache the task?
j
I definitely want to cache the task, but this implies that I have to keep my cache files around indefinitely.
c
there is a cache expiration; how long do they need to be kept for?
> When each task run requested to enter a Running state, it provided its cache key computed from the `cache_key_fn`. The Orion backend identified that there was a COMPLETED state associated with this key and instructed the run to immediately enter the same COMPLETED state, including the same return values.
>
> Task results are cached in memory during a flow run and persisted to the location specified by the `PREFECT_LOCAL_STORAGE_PATH` setting. As a result, task caching between flow runs is currently limited to flow runs with access to that local storage path.
what cache files are you deleting?
j
This is just me testing locally. In Prefect 1 I would store results in an S3 bucket with a short lifecycle (~2 days), since they weren’t really relevant past that. Now that remote storage for results is available I was hoping to get back to that. I guess to replicate that I have to set a cache_expiration as well?
I guess my expectation would be that it would fall back to re-running the task if for whatever reason the cache was unavailable.
c
results aren’t quite the same as caching
persisting the output of a function would be results, while caching the task just says “I already did the thing, I don’t want to redo the thing”
j
Not sure I fully understand the nuance in that distinction, since caching relies on results being persisted.
c
The distinction is that caching a task without results means just the state is returned: “I already downloaded that file, and the inputs have not changed, so the state was cached. We will skip the task because it has already been completed and the state shows cached”
which is separate from “I need to run some computationally expensive operation and save the output” - in this case, there would be a cache on the inputs, and IF the inputs have not changed, the persisted results would be re-used (the actual value / output of the task)
I’d treat the cache as just a returned state that avoids re-running an operation, which in turn is useful for results
j
Got it. I guess back to the original question though, it seems like I have to set an explicit cache expiry on the task if I want to be able to delete the results files at some point and re-run the flow/task with the same parameters?
c
yep
Basically cache invalidation, which is certainly a difficult thing to do successfully in general
j
Follow-up question that I’ve been mulling over: in this scheme, is there any way to manually invalidate the cache short of doing a deployment? Take the following scenario:
• Two flows A and B that are linked via an orchestrator flow (B depends on the results from A) and run hourly
• All the tasks in A and B are cached with a 1-week expiry
• We find out that a bug in A has been producing the wrong results; the bug gets fixed and all the data in A is corrected
• We need to re-run B, since it’s been producing bad data this entire time, ignoring any cache that may exist
c
I’ll have to check with the team. I would expect that if the inputs have changed as a result of the bad data, you aren’t using the same inputs and consequently the cache wouldn’t be hit, but I can check whether there is a way to manually invalidate the cache
j
In this case the inputs are disconnected, since the flows are not directly linked except through the orchestrator. Imagine the inputs being keys in S3, where the keys have not changed, but the underlying data in those buckets has, since it was produced by flow A
c
Could you set the cache key (in place of `task_input_hash`) to the hash of the output data in the S3 bucket?
as the data itself changes (since that’s what you need to not be cached), your input hash would be different, invalidating the previous cache entry
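One hedged sketch of that idea: a custom `cache_key_fn` (which in Prefect 2 receives the task run context and a dict of parameters) that hashes the object’s *contents* rather than its key. `read_object_bytes` is a hypothetical stand-in for however you actually fetch the object (e.g. via boto3, or by comparing S3 ETags); here it returns fixed local bytes so the example is self-contained:

```python
import hashlib

def read_object_bytes(bucket: str, key: str) -> bytes:
    # Hypothetical stand-in: in practice this would fetch the object
    # from S3 (e.g. boto3 get_object). Fixed bytes keep the sketch runnable.
    return b"example object contents"

def content_cache_key(context, parameters) -> str:
    # Hash the data *behind* the S3 key, not the key string itself,
    # so a change in the underlying object invalidates the cache.
    data = read_object_bytes(parameters["bucket"], parameters["key"])
    return hashlib.sha256(data).hexdigest()

# Usage (assumes Prefect 2):
# @task(cache_key_fn=content_cache_key)
# def process(bucket: str, key: str): ...
```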
j
Hmm. I hadn’t considered that but I think it could work. Let me play around with it for a bit.