# ask-community
h
Short and sweet, I hope. It isn’t clear to me from the docs where cached output data is persisted when using the Cloud backend. I know that locally it’s stored in memory, and that you cannot use the cache locally unless you make use of a backend API, but does this mean that the output data is actually stored in the backend (Cloud or Core server)? If so, does this not violate the principle that no flow data should be stored on the API side?
e
Actual task outputs are stored somewhere on your infra. API side only stores where that somewhere is. Ex: task output stored as pickle in S3, API stores the s3 path:
<s3://your_bucket/your/path/to/pickle>
Check out results for more details: https://docs.prefect.io/core/concepts/results.html#results
h
But from what I can see you are supposedly able to use the caching functionality without setting any Result subclass on the task, so how does it know where to store this?
e
I don't think you can 😅.
The caching docs say that the default cache is the `prefect.context` object, which is in memory and therefore short-lived. The persisting output section notes that you need a `Result` object to explicitly specify where and how your data will be stored.
h
It says that the cache is stored in context when running Prefect Core locally. What about when registering and running against the backend? That’s what’s not clear. The way the documentation is laid out seems to imply that caching and persisting output are two different things. In fact, the example given makes no mention of Results at all when demonstrating output caching. https://docs.prefect.io/core/concepts/persistence.html#output-caching
e
I see, the docs really are ambiguous in that sense. I've done some test runs on Prefect Server, and caching with only a `cache_for` and a `cache_key` isn't enough. The first run notes that the cache is invalid and runs the tasks normally. Subsequent runs mark the task as cached, meaning a cache entry was found, but pass `None` to downstream tasks, failing the flow. Adding a `result` parameter, specifically a `LocalResult` object, made the non-cached task run persist its output, and subsequent runs used the cached value successfully.
Here is what I think is going on: a `Result` merely exists as a way to persist task outputs somewhere. It does not have to be involved with caching. For Prefect Core runs, the cache is simply `prefect.context`; `Result` configurations aren't involved at all. For Server/Cloud runs, the cache is stored on the API side, but the cached value is not the data itself; it is the `Result` object's location. If the server API determines that a task has a valid cache, the cached location is used, alongside your `Result` configuration, to retrieve the actual value of your data.
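The behaviour described above can be modelled with a small standard-library sketch. This is not Prefect's implementation: `LocalPickleStore`, `ApiCache`, `run_task`, and `demo` are hypothetical names invented for illustration. The sketch only shows the idea that the API side keeps a location plus an expiry, while the data itself is written and read by a store in your own environment, and that a cache hit without a store has nothing to hand downstream.

```python
import pickle
import tempfile
import time
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Tuple


class LocalPickleStore:
    """Hypothetical stand-in for a Result: writes and reads data on your side."""

    def __init__(self, directory: str) -> None:
        self.directory = Path(directory)

    def write(self, key: str, value: Any) -> str:
        path = self.directory / f"{key}.pickle"
        path.write_bytes(pickle.dumps(value))
        return str(path)  # only this location is reported to the API

    def read(self, location: str) -> Any:
        return pickle.loads(Path(location).read_bytes())


class ApiCache:
    """Hypothetical stand-in for the API side: stores locations, never data."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[Optional[str], float]] = {}

    def put(self, key: str, location: Optional[str], ttl_seconds: float) -> None:
        self._entries[key] = (location, time.time() + ttl_seconds)

    def lookup(self, key: str) -> Tuple[bool, Optional[str]]:
        entry = self._entries.get(key)
        if entry is None or entry[1] < time.time():
            return False, None  # no entry, or the entry has expired
        return True, entry[0]


def run_task(
    fn: Callable[[], Any],
    key: str,
    api: ApiCache,
    store: Optional[LocalPickleStore] = None,
    ttl_seconds: float = 3600.0,
) -> Any:
    hit, location = api.lookup(key)
    if hit:
        # A hit only yields a value if there is a store and a recorded location.
        if store is not None and location is not None:
            return store.read(location)
        return None
    value = fn()
    location = store.write(key, value) if store is not None else None
    api.put(key, location, ttl_seconds)
    return value


def demo() -> Tuple[Any, Any, Any, int]:
    calls = []

    def expensive() -> Dict[str, int]:
        calls.append(1)
        return {"rows": 3}

    api = ApiCache()
    with tempfile.TemporaryDirectory() as tmp:
        store = LocalPickleStore(tmp)
        first = run_task(expensive, "with-store", api, store)
        second = run_task(expensive, "with-store", api, store)  # read from disk
    run_task(expensive, "no-store", api)  # computed, but nothing persisted
    no_store_second = run_task(expensive, "no-store", api)  # cache hit, no data
    return first, second, no_store_second, len(calls)
```

In this model the second call with a store is served from the pickle file, while the second call without a store is treated as cached yet returns `None`, which mirrors the failing downstream tasks seen in the test runs.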
z
You're on the right track! `Results` bridge the gap between the API and your runtime environment. Since the API is designed to maintain separation from your data, we need a way to tell the API where the data is stored in your own infrastructure.
The output caching without a result (per the linked doc) is all within a single flow (because it is stored in memory, it is not persisted). This is helpful if a single task may be called multiple times in the same flow.
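A minimal sketch of why an in-memory cache only helps within one run, assuming a fresh context is created for every run. `RunContext`, `cached_call`, and `count_computations` are hypothetical names for illustration and are not part of Prefect.

```python
from typing import Any, Callable, Dict


class RunContext:
    """Hypothetical per-run context; its cache lives only as long as the run."""

    def __init__(self) -> None:
        self.caches: Dict[str, Any] = {}


def cached_call(ctx: RunContext, key: str, fn: Callable[[], Any]) -> Any:
    # Compute once per context, then reuse the in-memory value.
    if key not in ctx.caches:
        ctx.caches[key] = fn()
    return ctx.caches[key]


def count_computations(runs: int, calls_per_run: int) -> int:
    computed = 0

    def expensive() -> int:
        nonlocal computed
        computed += 1
        return 42

    for _ in range(runs):
        ctx = RunContext()  # a new run starts with an empty cache
        for _ in range(calls_per_run):
            cached_call(ctx, "expensive", expensive)
    return computed
```

Calling the same task several times in one run computes it once, but every new run recomputes it because nothing was persisted outside memory.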
j
@Wai Kiat Tan
h
@emre thank you for that exploration and summary - very helpful. @Zanie thank you also. Is the output caching also used to retry flow runs?
z
When you use a result / checkpointing, the persisted outputs can be used for retries.
Without a result type, the output cache is ephemeral.