    Hawkar Mahmod

    1 year ago
    Short and sweet, I hope. It isn’t clear to me from the docs where cached output data is persisted when using the Cloud backend. I know that locally it’s stored in memory, and that you cannot use the cache locally unless you make use of a backend API, but does this mean that the output data is actually stored in the backend (Cloud or Core server)? If so, does this not violate the principle that no flow data should be stored on the API side?
    emre

    1 year ago
    Actual task outputs are stored somewhere on your infra. The API side only stores where that somewhere is. For example, if a task output is stored as a pickle in S3, the API stores the S3 path:
    <s3://your_bucket/your/path/to/pickle>
    Check out the results docs for more details: https://docs.prefect.io/core/concepts/results.html#results
    Hawkar Mahmod

    1 year ago
    But from what I can see you are supposedly able to use the caching functionality without setting any Result subclass on the task, so how does it know where to store this?
    emre

    1 year ago
    I don't think you can 😅.
    The caching docs say that the default cache is the prefect.context object, which is in memory and therefore short-lived. The persisting output section notes that you need a Result object to explicitly specify where and how your data will be stored.
    Hawkar Mahmod

    1 year ago
    It says that the cache is stored in context when running Prefect Core locally. What about when registering and running against the backend? That’s what’s not clear. The way the documentation is laid out seems to imply that caching and persisting output are two different things. In fact, the example given makes no mention of Results at all when demonstrating output caching: https://docs.prefect.io/core/concepts/persistence.html#output-caching
    emre

    1 year ago
    I see, the docs really are ambiguous in that sense. I've done some test runs on Prefect Server, and caching with only a cache_for and cache_key isn't enough. The first run notes that the cache is invalid and runs the tasks normally. Subsequent runs mark the task as cached, meaning a cache has been found, but pass None to downstream tasks, failing the flow. Adding a result parameter, specifically a LocalResult object, made the non-cached task run persist its output, and subsequent runs used the cached value successfully.
    Here is what I think is going on: a Result merely exists as a way to persist task outputs somewhere. It does not have to be involved with caching. For Prefect Core runs, the cache is simply prefect.context. Result configurations aren't involved here at all.
    For Server/Cloud runs, the cache is stored on the API side. But the cached value is not the data itself; it is the Result object's location. If the Server API determines that a task has a valid cache, the cached location is used, alongside your Result configuration, to retrieve the actual value of your data.
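For anyone reading along later, here is a minimal sketch of the behaviour described above. It is a toy model using only the Python standard library, not Prefect's actual implementation: ToyApi, ToyLocalResult, run_task and demo are hypothetical names made up for illustration. The assumption it encodes is that the API side keeps only a cache entry and a location string, while the value itself can only be read back through a Result-like object.

```python
import pickle
import tempfile
from pathlib import Path


class ToyLocalResult:
    """Hypothetical stand-in for a Result: it writes and reads task outputs."""

    def __init__(self, directory):
        self.directory = Path(directory)

    def write(self, key, value):
        path = self.directory / f"{key}.pickle"
        path.write_bytes(pickle.dumps(value))
        return str(path)  # only this location string is handed to the API

    def read(self, location):
        return pickle.loads(Path(location).read_bytes())


class ToyApi:
    """Hypothetical stand-in for the backend: holds cache metadata, never data."""

    def __init__(self):
        self.cached_locations = {}  # cache key -> location string, or None


def run_task(api, cache_key, fn, result=None):
    """Run one toy task and return (state, value)."""
    if cache_key in api.cached_locations:
        location = api.cached_locations[cache_key]
        # A cache entry exists, but the value is recoverable only if a
        # location was recorded and a result object can read it back.
        value = result.read(location) if (result and location) else None
        return "Cached", value
    value = fn()
    location = result.write(cache_key, value) if result else None
    api.cached_locations[cache_key] = location
    return "Success", value


def demo():
    """Run the same task twice, first without and then with a result."""
    api = ToyApi()
    with tempfile.TemporaryDirectory() as tmp:
        without = [run_task(api, "no-result", lambda: 42) for _ in range(2)]
        res = ToyLocalResult(tmp)
        with_res = [run_task(api, "with-result", lambda: 42, result=res)
                    for _ in range(2)]
    return without, with_res
```

In this model, the second run without a result is marked as cached but has no value to hand downstream, which mirrors the None seen in the test runs. With a result attached, the second run reads the value back from the recorded location.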
    Michael Adkins

    1 year ago
    You're on the right track! Results bridge the gap between the API and your runtime environment. Since the API is designed to maintain separation from your data, we need a way to tell the API where the data is stored in your own infrastructure.
    The output caching without a result (per the linked doc) is all within a single flow (because it is stored in memory, it is not persisted). This is helpful if a single task may be called multiple times in the same flow.
    Jeremy Tee

    1 year ago
    @Wai Kiat Tan
    Hawkar Mahmod

    1 year ago
    @emre thank you for that exploration and summary - very helpful. @Michael Adkins thank you also. Is the output caching also used to retry flow runs?
    Michael Adkins

    1 year ago
    When you use a result / checkpointing, the persisted outputs can be used for retries.
    Without a result type, the output cache is ephemeral.