emre
01/05/2021, 1:44 PM
from datetime import timedelta
from prefect.engine.results import LocalResult

meta_df = SnowflakePandasResultTask(
    db=SNOW_DB,
    checkpoint=True,
    result=LocalResult(dir=".prefect_cache"),
    cache_for=timedelta(days=14),
    cache_key="snow_pandas_out",
)(query=info_query)
This persists files with arbitrary names under .prefect_cache. On every run I get a warning that my cache is no longer valid. Can anyone point me to where I am doing things wrong?
Chris White
01/05/2021, 4:09 PM
With flow.run alone, the storage of all previous cached runs occurs in memory; this means that if you call this from new processes, they have no way of sharing information.
However, there is a relatively simple workaround: all cached states from all tasks are stored in prefect.context.caches, so if you save this after each run and load it before each run, it should start behaving as you expect. Something like:
import cloudpickle
import prefect

# on save (after the flow run)
with open(".prefect_cache/THE_CACHE.pkl", "wb") as f:
    cloudpickle.dump(prefect.context.caches, f)

# on load (before the next flow run)
with open(".prefect_cache/THE_CACHE.pkl", "rb") as f:
    the_cache = cloudpickle.load(f)
prefect.context.update(caches=the_cache)
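Putting it together, a minimal end-to-end sketch of that workaround (the flow name, the get_data task, and its dummy return value are illustrative, not from this thread): load the pickled caches into prefect.context before flow.run, then dump them back out afterwards so a later process can reuse them.
import os
from datetime import timedelta

import cloudpickle
import prefect
from prefect import Flow, task

CACHE_FILE = ".prefect_cache/THE_CACHE.pkl"

@task(cache_for=timedelta(days=14), cache_key="snow_pandas_out")
def get_data():
    # hypothetical stand-in for the Snowflake query task
    return 42

with Flow("cached-snowflake") as flow:
    get_data()

# load previously persisted caches (if any) before running
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE, "rb") as f:
        prefect.context.update(caches=cloudpickle.load(f))

flow.run()

# persist the in-memory caches so the next process can reuse them
os.makedirs(".prefect_cache", exist_ok=True)
with open(CACHE_FILE, "wb") as f:
    cloudpickle.dump(prefect.context.caches, f)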
emre
01/05/2021, 4:41 PM
Chris White
emre
01/05/2021, 5:12 PM