# prefect-community
s
What's the best way to ignore an existing checkpoint and overwrite it?
c
Hi Scott - ignoring and overwriting is the default behavior of checkpoints; perhaps you’re asking about the `target` keyword?
s
Hm, yea I guess I am getting confused about the relationship between the two. I thought that the target was the location name to be used by the Result class when checkpointing. In my case, I created a custom Result subclass that reads/writes to a database, so I expect the target to set the table name used to write to the database. However, when I specify a target and set checkpoint=True, it always reads from the database when the target exists. I'm wondering how I can turn that behavior off so I can rebuild a checkpoint table in the event that I modify the source code of my task, but not the target table name.
c
Yup I understand your confusion - `target` is actually a special keyword that is related to file-based caching (so if the file exists, the task is not rerun); here is some documentation on `target`: https://docs.prefect.io/core/idioms/targets.html (we definitely need to expand our documentation on results and caching!) What you can do instead is:
```python
@task(checkpoint=True, result=MyResultType(location="{same-template-you-used-for-your-target}"))
```
^^ that will still checkpoint your data to a templated location but will not re-use that data on subsequent runs (and will instead overwrite the data)
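To make the distinction concrete, here is a minimal framework-free sketch of the two behaviors; the function names and the dict-backed store are purely illustrative and are not Prefect API:

```python
# Illustrative sketch only: mimics the two persistence behaviors in plain
# Python, with a dict standing in for a file system or database.
store = {}

def run_with_checkpoint(location, compute):
    """checkpoint=True + a result location: always run, always overwrite."""
    value = compute()
    store[location] = value  # any existing data at `location` is replaced
    return value

def run_with_target(location, compute):
    """target=location: if data already exists at the location, skip the run."""
    if location in store:
        return store[location]  # task is not rerun; cached value is reused
    value = compute()
    store[location] = value
    return value

calls = []

def compute():
    calls.append(1)
    return len(calls)

run_with_target("out", compute)      # runs once and stores the value
run_with_target("out", compute)      # skipped: cached value is returned
run_with_checkpoint("out", compute)  # runs again and overwrites the store
```

The target-style function is why Scott's task "always reads from the database when the target exists"; the checkpoint-style function is the overwrite-every-run behavior Chris describes.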
s
Hi Chris, still confused here on terminology. When I'm developing and debugging a single task, I want to call its run method and persist the task's output to a database every time. Moreover, when this task requires the output of a previous task as input, I want to load this input value from the persisted state of the previous task (also stored in the database). Furthermore, given two tasks: 1. `result1 = task1()` 2. `result2 = task2(result1)` It would be nice to be able to run `[task1, task2]` in sequence and regenerate (and persist to the database) both `result1` and `result2`. Alternatively, if I just want to run `task2`, I'd like to be able to read `result1` from its persisted state and run `task2` to generate `result2` without rerunning `task1`. Hope that makes sense. Should I still be using `location` here over `target`? To be clear, I am persisting each task result to a database table using a custom Result subclass similar to the S3 class.
I guess TLDR is that non-skipped tasks would regenerate their persisted outputs, while skipped tasks would return their persisted values when used as inputs to non-skipped tasks.
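That TLDR can be sketched without any framework at all; the names below are invented for illustration and are not Prefect's API:

```python
# Illustrative sketch: "non-skipped tasks regenerate their persisted outputs,
# skipped tasks return their persisted values" from a shared store.
db = {}  # stands in for the database table of persisted results

def task1():
    return 10

def task2(upstream):
    return upstream + 5

def run_pipeline(tasks_to_run):
    """Run only the named tasks; skipped tasks serve their persisted value."""
    if "task1" in tasks_to_run:
        db["task1"] = task1()          # regenerate and persist
    result1 = db["task1"]              # skipped -> read persisted value

    if "task2" in tasks_to_run:
        db["task2"] = task2(result1)   # regenerate and persist
    return db["task2"]

run_pipeline({"task1", "task2"})  # regenerates and persists both results
run_pipeline({"task2"})           # reuses persisted result1, reruns only task2
```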
c
I don’t think I’m completely following, so let me just describe these two kwargs to you:
- `checkpoint=True` + a result `location` template (note you can template these locations based on task inputs / timestamps / etc., which is sometimes useful for “overriding”): every time this task runs, it will store its output data to the provided location. If you ever want to “rehydrate” an upstream task’s state you’ll have to do this manually using the `load_result` method on all `State` objects.
- `target=location_template`: when a `target` is provided, the location template is first checked; if data is present at the location, it is used and the task is not rerun. If no data is present, the task runs and stores its output in the provided location. As before, you can template these locations to provide for some interesting functionality. If you ever want to force a rerun of a task, you’ll need to manually delete the data in the location yourself (this is something we do want to support automatically at some point, but it’s still under discussion)
s
Thanks, this clarifies everything for me. Sounds like I could get what I'm looking for using either persistence method with some customization. Appreciate it!
c
Anytime! Glad I could help 🙂
@Marvin archive “What is the difference between checkpoint and target?”
p
+1 on being able to force the task(s) to ignore (replace) the cached values without having to manually delete the files.
c
Yea this definitely makes sense; it would be very easy to do for all tasks simultaneously via a special context key / value pair that you could set on a per-run basis. Being able to control this on a task-run-by-task-run basis would be trickier.
p
Having this for all tasks would be a good starting point. I agree that it would be more difficult to do it selectively.