# prefect-community
Hi team, I'm attempting to debug a Flow locally that maps over a large list of objects. A single entry in that list fails to process in a given Task, and it happens to be about 3/4 of the way through the list, so iterating on logic to correct the data is time-intensive: the pipeline needs to re-compute everything before reaching the failure point again. The way I've debugged these types of problems in other tools (e.g. Luigi) relied on a concept of persistent output caching, where you could specify how long to persist the output of a given step in the pipeline to disk. That way, when you re-run the entire pipeline locally while debugging, it simply reads in the cached data rather than re-computing each step that previously completed successfully. It was also nice in production, since it allowed commonly used tasks to share their output with other pipelines without needing to be re-computed. It seems Prefect has a concept of output caching, but for local runs the cache is only stored in memory, which is useless for this use case of iterating on logic changes and re-running the entire pipeline. https://docs.prefect.io/core/concepts/persistence.html#output-caching There's mention in this Slack channel to 'use Prefect Cloud', but I cannot find any tutorials or examples of how to accomplish this, so I'm looking for guidance. How would you use output caching in Prefect Cloud to speed up the debugging iteration process for a local Flow?
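(For anyone reading this thread later: the Luigi-style behavior described above can be approximated outside of any framework with a small disk-cache decorator. This is a hypothetical, framework-agnostic sketch, not Prefect's API; all helper names here are made up.)

```python
import functools
import hashlib
import os
import pickle
import tempfile
import time

def disk_cache(cache_dir, max_age_seconds=3600):
    """Persist a function's output to disk and reuse it on re-runs.

    Hypothetical helper illustrating Luigi-style step persistence:
    results younger than max_age_seconds are read back from disk
    instead of being re-computed.
    """
    os.makedirs(cache_dir, exist_ok=True)

    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # Key the cache entry on the function name and its arguments
            key = hashlib.sha256(
                pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
            ).hexdigest()
            path = os.path.join(cache_dir, key + ".pkl")
            if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age_seconds:
                with open(path, "rb") as f:
                    return pickle.load(f)  # cache hit: skip re-computation
            result = fn(*args, **kwargs)
            with open(path, "wb") as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator

calls = {"n": 0}

@disk_cache(cache_dir=tempfile.mkdtemp(), max_age_seconds=60)
def expensive_step(x):
    calls["n"] += 1
    return x * 2

expensive_step(21)  # computed and written to disk
expensive_step(21)  # read back from disk; the body is not re-run
```

With this in place, re-running the whole script only recomputes steps whose cached files are stale or missing, which is the iteration loop being asked about.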
Hi Scott! You can checkpoint results for debugging using result handlers (https://docs.prefect.io/core/advanced_tutorials/using-result-handlers.html). As of right now there isn’t a first-class way to persist the cache outside of the python process without using Cloud. (We happen to be working on it for Core now; you might be interested in the PIN, which is here: https://docs.prefect.io/core/PINs/PIN-16-Results-and-Targets.html) If you are interested in trying out Cloud, step one would be to make a free tier account at https://cloud.prefect.io/ and then follow the deployment tutorial at https://docs.prefect.io/cloud/tutorial/configure.html#log-in-to-prefect-cloud
Thanks @Laura Lorenz (she/her). We have a paid version of Cloud, but I'm still not clear on how one would go about deploying a Flow to Cloud with caching enabled, and then using an IDE to load the process into a debugger where we can troubleshoot why the particular data point is failing 3/4 of the way through a mapped Task. When running the Flow locally, I can do this pretty easily in PyCharm, but I would like some guidance on how to debug this using the cached output from Cloud. Or would a custom result handler allow me to preserve the results on my local machine to debug?
Gotcha, I see. I believe that to have the flow connected to the persisted Cloud cache, it needs to go through the CloudFlowRunner. You may be able to use the tips in https://docs.prefect.io/core/advanced_tutorials/local-debugging.html#use-a-flowrunner-for-stateless-execution but with a CloudFlowRunner instead, and attach your debugger to whatever you use to execute it. Full disclosure, I haven’t actually done that before, but as long as your logic changes do not invalidate the cloud cache (which, by my reading, would happen if you re-register the flow or replace a task in the flow with flow.replace), it can still use the cloud cache.
On the point of result handlers: I mentioned them for the situation where you want to see or interact with the results after a flow run. As a source for the cache, though, data written by result handlers does NOT automatically get picked up between flow runs, so to feed it in as the next run's cache I believe you would have to construct and/or save the state objects from a prior flow run and pass them into a new instance of FlowRunner (https://github.com/PrefectHQ/prefect/blob/84fc91b0d576a5036a3a2f379c3f8b86d90f8e86/src/prefect/engine/flow_runner.py#L189)
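The idea of seeding a fresh run with a prior run's states can be illustrated with a framework-free sketch. The names here are made up, not Prefect's API; see FlowRunner.run(task_states=...) in the linked source for the real interface.

```python
# Hypothetical mini-runner illustrating the idea: tasks whose prior
# results are supplied up front are skipped, so only the step under
# debugging is re-computed. NOT Prefect's API.

def run_pipeline(tasks, task_states=None):
    """tasks: ordered list of (name, fn) pairs, each fn taking the
    previous task's output. task_states maps task name -> prior result."""
    task_states = dict(task_states or {})
    value = None
    executed = []
    for name, fn in tasks:
        if name in task_states:
            value = task_states[name]  # reuse persisted result, skip fn
            continue
        value = fn(value)
        executed.append(name)
        task_states[name] = value
    return value, executed

tasks = [
    ("extract", lambda _: list(range(10))),
    ("transform", lambda xs: [x * 2 for x in xs]),
    ("load", lambda xs: sum(xs)),
]

# First run computes everything; a re-run seeded with saved states
# only executes the step that still needs debugging.
_, first = run_pipeline(tasks)
result, second = run_pipeline(
    tasks,
    task_states={
        "extract": list(range(10)),
        "transform": [x * 2 for x in range(10)],
    },
)
```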
^ sorry, I put a “do” instead of a “do NOT” in a prime place there haha if you read that before I edited it