# best-practices
p
I have a pipeline that takes a list of inputs and runs a model on each input. The inputs are handled by a top-level flow that batches them (for performance reasons) and passes them to sub-deployments. Some inputs may already have been processed previously, and I'd like to just skip those. I was thinking of using the built-in Prefect caching for that, but the batching gets in the way (a batch is often a random mix of processed and unprocessed entries). I was thinking of hooking directly into the caching mechanism and handling the caching manually at the input level. Would that make sense? Or maybe there's some clever trick with the built-in caching I could use.
j
Can the first flow just figure out which are already processed and not dispatch them at all?
p
Yes, that's the plan. I was just wondering if there's a way to use the caching to figure out which ones have already been run. I wanted to avoid checking outputs, since that's awkward to implement: I'd need to deduce file names from the inputs. I'm looking for a more black-box approach.
j
If you cache the result of checking "is this file already processed", then after you process the file that cached result is stale and wrong. It really sounds like you should use your own caching for this rather than hook into Prefect's.
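A minimal sketch of the manual approach j suggests, using only the standard library (no Prefect internals). All names here are hypothetical; it assumes each input can be serialized to JSON so it gets a stable hash key, and the top-level flow filters inputs against a small registry before batching:

```python
import hashlib
import json
from pathlib import Path


class ProcessedRegistry:
    """Tracks which inputs have already been processed, keyed by a
    stable hash of the input itself (not of any output file name)."""

    def __init__(self, path="processed_keys.json"):
        self.path = Path(path)
        if self.path.exists():
            self._keys = set(json.loads(self.path.read_text()))
        else:
            self._keys = set()

    @staticmethod
    def key(item) -> str:
        # Hash the input's canonical JSON form, so the key does not
        # depend on dict ordering or on where results are written.
        blob = json.dumps(item, sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()

    def is_done(self, item) -> bool:
        return self.key(item) in self._keys

    def mark_done(self, item) -> None:
        self._keys.add(self.key(item))
        self.path.write_text(json.dumps(sorted(self._keys)))


# In the top-level flow: filter BEFORE batching, so every batch
# contains only unprocessed entries and no per-batch cache key is needed.
registry = ProcessedRegistry()
inputs = [{"id": 1}, {"id": 2}, {"id": 3}]
registry.mark_done({"id": 2})  # pretend this one ran in an earlier run
todo = [item for item in inputs if not registry.is_done(item)]
```

After each sub-deployment finishes an item, the flow would call `registry.mark_done(item)`, so the registry only ever reflects completed work and never goes stale the way a cached "is this processed?" check would.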
👍 1