I have a pipeline that takes a list of inputs and runs a model on each input. The inputs are handled by a top-level flow that batches them (for performance reasons) and passes them to sub-deployments. Some inputs may have already been processed previously, and I'd like to skip those. I was thinking of using the built-in Prefect caching for that, but the batching gets in the way (a batch is often a random mix of processed and unprocessed entries). I was thinking of hooking directly into the caching mechanism and handling caching manually at the input level. Would that make sense? Or maybe there's some clever trick with the built-in caching I could use.
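For context, roughly this shape (a minimal sketch, assuming Prefect 2.x; the deployment name, batch size, and input type are made up):

```python
from prefect import flow
from prefect.deployments import run_deployment


def batched(items: list[str], size: int):
    """Yield fixed-size batches of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


@flow
def top_level_flow(inputs: list[str], batch_size: int = 32):
    for batch in batched(inputs, batch_size):
        # Each batch is handed off to a sub-deployment that runs the model.
        run_deployment(
            name="model-flow/model-deployment",  # hypothetical deployment name
            parameters={"batch": batch},
        )
```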
Janet Carson
09/11/2024, 7:23 PM
Can the first flow just figure out which are already processed and not dispatch them at all?
Paweł Biernat
09/12/2024, 7:13 AM
Yes, that's the plan; I was just wondering if there's a way to use caching to figure out which ones have already been run. I wanted to avoid output checks, as those can be awkward to implement because I'd need to deduce the file name from the inputs. I'm looking for a more black-box approach.
Janet Carson
09/12/2024, 4:19 PM
If you cache the result of checking "is this file already processed", then after you process the file that cached result is wrong. It really sounds like you should use your own caching rather than hook into Prefect's for this.
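For example, something along these lines (a rough sketch, not Prefect's caching API; the ledger path, key function, and helper names are all hypothetical, and it assumes each input can be keyed by a stable hash of its content):

```python
import hashlib
import json
from pathlib import Path

LEDGER = Path("processed_inputs.json")  # hypothetical ledger location


def input_key(item: str) -> str:
    """Stable key for an input, independent of how inputs are batched."""
    return hashlib.sha256(item.encode()).hexdigest()


def load_ledger() -> set[str]:
    if LEDGER.exists():
        return set(json.loads(LEDGER.read_text()))
    return set()


def mark_processed(items: list[str]) -> None:
    """Record inputs as processed so future runs skip them."""
    ledger = load_ledger() | {input_key(i) for i in items}
    LEDGER.write_text(json.dumps(sorted(ledger)))


def filter_unprocessed(inputs: list[str]) -> list[str]:
    """Drop inputs whose key is already in the ledger."""
    seen = load_ledger()
    return [i for i in inputs if input_key(i) not in seen]
```

The top-level flow would call `filter_unprocessed` before batching and `mark_processed` after each sub-deployment finishes, so the cache lives at the input level rather than the batch level.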