I have a pipeline that takes a list of inputs and runs a model on each input. The inputs are handled by a top-level flow that batches them (for performance reasons) and passes them to sub-deployments. Some inputs may have already been processed previously, and I'd like to skip those. I was thinking of using the built-in Prefect caching for that, but the batching gets in the way (a batch is often a random mix of processed and unprocessed entries). I was thinking of hooking directly into the caching mechanism and handling caching manually at the input level. Would that make sense? Or maybe there's some clever trick with the built-in caching I could use.
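For context, roughly this shape (a minimal sketch, assuming Prefect 2.x; the deployment name, batch size, and input type are made up):

```python
from prefect import flow
from prefect.deployments import run_deployment


def batched(items: list[str], size: int):
    """Yield fixed-size batches of items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


@flow
def top_level_flow(inputs: list[str], batch_size: int = 32):
    for batch in batched(inputs, batch_size):
        # Each batch is handed off to a sub-deployment that runs the model.
        run_deployment(
            name="model-flow/model-deployment",  # hypothetical deployment name
            parameters={"batch": batch},
        )
```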
Janet Carson
09/11/2024, 7:23 PM
Can the first flow just figure out which are already processed and not dispatch them at all?
Paweł Biernat
09/12/2024, 7:13 AM
Yes, that's the plan; I was just wondering if there's a way to use caching to figure out which ones have already been run. I wanted to avoid output checks, as those can be awkward to implement because I'd need to deduce the file name from the inputs. I'm looking for a more black-box approach.
Janet Carson
09/12/2024, 4:19 PM
If you cache the result of checking "is this file already processed", then after you process the file that cached result is wrong. It really sounds like you should use your own caching rather than hook into Prefect's for this.
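For example, something along these lines (a rough sketch, not Prefect's caching API; the ledger path, key function, and helper names are all hypothetical, and it assumes each input can be keyed by a stable hash of its content):

```python
import hashlib
import json
from pathlib import Path

LEDGER = Path("processed_inputs.json")  # hypothetical ledger location


def input_key(item: str) -> str:
    """Stable key for an input, independent of how inputs are batched."""
    return hashlib.sha256(item.encode()).hexdigest()


def load_ledger() -> set[str]:
    if LEDGER.exists():
        return set(json.loads(LEDGER.read_text()))
    return set()


def mark_processed(items: list[str]) -> None:
    """Record inputs as processed so future runs skip them."""
    ledger = load_ledger() | {input_key(i) for i in items}
    LEDGER.write_text(json.dumps(sorted(ledger)))


def filter_unprocessed(inputs: list[str]) -> list[str]:
    """Drop inputs whose key is already in the ledger."""
    seen = load_ledger()
    return [i for i in inputs if input_key(i) not in seen]
```

The top-level flow would call `filter_unprocessed` before batching and `mark_processed` after each sub-deployment finishes, so the cache lives at the input level rather than the batch level.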