
Fernando Silveira

11/17/2022, 6:21 PM
šŸ‘‹ Hi all, I'm evaluating Prefect 2 for a few ML data pipelines on my team, and I'm a little confused about the right way to achieve what I need. This feels like a really basic use case, so I'm sure I'm just missing something obvious. To keep things simple, I have an initial data ingestion flow that brings data into S3 (let's call it `ingestion_flow`). A couple of notes about this:
• `ingestion_flow` runs every day and ingests the data partition for that date. The flow is parameterized with the `date` to ingest.
• The flow is organized into a few tasks that ingest data from different clients, so it just runs a for loop calling `client_ingestion_task(client, date)` for each `client`.

Once ingested, this data is useful for a bunch of downstream pipelines, so I'd like to re-use `ingestion_flow` as a starting point for two other flows, say, `training_flow_1` and `training_flow_2`. The way I thought this could be accomplished was by simply running `ingestion_flow` as a subflow at the beginning of both `training_flow_{1,2}`.

Obviously I don't want to ingest the same data twice, so I created a cache key for `client_ingestion_task` based on the `client` and `date` parameters. However, I quickly stumbled on the fact that task caches rely only on local storage. I'm running these flows in Kubernetes, and I was hoping that when the flows run concurrently, ingestion would happen only once and the cached state would be re-used by the second flow to run. The fact that task caches are restricted to local storage tells me this is not the way to implement what I wanted. Can someone point me in the right direction here? Happy to discuss more details if that's needed.
āœ… 1

Khuyen Tran

11/17/2022, 7:31 PM
It seems like your question is about how to store your results in remote storage. Here is how you can do that.

Fernando Silveira

11/17/2022, 7:33 PM
Thanks a lot, @Khuyen Tran. That looks exactly like what I needed to know.
šŸŽ‰ 1