Fernando Silveira
11/17/2022, 6:21 PM
I have a daily ingestion flow (let's call it ingestion_flow). A couple of notes about this:
• ingestion_flow runs every day and ingests a data partition for that date. The flow is parameterized with the date to ingest.
• the flow is organized into a few tasks that ingest data from different clients, so it just runs a for loop calling client_ingestion_task(client, date) for each client (roughly as in the sketch below)
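Roughly, the setup looks like this (the CLIENTS list, the task body, and the cache expiration are placeholders, not the real code):

```python
from datetime import timedelta
from prefect import flow, task

# Placeholder client list; the real one comes from elsewhere.
CLIENTS = ["client_a", "client_b", "client_c"]

def client_date_cache_key(context, parameters):
    # Cache key built only from the client and date parameters,
    # so re-running the same partition should hit the cache.
    return f"{parameters['client']}-{parameters['date']}"

@task(cache_key_fn=client_date_cache_key, cache_expiration=timedelta(days=7))
def client_ingestion_task(client: str, date: str):
    ...  # ingest the data partition for this client and date

@flow
def ingestion_flow(date: str):
    for client in CLIENTS:
        client_ingestion_task(client, date)
```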
Once ingested, this data is useful for a bunch of downstream pipelines, so I'd like to re-use ingestion_flow as a starting point for two other flows, say, training_flow_1 and training_flow_2.
The way I thought this could be accomplished was by simply running ingestion_flow as a subflow at the beginning of both training_flow_{1,2} (see the sketch below). Obviously I don't want to ingest the same data twice, so I created a cache key for client_ingestion_task based on the client and date parameters. However, I quickly ran into the fact that task caches rely only on local storage. I'm running these flows in Kubernetes, and I was hoping the flows could run concurrently, with ingestion happening only once and the cached state being re-used by whichever flow runs second.
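In other words, something like this (train_model_1 is just a placeholder for the actual training step):

```python
from prefect import flow, task

# ingestion_flow as defined in the earlier sketch.

@task
def train_model_1(date: str):
    ...  # placeholder for the actual training logic

@flow
def training_flow_1(date: str):
    # Calling ingestion_flow here runs it as a subflow; ideally the
    # cached client_ingestion_task runs would be re-used if a
    # concurrent training_flow_2 already ingested this date.
    ingestion_flow(date)
    train_model_1(date)
```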
The fact that task caches are restricted to local storage tells me this is not the way to implement what I wanted. Can someone point me in the right direction here? Happy to discuss more details if that's needed.