Fernando Silveira
11/17/2022, 6:21 PM
ingestion_flow). A couple of notes about this:
• ingestion_flow runs every day and ingests a data partition for that date. The flow is parameterized with the date to ingest.
• The flow is organized into a few tasks that ingest data from different clients, so the flow just runs a for loop calling client_ingestion_task(client, date) for each client.
Once ingested, this data is useful for a bunch of downstream pipelines, so I'd like to re-use ingestion_flow as a starting point for two other flows, say, training_flow_1 and training_flow_2.
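The structure described above can be sketched roughly as follows. This is a minimal illustration, not the poster's actual code: the Prefect decorators are shown as comments so the sketch runs standalone, and the client names and the task body are hypothetical placeholders.

```python
# Sketch of the flow layout described in the question.
# In real Prefect 2 code, the commented-out @task/@flow decorators
# (from `prefect import flow, task`) would be applied.

# @task  # in the real flow, cached on (client, date); see below
def client_ingestion_task(client, date):
    # Placeholder for the actual per-client ingestion work.
    return f"ingested {client} partition for {date}"

# @flow
def ingestion_flow(date, clients=("client_a", "client_b")):  # client list is hypothetical
    # The for loop over clients described in the question.
    return [client_ingestion_task(client, date) for client in clients]

# @flow
def training_flow_1(date):
    # Calling one flow from another makes it a subflow in Prefect 2.
    partitions = ingestion_flow(date)
    # ... training steps using the ingested partitions would go here ...
    return partitions
```

With real decorators applied, calling `ingestion_flow(date)` inside `training_flow_1` would create a subflow run, which is the composition pattern the question is asking about.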
The way I thought this could be accomplished was by simply running ingestion_flow as a subflow at the beginning of both training_flow_{1,2}. Obviously I don't want to ingest the same data twice, so I created a cache key for client_ingestion_task using the client and date parameters. However, I quickly stumbled upon the fact that task caches only rely on local storage. I'm running these flows in Kubernetes, and I was hoping for the flows to run concurrently while ensuring that ingestion only happens once and the cached state gets re-used by the second flow to run.
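For reference, the cache key on (client, date) could look something like the sketch below. This is an assumed shape, not the poster's code: in Prefect 2 a task's `cache_key_fn` receives the task-run context and the task's parameters and returns a string key, and task runs producing the same key re-use the cached result.

```python
# Hypothetical cache-key function for client_ingestion_task.
# Runs with the same (client, date) pair produce the same key and
# would therefore hit the cache instead of re-ingesting.
def client_date_cache_key(context, parameters):
    return f"client-ingestion-{parameters['client']}-{parameters['date']}"

# It would be attached to the task roughly like:
#   @task(cache_key_fn=client_date_cache_key)
#   def client_ingestion_task(client, date): ...
```

Note the key itself is shared via the Prefect backend; it is the cached *result* whose storage location raises the local-storage concern described here.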
The fact that task caches are restricted to local storage tells me this is not the way to implement what I wanted. Can someone point me in the right direction here? Happy to discuss more details if that's needed.

Khuyen Tran
11/17/2022, 7:31 PM

Fernando Silveira
11/17/2022, 7:33 PM