
Fernando Silveira

11/17/2022, 6:21 PM
šŸ‘‹ Hi all, I'm evaluating Prefect 2 for a few ML data pipelines on my team, and I'm a little confused about the right way to achieve what I need. This feels like a really basic use case, so I'm sure I'm just missing something obvious. To keep things simple, I have an initial data ingestion flow that brings data into S3 (let's call it `ingestion_flow`). A couple of notes about this:
• `ingestion_flow` runs every day and ingests the data partition for that date. The flow is parameterized with the `date` to ingest.
• The flow is organized into a few tasks that ingest data from different clients, so it just runs a for loop calling `client_ingestion_task(client, date)` for each `client`.

Once ingested, this data is useful for a bunch of downstream pipelines, so I'd like to re-use `ingestion_flow` as a starting point for two other flows, say, `training_flow_1` and `training_flow_2`. The way I thought this could be accomplished was by simply running `ingestion_flow` as a subflow at the beginning of both `training_flow_{1,2}`.

Obviously I don't want to ingest the same data twice, so I created a cache key for `client_ingestion_task` based on the `client` and `date` parameters. However, I quickly stumbled on the fact that task caches rely only on local storage. I'm running these flows in Kubernetes, and I was hoping that when the flows run concurrently, ingestion would happen only once and the cached state would be re-used by the second flow to run. The fact that task caches are restricted to local storage tells me this is not the way to implement what I wanted. Can someone point me in the right direction here? Happy to discuss more details if that's needed.
āœ… 1

Khuyen Tran

11/17/2022, 7:31 PM
It seems like your question is about how to store your results in remote storage. Here is how you can do that.

Fernando Silveira

11/17/2022, 7:33 PM
Thanks a lot, @Khuyen Tran. That looks exactly like what I needed to know.
šŸŽ‰ 1