
Daniel Rothenberg

12/03/2020, 3:41 PM
Hi community - I have a pair of tandem flows that form a crude training/prediction workflow. One flow is intended to run once a day, train an ML model, and archive its output to Google Cloud Storage; the other is intended to run once an hour, grabbing the latest trained model it finds in GCS along with the latest input data from the past hour, making a prediction, and shipping that off to a database. Prefect has worked amazingly for this so far... it was super easy to integrate an agent into my team's k8s cluster, which has a lot of useful infrastructure including shared NFS mounts between pods and a dedicated `dask-gateway` to spin up distributed workflows. Major kudos to the team!

I have a question though about scaling out this workflow. In my particular application, I basically need to apply this training/prediction workflow to O(10,000) different "assets". Put another way - my flows each have a `Parameter` called "asset". I see two different ways of scaling from my toy with one asset to the much larger set of a few thousand:
1. Use the `.map()` capabilities on my tasks and simply "map" a list of "assets" for the flow to run through; or
2. Convert my "asset" `Parameter` to a list of all the assets and do all the mapping inside each of my tasks in the workflow, leaning on the fact that each task already has access to the dask cluster the workflow runs on.

I'm not sure which is the "recommended" approach here. The training/modeling flow generates, for each asset, artifacts on the order of 1 MB or so that have to get passed around, and I'm worried about the performance of Prefect's scheduler if it needs to map tasks over a list of several thousand items. On the other hand, things get... complicated (with my naive workflow for saving trained models and whatnot) if I have to do all the book-keeping manually inside each task in the flow. Thoughts? Does anyone have experience scaling out ML training/prediction workflows on Prefect who can offer some insight?
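A minimal sketch of option 1, assuming the Prefect 0.13-era API in use around the time of this thread; the task bodies, flow name, and Dask scheduler address are placeholders:

```python
from prefect import task, Flow, Parameter
from prefect.engine.executors import DaskExecutor  # moved to prefect.executors in 0.14

@task
def train_model(asset):
    # placeholder: train a model for a single asset and return the artifact
    ...

@task
def archive_to_gcs(asset, model):
    # placeholder: write the ~1 MB artifact for this asset to GCS
    ...

with Flow("train-all-assets") as flow:
    assets = Parameter("assets", default=["asset-001", "asset-002"])
    models = train_model.map(assets)
    archive_to_gcs.map(assets, models)

# the mapped task runs execute on the existing dask-gateway cluster
flow.run(executor=DaskExecutor(address="tls://dask-gateway-scheduler:8786"))
```

Each element of `assets` becomes its own task run, so retries and failures are tracked per asset. Option 2 would instead keep the flow at a single task and fan out inside it; a sketch of that, assuming the task executes on a Dask worker so `worker_client` can reach the cluster:

```python
from dask.distributed import worker_client
from prefect import task

@task
def train_all(assets):
    # fan out per-asset work directly on the dask cluster the task runs on;
    # train_one is a placeholder per-asset training function
    with worker_client() as client:
        futures = client.map(train_one, assets)
        return client.gather(futures)
```

The trade-off Daniel alludes to: with option 2, Prefect sees only one task, so per-asset bookkeeping (artifacts, retries, state) has to be handled by hand inside it.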

Dylan

12/03/2020, 3:52 PM
Hi @Daniel Rothenberg! I’m glad you’re having a good experience so far! 😄 There are a few different approaches you might think about:
1. Using `map` to map over a list of assets within a single Flow run will work. As long as the k8s job has the resources available, the Flow Run can orchestrate thousands of mapped Task Runs. Unlike some other workflow tools, the centralized Prefect Scheduler (in your Prefect Server instance or Prefect Cloud) is only responsible for scheduling Flow Runs. The Flow Run itself is responsible for scheduling Task Runs, so you won’t overload a central scheduler 👍
2. If you want to keep your per-asset processing isolated, you could use an Orchestrator pattern to have a parent flow kick off a Flow Run per asset (see the sketch below). This will create many k8s jobs in your cluster, but the assets will be processed in isolation, and a failure to process one of them will not affect the others. Instead of scaling your resources for a single Flow Run, you’ll need your k8s cluster to handle many jobs, so it’s usually best to turn on auto-scaling.
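A sketch of the orchestrator pattern from point 2, assuming a registered child flow that takes the single "asset" `Parameter`; the flow and project names here are placeholders:

```python
from prefect import task, Flow, Parameter
from prefect.tasks.prefect import StartFlowRun

# one child flow run per asset; wait=True makes the parent task reflect the
# child run's final state, so a failed asset surfaces in the parent flow
start_training = StartFlowRun(
    flow_name="train-one-asset",
    project_name="ml-pipelines",
    wait=True,
)

@task
def build_parameters(assets):
    return [{"asset": a} for a in assets]

with Flow("train-orchestrator") as parent:
    assets = Parameter("assets", default=["asset-001", "asset-002"])
    start_training.map(parameters=build_parameters(assets))
```

Under a Kubernetes agent, each child run becomes its own k8s job, which is why auto-scaling the cluster matters for this route.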

Daniel Rothenberg

12/03/2020, 3:55 PM
Thanks @Dylan! This is super helpful

Dylan

12/03/2020, 4:03 PM
Anytime!

Pedro Machado

12/04/2020, 12:50 AM
Hi Daniel. I'd be interested in learning what approach worked best for you.