# prefect-community
Hi all, thanks for making this amazing project for the world. A quick intro: I work at a computer vision company that works with gigapixel images, and I am doing a preliminary investigation of tools that can manage our machine learning workflows. Prefect looks very interesting. A question: is there a web page where I can better understand the current level of adoption "in production systems" for Prefect? I read a couple of blog posts from early users and (presumably) customers. That said, I wonder if there's one that would give a more holistic, up-to-date view. Thank you.
Hi @dh! We have some materials in the works, but let me ask the team to see if we have any resources we can share more immediately! 🙂
Looks like we are very close to wrapping up our first case study! We'll reach out as soon as we can share it with you!
Thank you @Zachary Hughes. That would be lovely.
Hi @dh, we've been a long-time user of Prefect for machine learning pipelines and have been running them in production for 9 months. If I can answer any questions, feel free to post here or DM me any time.
Hi @Joe Schmid, thank you so much for being willing to share. For context, we have a hybrid infrastructure where training happens exclusively on-prem (Nvidia DGX-1 system) with datasets of 10-100 terabytes each, and CI/CD + deployment lives in the AWS ecosystem (namely, AWS EKS). With that said, may I ask a few questions? Any advice would be super useful.
1. Does your entire ML pipeline live in the Cloud ecosystem? If you have any portion on-prem, could you shed light on any challenges / takeaways you had integrating Prefect?
2. Does your training pipeline use multi-node/host “and” multi-GPU training? If you’re using the Dask executor for managing your cluster, could you share any takeaways/challenges you faced using Prefect+Dask?
please let me know if you or anyone else needs more context.
@dh for 1. we are primarily using Cloud (EKS also and some Fargate more recently). The only "on premise" we do is local development on data scientists' and data engineers' laptops. However, Prefect's pluggable executors and environments make that pretty easy, e.g. a Flow doesn't care if it's using DaskExecutor on a large Dask cluster in the Cloud or just running locally in-process.
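The "a Flow doesn't care which executor it runs on" point can be illustrated without Prefect at all. The following is only a minimal sketch of the pluggable-executor idea using Python's standard `concurrent.futures` as a stand-in; `InProcessExecutor`, `square`, and `run_flow` are hypothetical names made up for this illustration and are not part of Prefect's or Dask's API.

```python
from concurrent.futures import Executor, Future, ThreadPoolExecutor


class InProcessExecutor(Executor):
    """Hypothetical local executor: runs each call immediately in the caller's thread."""

    def submit(self, fn, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:  # surface the failure through the future
            future.set_exception(exc)
        return future


def square(n):
    """A stand-in for a unit of work (a 'task')."""
    return n * n


def run_flow(executor, numbers):
    """A stand-in for a 'flow': it only relies on the generic Executor interface."""
    with executor:
        futures = [executor.submit(square, n) for n in numbers]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # Same flow logic, two different execution backends.
    print(run_flow(InProcessExecutor(), [1, 2, 3]))
    print(run_flow(ThreadPoolExecutor(max_workers=2), [1, 2, 3]))
```

The flow logic stays the same while the backend is swapped, which is the property that makes local development and a large remote cluster interchangeable from the flow's point of view.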
@dh for 2. we definitely do a lot of multi-node Dask clusters in the Cloud. We'll run experiments on 100-node Dask clusters using EKS. We have done multi-GPU training, and part of our pipeline at the moment needs Dask workers with high memory (128GB or 256GB) -- Prefect's ability to tag tasks with Dask resource tags works incredibly well for those, i.e. this task has to run on a Dask worker with a GPU or on a high-memory worker, etc.
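To make the resource-tag idea concrete, here is a rough sketch of how such tags could be turned into scheduling constraints. It assumes the tag format `dask-resource:KEY=NUM` (for example `dask-resource:GPU=1`) and workers that advertise resources such as `GPU=2`. The helper names `resources_from_tags` and `eligible_workers` are hypothetical; this is plain Python for illustration, not Prefect's or Dask's actual implementation.

```python
# Assumed tag prefix; any tag without it is treated as an ordinary label.
PREFIX = "dask-resource:"


def resources_from_tags(tags):
    """Collect resource requirements from tags like 'dask-resource:GPU=1'."""
    resources = {}
    for tag in tags:
        if tag.lower().startswith(PREFIX):
            name, _, amount = tag[len(PREFIX):].partition("=")
            resources[name] = float(amount)
    return resources


def eligible_workers(required, workers):
    """Return the names of workers that advertise at least the required resources."""
    return sorted(
        name
        for name, available in workers.items()
        if all(available.get(key, 0) >= value for key, value in required.items())
    )


if __name__ == "__main__":
    # Hypothetical cluster: one plain worker, one GPU worker, one high-memory worker.
    workers = {
        "cpu-1": {},
        "gpu-1": {"GPU": 2},
        "mem-1": {"MEMORY": 256e9},
    }
    required = resources_from_tags(["dask-resource:GPU=1", "nightly"])
    print(required)
    print(eligible_workers(required, workers))
```

A task without any resource tags yields an empty requirement and can land on any worker, while a tagged task is restricted to workers that declare the matching resource.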
The main issues with getting that all working aren't related to Prefect, just the usual Cloud resource configuration stuff. For folks that haven't worked with k8s, EKS, etc. it can be a fairly high bar to learn all of that to get a distributed Dask cluster running in the Cloud, but it sounds like you're already quite familiar with those technologies so I suspect you'll have a fairly easy time.
We've found the Prefect team and community to be incredibly supportive and helpful, and the same for the Dask community. Prefect + Dask has been a huge win for us for machine learning on complex healthcare data sets. (Should be a case study coming out soon that @Zachary Hughes mentioned above.)
@Joe Schmid Thank you Joe for sharing fantastic advice. It’s a relief to hear that you didn't encounter a big hurdle while using Prefect in cloud / local environments. Our on-prem HPC resources are managed by SLURM, which Dask seems to have an adapter for [1]. Looks like we will have an interesting use case for Prefect and Dask. I did not know about tagging and worker resources --- thanks for sharing. Regarding setting up the k8s stuff, I must say it has not been a smooth journey. :-)
[1]: https://docs.dask.org/en/latest/setup/hpc.html
Thank you once again!
@dh Glad it was helpful -- feel free to ask more questions any time.
> Regarding setting up the k8s stuff, I must say it has not been a smooth journey. 🙂
I'm right there with you. Now that we're up and running it's great, but it wasn't the easiest process getting to this point.
> Our on-prem HPC resources are managed by SLURM which Dask seems to have an adapter for [1]
Yes, Dask has very good support for running on HPCs and my sense is that SLURM is one of the more popular resource managers. While I don't have direct experience of running Prefect & Dask on an HPC, I suspect you'll have a good experience, with good docs & a large community of Dask users on HPCs.
Awesome. Once I start integrating Prefect + Dask for a PoC on our system, I will definitely share more!