
Thomas La Piana

02/10/2020, 9:56 AM
hi all, I'm looking into adopting prefect for DS workflows instead of trying to shoehorn them into airflow. does anyone have some experience with this? has anyone found it slowing down workflows compared to something like kubeflow? does prefect core have a web ui, or does it only support the dask ui?

Joe Schmid

02/10/2020, 2:06 PM
Hi @Thomas La Piana, this is exactly our use case - data science workflows executed by Prefect on Dask. We use Prefect Cloud now, but originally we just used Prefect Core very successfully. (Core doesn't have a UI, but we kicked off Flow runs from Jupyter notebooks and, like you mentioned, used the Dask web dashboard to monitor them.) We are also long-time Airflow users (~4 years) and have found Prefect to be a much better fit for our DS pipelines. In particular, with Prefect & Dask we're able to run train/test experiments very quickly and with high parallelism. I wrote up a high-level blog post about our work: https://www.symphonyrm.com/galaxy-brain-data-science-workflows-with-prefect-dask/
Definitely let me know if I can answer more questions, either here or via DM.

Thomas La Piana

02/10/2020, 2:08 PM
thanks joe! i actually ran across your blog post when browsing around and i loved it, thanks a bunch for taking the time to write it. I'm particularly curious about your cloud vs. core experience. does core feel like the "first free hit" that then makes you want/need cloud for the real production stuff, or was everything running pretty flawlessly in core and you decided you'd appreciate the extra features? coming from airflow it feels like a weird split to have features like the UI locked away (i know a business is a business), i just want to make sure I don't feel like I'm not getting what I want out of prefect without paying for the cloud version
im also curious about what your DS workflow looks like locally and how smooth the transition is to running things in production. I'm trying to help our nascent DS program get off the ground (im a DE just helping out on the infra/engineering side) and want to hear what its like
thanks again for taking the time to answer my questions!

Joe Schmid

02/10/2020, 2:24 PM
@Thomas La Piana, no worries at all - happy to help. Here's my personal take on Core vs. Cloud: if your DS work is mainly focused on model development/R&D then Core will likely be sufficient. If your workflows need to go into production, run on a schedule, handle & monitor failures, etc. then you'll want to address many of the things that Cloud provides, i.e. scheduling, some way to monitor, etc. You can certainly do many of those things on your own without Cloud (e.g. use cron or some other scheduler to kick off Flow runs, your own logging, etc.). For us it came down to this:
1. The advantages of using Prefect for DS workflows were substantial (great scalability & fast execution on Dask, simple API, parameterized workflows that made it easy for data scientists, etc.)
2. The buy vs. build decision on Cloud was a no-brainer -- we could have built scheduling, etc. but the cost for Cloud was very reasonable for the value.
Having said all of that, I don't think you have to use Cloud from the outset. If I were in your shoes, I think I'd do a simple Proof-of-Concept with Core, get some experience with it, and then be able to make a more informed decision.
Also, I didn't answer this part of your question:
"im also curious about what your DS workflow looks like locally and how smooth the transition is to running things in production."
Prefect is really flexible about how you run Flows; there are 3 modes that we run in:
1. Run the Flow directly in a notebook by passing the hostname:port of our Dask scheduler to Prefect's DaskExecutor (development)
2. Register the Flow with Cloud using S3 storage, passing the hostname:port of our Dask scheduler to Prefect's RemoteEnvironment, then run Flows with Cloud (development)
3. Register the Flow with Cloud using Docker storage, passing the hostname:port of our Dask scheduler to Prefect's RemoteEnvironment, then run Flows with Cloud (production)
We set up our Dask cluster using Kubernetes (and the dask-kubernetes project) on AWS EKS. It also runs JupyterLab in the same k8s environment. That has worked very well for us. (e.g. we use cluster-autoscaler to scale nodes for Dask workers up and down dynamically.)
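[Editor's note] The key idea in the modes above is that the flow definition never changes — only the executor (and storage) you hand it at run time does. A stdlib-only sketch of that executor-swap pattern, with invented task names (Prefect's DaskExecutor plays the role of the pool here; this is a conceptual analogy, not Joe's actual code):

```python
# Conceptual sketch: define the "flow" once, choose the executor at run time.
# In Prefect terms, dev vs. prod differ only in which executor/storage you
# pass, not in the flow code itself. All names below are hypothetical.
from concurrent.futures import Executor, ThreadPoolExecutor

def extract():
    # stand-in for a data-loading task
    return [1, 2, 3]

def transform(x):
    # stand-in for a per-item task
    return x * 10

def run_flow(executor: Executor):
    data = extract()
    # fan the transform step out across whatever executor was supplied,
    # the way a mapped Prefect task fans out across Dask workers
    return list(executor.map(transform, data))

# development: a small local pool; production: swap in a bigger/remote one
with ThreadPoolExecutor(max_workers=4) as pool:
    results = run_flow(pool)
```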

Preston Marshall

02/10/2020, 2:44 PM
Cool, so I’m guessing you’re using services to expose the cluster on k8s. I’m curious why y’all went with k8s over fargate?

Joe Schmid

02/10/2020, 2:49 PM
Yup, exactly. Great question on k8s vs. fargate. We use a shared network volume (AWS EFS) that gets mounted as a persistent volume on all pods -- Jupyter, Dask scheduler, & Dask workers. This serves two purposes: (1) our own internal Python code is available on all pods and (2) we store feature engineering results on this volume so that our pipelines very rarely have to re-run those steps, i.e. we can focus on train/test if features haven't changed. (You could also store them in S3 though.) Fargate doesn't support persistent volumes right now. If I were starting today, though, I might start with Fargate, and I know other Prefect users are using it (Fargate) successfully.
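[Editor's note] In Kubernetes terms, the shared-EFS setup described above is roughly a `ReadWriteMany` PersistentVolumeClaim mounted into every pod. A minimal sketch, assuming an EFS-backed storage class is already configured; the claim name, storage class, image, and mount path are all hypothetical:

```yaml
# PVC backed by EFS; ReadWriteMany lets Jupyter, the Dask scheduler,
# and every Dask worker mount the same volume at the same time.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-efs           # hypothetical name
spec:
  accessModes:
    - ReadWriteMany          # the key property EFS provides
  storageClassName: efs-sc   # assumes an EFS storage class exists
  resources:
    requests:
      storage: 100Gi
---
# Excerpt of a worker pod spec mounting that claim
apiVersion: v1
kind: Pod
metadata:
  name: dask-worker-example
spec:
  containers:
    - name: worker
      image: daskdev/dask:latest
      volumeMounts:
        - name: shared
          mountPath: /mnt/shared   # internal code + cached features live here
  volumes:
    - name: shared
      persistentVolumeClaim:
        claimName: shared-efs
```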

Preston Marshall

02/10/2020, 3:03 PM
makes sense, thanks

Braun Reyes

02/10/2020, 3:42 PM
EFS support for Fargate will hopefully be with us in the first half of this year 🤞 EFS for ECS on EC2 is in preview right now. Do you think you would want to run tasks on ECS on EC2? I feel like the Fargate Agent could be adapted to support EC2 and even AWS Batch eventually
⬆️ 1
We use the fargate agent and have not needed the persistent storage yet.

Thomas La Piana

02/10/2020, 3:52 PM
@Joe Schmid you mentioned wanting prefect cloud for things like scheduling, monitoring, handling failures, etc. this might sound crazy, but did you consider just running prefect with Airflow?
the DS team is also seriously looking at kubeflow, i have less experience with it but it honestly seems like way too much for our scale
i love prefect but if i have to pay for things that airflow already provides, and considering we already have a robust airflow deployment, it seems maybe the best option is to just combine the two?

Joe Schmid

02/10/2020, 4:11 PM
Running Prefect flows from Airflow to handle scheduling is not totally crazy, but I suspect you'd end up having to address some aspects, e.g. when you scale with Dask the logs for tasks will reside out on Dask workers -- you'll want some way to aggregate and store those logs in a way that makes it easy to view what happened for a failed Flow or Task.
"if i have to pay for things that airflow already provides"
I totally get where you're coming from. While there's a lot of overlap on the surface, I think Airflow & Prefect are a bit apples & oranges. To be specific, we viewed Airflow as far less of a fit for data science where fast iterations during model development are critical, e.g. the Airflow scheduler typically has pretty high inter-task latency and doesn't easily support dynamic flow concepts like mapping. While there are things you can do to try to address that, being able to have very fast execution and very easy scalability (train/test in parallel) and then easily put the same Flow into production was a huge advantage. For us, I wanted to make sure we had a tool that did the most important job well (speed & scalability for data science) and then be able to address other aspects. (BTW, definitely not saying Prefect is for sure the right choice for your scenario, just trying to share our perspective and thoughts. It's interesting to talk to someone considering Prefect for data science since I think the majority of Prefect users are focused on DE/ETL. I actually think Prefect for DS is going to be huge and once people start to understand the advantages there will be a lot of adoption of it for DS.)
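[Editor's note] The "train/test in parallel" point above is the crux of the DS argument: Prefect's task mapping fans one step out over many inputs at once, where a scheduler with high inter-task latency makes each fan-out expensive. A stdlib-only sketch of that fan-out (the toy "model", parameter grid, and scoring formula are invented purely for illustration):

```python
# Stdlib sketch of the fan-out behind fast parallel train/test: one step
# mapped over a hyperparameter grid, analogous to Prefect's task mapping
# spreading work across Dask workers. Names and scores are made up.
from concurrent.futures import ThreadPoolExecutor

def train_and_score(params):
    # stand-in for a real fit/evaluate step
    depth, lr = params
    return {"params": params, "score": round(1.0 / (depth * lr), 3)}

# a small grid: 3 depths x 2 learning rates = 6 experiments
grid = [(d, lr) for d in (2, 4, 8) for lr in (0.1, 0.01)]

# all six experiments run concurrently instead of one scheduler tick apiece
with ThreadPoolExecutor(max_workers=6) as pool:
    results = list(pool.map(train_and_score, grid))

best = max(results, key=lambda r: r["score"])
```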
💯 2

Braun Reyes

02/10/2020, 4:39 PM
For me it is all about Cost of Ownership. I bootstrapped an Airflow on Kubernetes setup and absolutely hated it! I was spending so much time making sure airflow was working and less time actually building on it. Some people live by it, which I think is great. For us, I found the investment in a user interface with a robust API, metrics, logging, and auth out of the box was well worth it.
👍 1
☝️ 1
once we get EFS or Lustre on Fargate and they implement the planned decrease in startup they talked about at re:invent..gonna be 🚀
🚀 1

Thomas La Piana

02/10/2020, 4:52 PM
interesting, @Joe Schmid i guess it seems logical to me to run this in Airflow actually after hearing this. Like you said, the Airflow latency is awful, but i'd imagine that when developing locally you would use prefect directly, and then when it was time to productionalize it would get run as an Airflow DAG where that stuff didn't matter as much (at least to us). we also use dbt for our model transformations, and it also has its own scheduler/DAGs; Airflow is just responsible for kicking it off. We use the AzureContainerInstances operator so i'm already used to having airflow jobs run remotely and pass the logs back up, i'd think this would be just as possible with prefect. just curious, what other DS workflow tools did you look at? Prefect came to mind because I evaluated it a year or so ago, but i don't have much experience with other DS workflow tools

Joe Schmid

02/10/2020, 4:58 PM
@Thomas La Piana we started looking at Prefect & Dask over a year ago so some of the more recent options (Metaflow, Kedro) weren't announced or were very early so I'm not the best to comment on those. My impression is that they tend to be tailored to a particular use case or infrastructure, e.g. Metaflow is focused on data science on AWS. Prefect just has really good generic building blocks that you can apply to any data-centric use case & infrastructure. (BTW, that makes sense on dbt. We just started adopting it for data standardization and really like it.)

Thomas La Piana

02/10/2020, 5:00 PM
in my mind prefect is like dbt for data scientists. my analysts use dbt locally to test and to run things, and then in prod airflow handles triggering it, retrying it, etc. i'm imagining a similar workflow for my data scientists but utilizing prefect instead
since everything is containerized anyway, i shouldn't have any sticky "it runs locally but not in prod" scenarios i'd think...

Chris White

02/11/2020, 4:24 PM
@Marvin archive “Prefect vs. Airflow”