Thomas La Piana
02/10/2020, 9:56 AM
Joe Schmid
02/10/2020, 2:06 PM
Thomas La Piana
02/10/2020, 2:08 PM
Joe Schmid
02/10/2020, 2:24 PM
"im also curious about what your DS workflow looks like locally and how smooth the transition is to running things in production."
Prefect is really flexible about how you run Flows; there are three modes that we run in:
1. Run the Flow directly in a notebook by passing the hostname:port of our Dask scheduler to Prefect's DaskExecutor (development)
2. Register the Flow with Cloud using S3 storage, passing the hostname:port of our Dask scheduler to Prefect's RemoteEnvironment, then run Flows with Cloud (development)
3. Register the Flow with Cloud using Docker storage, passing the hostname:port of our Dask scheduler to Prefect's RemoteEnvironment, then run Flows with Cloud (production)
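[Editor's aside] The common thread in these three modes is that the Flow definition stays the same and only the executor/storage configuration changes. A toy, stdlib-only sketch of that pattern — a hypothetical stand-in, not the actual Prefect `DaskExecutor`/`RemoteEnvironment` API:

```python
from concurrent.futures import Executor, ThreadPoolExecutor

def flow(executor: Executor) -> list[int]:
    # The "flow" is defined once; the executor decides where tasks run.
    futures = [executor.submit(pow, 2, n) for n in range(4)]
    return [f.result() for f in futures]

# Mode 1 (development): a local executor, standing in for pointing
# Prefect at a Dask scheduler's hostname:port.
with ThreadPoolExecutor(max_workers=1) as local:
    dev_results = flow(local)

# Modes 2/3: the same flow object with a different execution backend;
# here we just swap in a wider pool to make the point.
with ThreadPoolExecutor(max_workers=4) as cluster:
    prod_results = flow(cluster)

print(dev_results == prod_results)  # same flow, same results, different backend
```

The design point being illustrated: because the flow is decoupled from its executor, "promoting" a flow from development to production is a configuration change, not a code change.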
We set up our Dask cluster using Kubernetes (and the dask-kubernetes project) on AWS EKS. We also run JupyterLab in the same k8s environment. That has worked very well for us (e.g. we use cluster-autoscaler to scale nodes for Dask workers up and down dynamically).
Preston Marshall
02/10/2020, 2:44 PM
Joe Schmid
02/10/2020, 2:49 PM
Preston Marshall
02/10/2020, 3:03 PM
Braun Reyes
02/10/2020, 3:42 PM
Thomas La Piana
02/10/2020, 3:52 PM
Joe Schmid
02/10/2020, 4:11 PM
"if i have to pay for things that airflow already provides"
I totally get where you're coming from. While there's a lot of overlap on the surface, I think Airflow & Prefect are a bit apples & oranges. To be specific, we viewed Airflow as far less of a fit for data science, where fast iterations during model development are critical: the Airflow scheduler typically has pretty high inter-task latency and doesn't easily support dynamic flow concepts like mapping. While there are things you can do to try to address that, being able to have very fast execution and very easy scalability (train/test in parallel), and then easily put the same Flow into production, was a huge advantage. For us, I wanted to make sure we had a tool that did the most important job well (speed & scalability for data science) and could then address other aspects.
(BTW, definitely not saying Prefect is for sure the right choice for your scenario, just trying to share our perspective and thoughts. It's interesting to talk to someone considering Prefect for data science, since I think the majority of Prefect users are focused on DE/ETL. I actually think Prefect for DS is going to be huge, and once people start to understand the advantages there will be a lot of adoption of it for DS.)
Braun Reyes
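[Editor's aside] On the "dynamic flow concepts like mapping" point above: a mapped task fans out over a list of inputs at runtime, running one task per element in parallel (e.g. training one model per hyperparameter value). A toy stdlib stand-in for that fan-out — not Prefect's actual `.map()` API, and `train` here is a made-up placeholder:

```python
from concurrent.futures import ThreadPoolExecutor

def train(n_estimators: int) -> int:
    # Stand-in for a model-training task; pretend the score derives from the input.
    return n_estimators * 2

grid = [10, 50, 100]  # hypothetical hyperparameter values to fan out over
with ThreadPoolExecutor() as pool:
    # Each grid element becomes its own parallel "task run", the way a
    # mapped task fans out over its inputs at runtime.
    scores = list(pool.map(train, grid))

print(scores)
```

Because the fan-out happens at runtime, the number of parallel runs can depend on data the flow only sees when it executes — the dynamism Joe says a static DAG scheduler handles less easily.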
02/10/2020, 4:39 PM
Thomas La Piana
02/10/2020, 4:52 PM
We use dbt for our model transformations, and it also has its own scheduler/DAGs; Airflow is just responsible for kicking it off.
We use the AzureContainerInstances operator, so I'm already used to having Airflow jobs run remotely and pass the logs back up; I'd think this would be just as possible with Prefect.
Just curious, what other DS workflow tools did you look at? Prefect came to mind because I evaluated it a year or so ago, but I don't have much experience with other DS workflow tools.
Joe Schmid
02/10/2020, 4:58 PM
Thomas La Piana
02/10/2020, 5:00 PM
Chris White
02/11/2020, 4:24 PM
Marvin
02/11/2020, 4:25 PM