# prefect-community
a
(selecting prefect) Hiya, I have a general question - thanks for considering it. I'm trying to explain the case that the orchestration capability for our very large (500 source system) data lake shouldn't just rush to bake in airflow. I think we should have an event bus, and hang airflow off the back of the bus so that we can evolve and swap/add tools (to prefect or others) over the life of the platform. Are you aware of limitations in airflow or strengths in prefect that I could use as examples to warn my team as we try many invocations/second of a smallish number of dynamic dags? Thanks!
n
Hi @Alister Lee and welcome! @Chris White wrote a terrific blog post called Why Not Airflow that outlines many of the basic comparisons; I'd recommend to start there! If you have any other questions we're happy to answer them 😄
a
I have the why not airflow article, but I'm hoping for specifics directly relevant to us. Unreasonable I know.
Hoping someone is running massive datalakes on prefect!
n
Ah ok! Hopefully someone else can chime in if they have a similar setup re: datalakes on Prefect 😄
a
Thanks for looking!
j
Hi @Alister Lee, I don't have a good sense for the specifics of your use case, but as long-time Airflow users (4 years) and long-ish time Prefect users (1 year) I can give some perspective on our use. We initially used Prefect for data science where our requirements were roughly:
• High parallelism & easy scaling (run lots of machine learning experiments quickly)
• Very fast execution (inter-task latency of milliseconds)
• Support out-of-core data sizes (> RAM on 1 machine)
• Workflow semantics (multi-step pipeline for ML)
• Easy ML for data scientists familiar with Python
Some of these (e.g. out of core data sizes, easy ML in Python) map more to Dask than Prefect per se, but since Prefect uses Dask for execution we viewed the combination as a big win. Using Prefect (& Dask) for data science has been very successful for us. Having gotten used to Prefect, I'd find it hard to go back to Airflow, even for data engineering. Specifically, I've gotten spoiled by the fast execution, easy scaling on Dask, simple but powerful API, very CICD-friendly deployment, etc. If I can answer more questions or if you want to go into more detail on your use cases and requirements, feel free to post here or DM me.
j
Hi @Alister Lee, one thing you may want to explore is the pattern for actually kicking off executions. For example, as soon as you say “invocations per second”, Airflow may already not be the right tool. When I was building Airflow, I considered a DAG “fast” if it ran once per hour; every now and then I saw production DAGs that ran every 15 minutes. Maybe things have improved but IIRC it takes the scheduler 10 seconds to queue each task, so I don’t think multiple executions per second is feasible. Furthermore, executing an Airflow run off-schedule is hairy; these runs have a special status (I believe they’re called “externally triggered”) and are slightly less-than-first-class in terms of how you can work with them. Airflow was really built for (relatively) slow batch processes on a fixed schedule. Prefect, in contrast, was designed to allow flow execution whenever you want, as often as you want, and for any reason. The system is primarily limited by the resources you make available to it and its own overhead (in terms of data serialization and network). To be clear, Prefect’s primary goal is not sheer speed (though we’ve got to be one of the faster workflow systems around), but we do love when things go quickly 🙂
j
Also, from your mention of
"many invocations/second of a smallish number of dynamic dags"
that sounds like maybe event/stream processing. There are a number of Prefect users interested in this topic. (See this proposal: https://docs.prefect.io/core/PINs/PIN-14-Listener-Flows-2.html) Follow-up questions: is it important that each event get processed individually or could you do them in batches? Are there timing requirements, e.g. does an event have to be processed within X amount of time? While Prefect is not currently focused specifically on event/stream processing, I think you're far more likely to be successful processing small batches of events with Prefect than you would be with Airflow, given its execution speed. (Airflow DAGs tend to have fewer large, longer-running tasks while Prefect & Dask do very well with lots of small, fast tasks.)
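To make the "small batches of events" idea concrete, here is a minimal sketch in plain Python, not Prefect's API. The names `micro_batches` and `process_batch` and the event shape are hypothetical, chosen only for illustration; in practice each batch would be handed to one workflow run, and a time-based flush would be needed if events must be processed within a deadline.

```python
from typing import Iterable, Iterator, List


def micro_batches(events: Iterable[dict], max_size: int) -> Iterator[List[dict]]:
    """Group an event stream into lists of at most max_size events."""
    batch: List[dict] = []
    for event in events:
        batch.append(event)
        if len(batch) >= max_size:
            yield batch
            batch = []
    if batch:  # flush the final, partially filled batch
        yield batch


def process_batch(batch: List[dict]) -> int:
    # Hypothetical stand-in for one workflow run over a small batch of events.
    return sum(event["value"] for event in batch)


# Ten made-up events, processed four at a time.
events = [{"id": i, "value": i} for i in range(10)]
results = [process_batch(batch) for batch in micro_batches(events, max_size=4)]
print(results)
```

The batch size trades latency for per-run overhead: smaller batches get each event processed sooner, larger ones amortise scheduling cost.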
a
That's helpful, thanks. I think the space I'm in is event-driven: "when this happens, then do this, this and this." This describes it: https://dzone.com/articles/event-driven-orchestration-an-effective-microservi
The attraction of Prefect is the help that the UI offers Support.
s
I'm on a project that is a tale of two data lakes, the first "starter" lake was built using Airflow only, and the second one is a dog's breakfast, um competition of ideas, of everything: Jenkins, Airflow, Prefect, All-the-AWS-stuff, and recently someone brought up the Google and Azure and those tools. The team-within-a-team that I'm on is the "Prefect guys", so I have some experience of defending its use in the projects. Our experience with Prefect started as local executors in notebooks, those running locally as ETL flows. These turned into papermill jobs and distributed individually to a pile of other machines. I continue to be impressed that Prefect continues to be useful even in situations of infrastructure duress: minimal clustering, no Dask, no k8s. It is like running a big engine with only a couple of cylinders firing. Our groups are pathologically opposed to Kubernetes, so… I do what I can with what I have. Full disclosure, I'm a BPMN guy through and through, so your article reference tickled me because I think that's the way forward for the orchestration. That said, I think I’ve convinced the right people to bring in NiFi so we have a path forward. Camunda has a product called Zeebe that's supposed to be for microservices, but they've started confusing their offerings and both of their products suffer because of it. I see some really interesting promise in the Prefect flow idea with BPMN, but your observation that something needs to orchestrate the Prefect flows is exactly where we are right now. We've created several meta-Prefect flows -- putting the T in ETL -- using Prefect itself. However, without Kafka, JMS, AMQP, SQL, or some kind of messaging bus we've struggled to keep it decoupled. Because Prefect can run jobs quickly we’ve created polling loops to check for messages but it’s a weak substitute for a proper event-based polling system.
As an aside, the Prefect folks have likely already thought of pretty much anything I write here (e.g., https://docs.prefect.io/core/PINs/PIN-14-Listener-Flows-2.html). You can get a long, long way into Prefect without leaning fully into Dask. And that's biting us as the number of our flows has increased greatly. And Dask can be run with Docker or Podman, but TLS, scaling, and authz are all greatly improved by using k8s. I think about it sometimes as looking up I need to see something like BPMN or data-flow, looking down I should see an ocean of Dask. But I don’t have either so the world feels pretty confined sometimes. Especially as we have a couple of hundred flows – not 500 yet! I mentioned Kubernetes up top and today (weakly held strong opinion!) my gut tells me a stack-of-success for most of our pipelining looks like Prefect-Dask-K8S and NiFi. And I'm no genius here, the Prefect folks would have told you that: hence they have a cloud offering where they take care of all that. We were very close to getting on that but the current economic conditions have forced us to tighten up.
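For contrast with the polling-loop workaround described above, here is a rough sketch of a consumer that blocks until a message arrives instead of sleeping and re-checking. It assumes an in-process `queue.Queue` as a stand-in for a real broker such as Kafka or AMQP, and `run_flow` is a hypothetical placeholder for starting a flow run; none of this is Prefect's API.

```python
import queue
import threading
from typing import List


def run_flow(message: str) -> str:
    # Hypothetical placeholder for kicking off a flow run for this message.
    return f"processed {message}"


def consume(bus: queue.Queue, results: List[str]) -> None:
    """React to each message as it arrives; there is no poll interval."""
    while True:
        message = bus.get()  # blocks until a producer puts something
        if message is None:  # sentinel value: shut the consumer down
            break
        results.append(run_flow(message))


bus: queue.Queue = queue.Queue()
results: List[str] = []
worker = threading.Thread(target=consume, args=(bus, results))
worker.start()

# The producer side only knows about the bus, not about the consumer.
for name in ["file-landed", "schema-changed"]:
    bus.put(name)
bus.put(None)
worker.join()
print(results)
```

The producer and consumer share nothing but the queue, which is the decoupling a messaging bus would provide across processes and machines.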
a
Wow, thanks very much for the detail. I fear we are going to get the competition of ideas, but I fear more that we will have to restart the whole platform like your first because the coupled web of airflow dags will become unmanageable with no boundaries. I'm heartened by your comment about needing to orchestrate the prefect flows - I think that means kicking them off in the presence of combinations of events. @Steve Taylor Are you in Australia perchance?
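One way to picture "kicking them off in the presence of combinations of events" is a small trigger that waits until every required event type has arrived for a given key. This is only a sketch in plain Python: the class `CombinationTrigger`, the event names, and the source keys are all hypothetical, and in a real setup the callback would start a flow run rather than append to a list.

```python
from typing import Callable, Dict, FrozenSet, List, Set


class CombinationTrigger:
    """Fire a callback once every required event type has been seen for a key."""

    def __init__(self, required: Set[str], on_ready: Callable[[str], None]) -> None:
        self.required: FrozenSet[str] = frozenset(required)
        self.on_ready = on_ready
        self.seen: Dict[str, Set[str]] = {}

    def handle(self, key: str, event_type: str) -> None:
        if event_type not in self.required:
            return  # ignore events this trigger does not care about
        arrived = self.seen.setdefault(key, set())
        arrived.add(event_type)
        if arrived == self.required:
            del self.seen[key]  # reset so the next complete set fires again
            self.on_ready(key)


started: List[str] = []  # records which keys would have had a flow started
trigger = CombinationTrigger({"extract_done", "schema_validated"}, started.append)
trigger.handle("source_a", "extract_done")
trigger.handle("source_b", "extract_done")
trigger.handle("source_a", "schema_validated")
print(started)
```

Here only `source_a` has received both events, so it is the only key that fires; `source_b` stays pending until its second event arrives.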
s
Glad I could help. I'm in the US, Colorado specifically, but also Canadian, so I go back and forth north-south more than east-west. 😄
a
Are you at a bank? Are you interested in talking to some architects and designers about your experience? I'll DM you if you have the appetite. Thanks again.