I'm on a project that is a tale of two data lakes: the first "starter" lake was built using Airflow only, and the second one is a dog's breakfast, um, competition of ideas, of everything: Jenkins, Airflow, Prefect, All-the-AWS-stuff, and recently someone brought up Google and Azure and their tools. The team-within-a-team that I'm on is the "Prefect guys", so I have some experience defending its use in the projects.
Our experience with Prefect started with local executors in notebooks, running locally as ETL flows. These turned into papermill jobs that were distributed individually to a pile of other machines. I continue to be impressed that Prefect stays useful even in situations of infrastructure duress: minimal clustering, no Dask, no k8s. It is like running a big engine with only a couple of cylinders firing. Our groups are pathologically opposed to Kubernetes, so... I do what I can with what I have.
Full disclosure, I'm a BPMN guy through and through, so your article reference tickled me because I think that's the way forward for orchestration. That said, I think I've convinced the right people to bring in NiFi so we have a path forward. Camunda has a product called Zeebe that's supposed to be for microservices, but they've started confusing their offerings and both of their products suffer because of it. I see some really interesting promise in the Prefect flow idea with BPMN, but your observation that something needs to orchestrate the Prefect flows is exactly where we are right now. We've created several meta-Prefect flows -- putting the T in ETL -- using Prefect itself. However, without Kafka, JMS, AMQP, SQL, or some kind of messaging bus we've struggled to keep it decoupled. Because Prefect can run jobs quickly we've created polling loops to check for messages, but it's a weak substitute for a proper event-driven system. As an aside, pretty much anything I write here the Prefect folks have likely already thought of a while ago, e.g.,
https://docs.prefect.io/core/PINs/PIN-14-Listener-Flows-2.html
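For what it's worth, the polling workaround looks roughly like the sketch below. It uses only the Python standard library and is not our actual code: the in-memory `queue.Queue` stands in for wherever messages really land, and `poll_for_messages`, `handle`, and the message shape are made-up names for illustration, not Prefect APIs. In practice `handle` would be whatever kicks off a flow run.

```python
import queue
import time
from typing import Callable, Optional


def poll_for_messages(
    inbox: "queue.Queue[dict]",
    handle: Callable[[dict], None],
    interval_s: float = 1.0,
    max_polls: Optional[int] = None,
) -> int:
    """Check `inbox` on a fixed interval and pass each message to `handle`.

    Hypothetical helper: `inbox` is a stand-in for a real message source.
    Returns the number of messages handled.
    """
    handled = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        # Drain everything currently waiting, without blocking.
        while True:
            try:
                message = inbox.get_nowait()
            except queue.Empty:
                break
            handle(message)
            handled += 1
        # Wait before checking again; this delay is the cost of polling.
        time.sleep(interval_s)
    return handled


# Tiny demo: one message waiting, two polls, no delay between polls.
inbox = queue.Queue()
inbox.put({"flow": "transform_orders"})  # made-up message shape
started = []
count = poll_for_messages(inbox, started.append, interval_s=0.0, max_polls=2)
```

The weakness is visible in the loop itself: a message can sit for up to `interval_s` before anyone notices it, and every poll costs something even when nothing has arrived, which is why a real bus that pushes events would be the better fit.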
You can get a long, long way into Prefect without leaning fully into Dask. And that's biting us now that the number of our flows has increased greatly. Dask can be run with Docker or Podman, but TLS, scaling, and authz are all greatly improved by using k8s. I think about it sometimes like this: looking up I need to see something like BPMN or data-flow; looking down I should see an ocean of Dask. But I don't have either, so the world feels pretty confined sometimes. Especially as we have a couple of hundred flows -- not 500 yet!
I mentioned Kubernetes up top, and today (weakly held strong opinion!) my gut tells me a stack-of-success for most of our pipelining looks like Prefect-Dask-K8S and NiFi. And I'm no genius here; the Prefect folks would have told you that, hence they have a cloud offering where they take care of all that. We were very close to getting on that, but the current economic conditions have forced us to tighten up.