👂 Prefect VS Databricks, opinions? 👂
# ask-community
p
👂 Prefect VS Databricks, opinions? 👂 On one of my client’s projects I implemented all ETL pipelines using Prefect. I chose that technology because:
• there are a lot of external dependencies (mostly self-developed API libs)
• most of the business logic is complex and needs to be abstracted, which would be hard to achieve in a notebook/cloud environment
My client now wants me to migrate everything to Databricks. I am trying to explain that the two tools do not serve the same purpose and that, in fact, both Airflow and Prefect have integrations with Databricks, but they don’t really get the point. They don’t have a tech background and their main argument is “harmonizing” the stack, because they have had 2-3 notebooks running in Databricks for a few months now (of very poor code quality, btw). I am trying to justify my stack choice but it is difficult to explain in layman’s terms - what is your take on this?
PS: also happy to hear any arguments that would go in favour of choosing Databricks over Prefect.
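For context, the Prefect-Databricks integration I mention looks roughly like this on the Prefect side - just a sketch assuming Prefect 1.x’s Databricks task library (the workspace host, token, cluster id and notebook path are placeholders):
```python
from prefect import Flow
from prefect.tasks.databricks import DatabricksSubmitRun

# Placeholder job spec: submit an existing notebook to an existing cluster.
notebook_run = DatabricksSubmitRun(
    json={
        "run_name": "example-notebook-run",
        "existing_cluster_id": "<cluster-id>",
        "notebook_task": {"notebook_path": "/Shared/example_notebook"},
    }
)

with Flow("databricks-from-prefect") as flow:
    # Connection details would normally come from a Prefect Secret.
    notebook_run(databricks_conn_secret={"host": "<workspace-url>", "token": "<token>"})

# flow.register(project_name="etl")  # or flow.run() for a local test
```
So the Databricks notebook becomes just one task inside a Prefect flow, next to everything else.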
t
IMO Databricks is first and foremost a Spark platform, and if it is to be used at all, then it should be used solely for whatever Spark workflows you have. It is super expensive, and if you are not using Spark, why pay 3 times the price for an ETL that doesn't use Spark?
s
Databricks has been developing its orchestration tool for a while now and released the first version to the public a few months ago. From what I could see, it is still very simplistic and very far from what Prefect offers. With that said, I can understand the appeal of having only one tool for all your ML and DE needs.
t
As an aside, as a platform it comes with lots of issues. It is buggy, the support is poor and it is very opinionated about how it expects people to use the platform.
upvote 1
There are also other ways to run notebooks. I thought Prefect actually had that as an integration, but I have not investigated it. (Personally, I would not run production code in a notebook anyway.)
Moaning about Databricks aside, it should usually be considered part of your platform and not the platform in itself. The Databricks solution architects I have spoken to agree with this point as well. This is why it is offered as part of a wider platform, such as Azure or AWS.
p
@Sylvain Hazard yeah I saw that, thanks for the insights. I was wondering if the orchestration tool holds up against Prefect (one of its features I like the most).
@Thomas Furmston thanks for the insights.
• Not a Spark expert, but do you have to explicitly “declare” in your notebooks that you want to run things on Spark clusters? I thought that would happen automagically.
• Can you expand on your last point (platform)? I think I get your point, but it would be interesting to know what you mean exactly (using other services on the cloud provider’s infrastructure, I assume).
t
All Databricks clusters are configured to run Spark. So when you define a cluster, you need to set the number of workers, auto-scaling, etc., which is configuration for your Spark cluster. You can now define a "single node cluster" for non-Spark jobs, but the VMs are still super expensive.
With regard to the last point: in Azure, for example, you would have Databricks as part of your whole Azure infrastructure. For example, you could run non-Spark jobs on k8s and the Spark ones on Databricks.
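Roughly, a cluster definition looks like this - a sketch based on my reading of the Databricks clusters API (the field values are examples, worth double-checking against the docs):
```python
# Sketch of a Databricks cluster spec, e.g. as passed to the Jobs/Clusters API.
autoscaling_cluster = {
    "spark_version": "9.1.x-scala2.12",   # every cluster runs a Spark runtime
    "node_type_id": "Standard_DS3_v2",    # Azure VM type; priced per node
    "autoscale": {"min_workers": 2, "max_workers": 8},
}

# "Single node" variant for non-Spark jobs: driver only, zero workers.
single_node_cluster = {
    "spark_version": "9.1.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 0,
    "spark_conf": {
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```
Either way you are still paying for Databricks-flavoured VMs, which is the cost point I was making.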
a
@Pierre Monico Both tools solve very different problems. Databricks is primarily a tool to run data transformations in Spark, even though, as Sylvain mentioned, they try to expand their range by providing a data lakehouse and some drag-and-drop workflow orchestration. You could compare Databricks more to Dask or even dbt, rather than Prefect or Airflow. Prefect is not an ETL tool, but it can govern your ETL, provide visibility into the states of individual tasks and make sure your ETL runs reliably, whether on schedule, locally, or triggered ad hoc via the UI or API. The best way I can think of to persuade your client: once you have your Databricks notebooks, ask your client to answer these questions:
• How do you run those jobs on schedule? You would inherently add another tool to the architecture just to schedule those jobs, even if it’s just CRON.
• How do you know those Databricks jobs ran successfully? Do you manually track this?
• How do you set up notifications on success or failure?
• How do you manage data dependencies, e.g. when those Databricks jobs can only run once some data has actually arrived in your data lake or data warehouse? Do you run your extract-load jobs at 1 AM, and the Databricks transformations that rely on that data at 2 AM? I think I don’t need to explain how unreliable this would be.
All of the examples above are negative engineering that Prefect solves, but Databricks (afaik) doesn’t.
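To make the scheduling, notification and data-dependency points concrete, here is roughly how it looks in Prefect - a minimal sketch assuming Prefect 1.x (the cron expression, task bodies and the Slack webhook secret are placeholders):
```python
from datetime import timedelta

from prefect import task, Flow
from prefect.schedules import CronSchedule
from prefect.engine.state import Failed
from prefect.utilities.notifications import slack_notifier

# Notify (via the SLACK_WEBHOOK_URL secret) only when something fails.
handler = slack_notifier(only_states=[Failed])

@task(max_retries=3, retry_delay=timedelta(minutes=5), state_handlers=[handler])
def extract():
    return [1, 2, 3]  # placeholder for the real extract-load step

@task(state_handlers=[handler])
def transform(rows):
    return [r * 10 for r in rows]

@task(state_handlers=[handler])
def load(rows):
    print(f"loaded {len(rows)} rows")

# Runs every day at 02:00; the dependency extract -> transform -> load is
# expressed directly in code rather than by staggering start times.
with Flow("etl", schedule=CronSchedule("0 2 * * *")) as flow:
    load(transform(extract()))
```
The point is that the schedule, retries, notifications and dependencies all live in the same place as the pipeline itself.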
🙌 1
s
To be fair, the Databricks platform (at least on Azure, which is what I used to work on) answers the first three points you bring up. There is a Jobs tool made for scheduling and tracking job executions, and you can set those up to send an email on success/failure/skip. It is mainly made to work with DB notebooks and is somewhat cumbersome to get working with, say, Python scripts though. It also does not have as many integrations with other tools as Prefect does (e.g. Slack messages, emailing specific data, etc.). The fourth point is actually the most important imo. DB, through Spark, allows you to integrate streaming into your workflows, which is a really interesting way to orchestrate tasks but comes with a lot of added cost and complexity.
🙌 1
a
Thanks @Sylvain Hazard. I really wanted to just highlight the point that both of those tools solve different problems and those bullet points were just examples. I could go on listing things that Prefect can do but Databricks cannot 🙂 but the point is that using Databricks leaves plenty of negative engineering issues unanswered. You can then either waste your engineering time building workarounds or you can solve those problems using the right tool for the job (i.e. Prefect).
s
I definitely agree with your points. Having worked with both, I find Prefect by far easier to work with and smoother to integrate with other tools - something Databricks neither does nor aims to do.
👍 2
p
Thanks @Thomas Furmston
👍 1
t
np
p
@Anna Geller thanks for the hints!
• You can set schedules in Databricks, no need for CRON according to their docs
• Isn’t the Databricks job result UI kind of the same as Prefect’s?
• Email alerts seem to be available in Databricks (and an API, according to their docs)
• The docs mention jobs with multiple tasks, but I think this is not really DRY
I don’t know Databricks that well, but I just want to make sure all my arguments are actually valid 🙂 Some more arguments that go against Databricks for me are the difficulties handling complex (software) dependencies, the lack of good versioning, and handling staging/prod environments.
s
Your last point is very true. In my previous job, we used multiple Databricks workspaces to separate staging/prod workflows and data. Building the tooling around it was very complex and not without errors.
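For comparison, the staging/prod split on the Prefect side can stay entirely in code - a rough sketch (the environment variable, image tags, KubernetesRun config and project names here are made up):
```python
import os

from prefect import task, Flow
from prefect.run_configs import KubernetesRun

@task
def say_env(env):
    print(f"running against the {env} environment")

# Pick the target environment at registration time, e.g. from CI variables.
env = os.environ.get("DEPLOY_ENV", "staging")

with Flow("etl") as flow:
    say_env(env)

# Same flow code, different image tag and a separate project per environment.
flow.run_config = KubernetesRun(image=f"registry.example.com/etl:{env}")
flow.register(project_name=f"etl-{env}")
```
With multiple Databricks workspaces, we had to build and maintain that kind of tooling ourselves.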
a
@Pierre Monico I was really trying to avoid a detailed feature comparison 🙂 and those were only examples. I’m sure Databricks can do a lot, but to persuade your client I’m not sure that purely comparing features is the most helpful strategy. I would focus more on the problem that each tool solves and the developer experience. For example, if you have 20 tasks that each depend on each other in some way (some need to run sequentially one after the other, and some in parallel), and you have pipelines that need various (possibly conflicting) package dependencies to interact with 3rd party systems:
• How do you define dependencies between your tasks - do you want to click through a drag-and-drop tool to manually set this up 20 times? What if you have 100 tasks? In Prefect, you just define it in Python.
• How do you identify what the issue was when your job failed? In Prefect, your tasks can be very small and you have visibility into each of them. Databricks encourages larger tasks, and it’s more difficult to identify the root cause of your problem that way. In Prefect, you can get notified about the exact task that failed and see it clearly marked red in the UI. This is why it’s most helpful when things go wrong, because you have that visibility.
• What if you want to define more complex dependencies, e.g. a task that can only run if a specific condition is met and should be skipped otherwise, or all downstream tasks failing if any upstream task failed? Such complex dependencies are best defined as code; you can’t do that in a drag-and-drop tool or YAML.
• What if your pipeline A and pipeline B need different (conflicting) Python dependencies? I have no idea how this is done in Databricks. In Prefect, again, you can have it all defined programmatically. With Docker storage, Prefect can even build a Docker image for you and push it to a registry of your choice. (See the sketch below for the last two bullets.)
I would really focus more on the problem, the user experience and the audience the tool is for, rather than purely comparing features. Engineers I have worked with usually hate drag-and-drop and want reproducible workflows as code, and infrastructure as code. It depends on your team’s preference.
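A minimal sketch of those last two bullets as code, again assuming Prefect 1.x (the condition check, registry URL and pinned packages are placeholders):
```python
from prefect import task, Flow, case
from prefect.storage import Docker
from prefect.triggers import all_successful

@task
def new_data_arrived():
    return True  # placeholder for a real check against the data lake

@task
def transform():
    print("transforming...")

@task(trigger=all_successful)  # default trigger: fails if any upstream task failed
def publish():
    print("publishing...")

with Flow("conditional-etl") as flow:
    # transform/publish only run when the condition is met, otherwise they are skipped.
    with case(new_data_arrived(), True):
        publish(upstream_tasks=[transform()])

# Conflicting dependencies between pipelines: each flow can ship its own image.
flow.storage = Docker(
    registry_url="registry.example.com/flows",
    python_dependencies=["pandas==1.3.5", "requests==2.26.0"],
)
```
All of that is version-controlled Python, which is exactly the "workflows as code" point.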
p
Thank you so much @Anna Geller, really helpful! And thanks to the others as well.
👍 1