
Feliks Krawczyk

08/27/2019, 5:26 AM
Hey there, was wondering if someone who knows about Prefect can let me know if Prefect would actually work in my particular use-case. Currently I’m using Airflow, but I’m convinced its scheduler is broken in terms of what I want to do, and I don’t believe I’d ever have the time to try to fix it.

Context:
- I have a service where I allow people to submit a bunch of parameters that I convert into Airflow DAGs.
- Essentially every DAG is the same, as it’s all templated, and they aren’t particularly complicated. 95% of them are 3 operators: Start (send a message) -> Send a spark-submit job to EMR or Databricks -> Finish (send a Success / Fail message). In more complicated cases I let people chain multiple spark-submit jobs, but like I said it’s all templated.
- I have ~2000 DAGs, and it’s likely to grow; my service is “self-service”, so I let people create / delete / edit as much as they please. So I let DAGs be mutable.

Airflow has done a pretty good job so far, but it’s creaking and adding hours of overhead to scheduled DAGs. It’s got plenty of provisioned resources, but most of them sit idle instead of actually processing. For clarity: I have provisioned it to run up to 128 tasks at once over 8 workers (8 workers × 16 parallel tasks). Yet it will only ever reach that peak when it’s got over 200 DAGs scheduled (think 00 UTC). However, when I have 70 DAGs scheduled to run, it’ll crawl through maybe 8-15 at a time instead of using all the resources, so my DAG queue persists forever instead of depleting. It makes backfills incredibly slow: for someone to backfill something over a year’s time at daily intervals will take over a week. What I would have expected is that Airflow would use all the resources at its disposal, but alas it does not, and I’m frustrated with it.

In terms of the architecture, I have 8 worker nodes and 1 node for the scheduler and webserver. I use Celery and Redis as my message queue and broker. Currently I use EFS to keep DAGs in sync across all workers.

My question is: is Prefect ready for a use-case like mine? I need:
- A scheduler / workers that will run at the capacity they are allotted
- Something that can handle 1000s of DAGs
- Easy mutability of DAGs, i.e. changing schedules / adding and deleting tasks

I don’t need all of Airflow’s bells and whistles. My operators are currently extensions of the BashOperator / DatabricksOperator and some simple Python functions. Happy to answer questions. I am really keen to PoC (proof of concept) Prefect, but I want to know if it’s ready for me.

Jeremiah

08/27/2019, 1:22 PM
Hi @Feliks Krawczyk - thanks for writing! I’m sure Prefect can help you, as this is a common case of growing out of Airflow. We would slightly tailor your approach to take advantage of some of Prefect’s features, but nothing in this sounds daunting. We might suggest that instead of having thousands of near-identical flows, you have a single flow with a parameterized input (have a look at `prefect.Parameter`). This may keep management sane.
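As a rough illustration of that idea, here is a minimal sketch of a single parameterized flow matching the Start -> spark-submit -> Finish shape described above; the task names and the spark-submit stub are illustrative assumptions, not working integrations:
```python
from prefect import Flow, Parameter, task

@task
def notify_start(job_name):
    # stand-in for the "Start (send a message)" step
    print(f"starting {job_name}")

@task
def submit_spark_job(job_name, cluster):
    # stand-in for the spark-submit / Databricks call
    print(f"submitting {job_name} to {cluster}")

@task
def notify_finish(job_name):
    # stand-in for the "Finish (send a Success / Fail message)" step
    print(f"finished {job_name}")

with Flow("templated-pipeline") as flow:
    job_name = Parameter("job_name")
    cluster = Parameter("cluster", default="emr")
    start = notify_start(job_name)
    job = submit_spark_job(job_name, cluster, upstream_tasks=[start])
    notify_finish(job_name, upstream_tasks=[job])

# one flow definition serves many "DAGs" by varying parameters:
flow.run(parameters={"job_name": "daily-delta", "cluster": "databricks"})
```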
I do have a question about what you mean by “mutable” DAGs, however. If you mean uploading new flows, then Prefect Cloud has a streamlined versioning mechanism that simply promotes a new flow over an old version in a non-destructive manner. If however you mean actually modifying flows without redeploying them as new versions, Prefect doesn’t allow that.
Prefect + Dask will solve your resource issue (we see Dask clusters that handle 1000s of tasks simultaneously), but to be honest it sounds like you have much more of an orchestration challenge than a workflow code challenge. We would invite you to take a look at Prefect Cloud, which tackles this issue directly. What I mean by that is it doesn’t sound like you have issues actually running code; it sounds like you have issues coordinating thousands of runs and getting them to start!
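For reference, a minimal sketch of handing a flow to a Dask cluster with the 0.x `DaskExecutor`; the scheduler address below is a placeholder:
```python
from prefect import Flow, task
from prefect.engine.executors import DaskExecutor

@task
def double(n):
    return n * 2

with Flow("dask-demo") as flow:
    # mapping fans out one task run per input element, each of
    # which Dask can execute in parallel across the cluster
    results = double.map(list(range(100)))

# the address is a placeholder; omitting it spins up a
# temporary local Dask cluster instead
flow.run(executor=DaskExecutor(address="tcp://dask-scheduler:8786"))
```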

Feliks Krawczyk

08/28/2019, 12:18 AM
Hey @Jeremiah, thanks for responding. So by mutable DAGs I mean the following: I allow people to add / delete “steps” within DAGs. For example someone can do this:
[screenshot: the default template DAG]
Which is my default template. To then be able to do this:
[screenshot: the same DAG with additional steps chained on]
As many times as they want, effectively creating new versions of the DAGs, as I just overwrite the DAG files.
We might suggest that instead of having thousands of near-identical flows, you have a single flow with a parameterized input (have a look at prefect.Parameter). This may keep management sane.
I’m not quite sure how this would work exactly? Each DAG I create has its own schedule, and we heavily utilise Airflow parameters. Although the DAGs themselves are “almost” identical in flow, the metadata within them is completely different (i.e. schedules / number of steps etc). We also heavily utilise the “clear” functionality in Airflow to re-run days which fail due to upstream issues.

For more context, what my actual service does is: it materialises people’s SQL into tables within our data lake. So instead of people querying massive raw tables for their reports (which isn’t scalable), we ask them to submit SQL that extracts a delta (usually daily) and appends it to their own tables that only contain the subset of data that they actually want. They then query these smaller tables.

Chris White

08/28/2019, 1:12 AM
Hi @Feliks Krawczyk - just wanted to chime in here with a few observations:
- It sounds fundamentally like you have a problem of scale, which is certainly a common one with Airflow. Prefect Cloud is our workflow orchestration platform and could easily handle the scale you describe, with essentially no latency between individual task runs within a flow.
- It also sounds like you’ve exposed a sort of DSL / “recipe” cookbook to your users for creating Airflow DAGs; it would be incredibly simple to convert such a pattern to creating Prefect flows, so there’s no concern there either -> Prefect is capable of expressing much more complicated and dynamic task relationships than Airflow.
- Independently of Cloud, from your description it does sound like you could benefit from Prefect Parameters (a parameter for the SQL query + some table name / connection parameters?), but this is more of a convenience for you than any sort of requirement (see the sketch below).

And just in case you haven’t seen it, we have a fairly extensive (although not exhaustive) writeup on some major differences between Prefect and Airflow that you might enjoy reading: https://medium.com/the-prefect-blog/why-not-airflow-4cfa423299c4 One topic this post does not address is that of Airflow workers, but Prefect does not require the deployment of always-on workers (they can be dynamically spawned for each workflow run, for example).
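A sketch of what those Parameters could look like for the SQL materialization service described above; `materialize` and the example SQL / table names are hypothetical stand-ins:
```python
from prefect import Flow, Parameter, task

@task
def materialize(sql, target_table):
    # hypothetical stand-in: submit the user's SQL (e.g. as a
    # spark job) and append the resulting delta to their table
    print(f"running {sql!r} into {target_table}")

with Flow("materialize-sql") as flow:
    sql = Parameter("sql")
    target_table = Parameter("target_table")
    materialize(sql, target_table)

# each user submission becomes a parameterized run of the same flow
flow.run(parameters={
    "sql": "select * from raw_events where day = current_date()",
    "target_table": "team_x.daily_events",
})
```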

Feliks Krawczyk

08/28/2019, 1:20 AM
Interesting… thanks! Yep, I have read that blog, which is why I’ve heard about Prefect. But yeah, I want to essentially create the Prefect version of some “simple” DAGs (changing whatever I need to make it work) and just plug my service into that… and it should all just work. But I was also wanting to know if Prefect can handle the things that Airflow does well:
- Macro parameter injection, i.e.
select * from data where day = {%Y-%m-%d}
- Ease of re-running failures (clearing tasks and things kick off again)
I think you’ve given me enough to at least try a proof of concept.

Chris White

08/28/2019, 1:25 AM
Awesome! Just to help you get started:
- Macros: we definitely support templating (for example, check out this rewrite of the Airflow tutorial DAG which uses templating: https://docs.prefect.io/guide/examples/airflow_tutorial_dag.html). Some variables that are always present in context for templating can be found here: https://docs.prefect.io/api/unreleased/utilities/context.html. And note that because Prefect tasks can exchange data, there’s essentially no limit to what you can template.
- Ease of re-running failures: while this is certainly possible with our open source library, you might find it a little finicky. This is a feature that is much easier to handle with a full orchestration / persistence layer (e.g., Prefect Cloud), but you can cross that bridge when you experience it.

Let us know if you have any other questions, always happy to help!
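To make the macro point concrete, here is a minimal sketch of the `select * from data where day = {%Y-%m-%d}` pattern using values Prefect puts in context during a run; ordinary Python string formatting stands in for Airflow’s Jinja macros:
```python
import prefect
from prefect import Flow, task

@task
def build_query():
    # prefect.context carries date values such as "today"
    # (a YYYY-MM-DD string) during each run
    return "select * from data where day = '{}'".format(
        prefect.context.get("today")
    )

@task
def run_query(query):
    # placeholder for actually submitting the query
    print(f"would run: {query}")

with Flow("templated-query") as flow:
    run_query(build_query())

flow.run()
```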