Cole Murray

    Cole Murray

    4 months ago
    Hi All 👋, Relatively new to Prefect and looking into it for running my team's data operations. We have previous experience with Celery for running our distributed tasks, and are looking to upgrade to either Airflow or Prefect. With Airflow, we found several undesirable aspects & complexity which has caused us to investigate Prefect. Our workload: • 1000s of tasks • Lowest granularity is every 5 mins • Primarily python-based code • Homogenous dependencies, but may change in the future as we move to more ML based workflows. I have read through the prefect docs, and have some questions about how Prefect actually schedules tasks (in a non-prefect cloud context). From the docs, one criticism of Airflow is:
    • the centralized nature of the Airflow scheduler provides a single point of failure for the system
    In a typical master/worker deployment, you have a single-point-of-failure schedule (yes you can run it highly available with locking / other mechanisms) responsible for reading from a schedule DB, to invoke tasks into a queue to be processed by workers. Can someone clarify how Prefect server solves this issue? From the docs, it seems Prefect server is also a SPOF in this architecture. Based on code here: https://github.com/PrefectHQ/server/blob/master/src/prefect_server/services/towel/scheduler.py#L21, we would not be able to run several instances of the server simultaneously, as there is no locking taking place against the DB, and would cause double execution
    Anna Geller

    Anna Geller

    4 months ago
    Sounds like a great use case for Prefect 2.0! If you're unsure which version to choose, this page may help.
    • Homogenous dependencies, but may change in the future as we move to more ML based workflows.
    given this use case, you may use SubprocessFlowRunner with a virtual environment
    Can someone clarify how Prefect server solves this issue?
    Sure, we can. In the default Prefect Server configuration, the scheduler service can be seen as SPOF, but that's not true when you use Prefect Cloud or when you scale Server to be distributed - usually, when you need that level of reliability and scale, you could opt for Prefect Cloud. The same is valid for Prefect 2.0. Feel free to ask more questions, if my answer hasn't addressed your concerns about SPOF
    btw given you asked the same question both here and in #prefect-server, I'll remove the one asked in Server channel. For the future, it's enough to post only via a single channel, we'll get back to you when we are available
    Cole Murray

    Cole Murray

    4 months ago
    Thank you for the follow-up Anna. I've read through the Orion code-base (branch/orion) and have a few additional questions. Based on the Orion scheduler execution, it looks like it would run into issues if I were to run it in a multi-process / multi-server distributed environment. Specifically, with how scheduled flow runs are being executed: Ref here: https://github.com/PrefectHQ/prefect/blob/orion/src/prefect/orion/services/scheduler.py#L67 From what I've read in the code, the scheduler executes as follows: • While not all deployments are processed • Page through deployments in BATCH_SIZE • For deployments that should be scheduled, generate scheduled flow runs for each deployment • Insert the deployment runs Based on how the scheduler is pulling the deployments, here, if we had multiple servers/processes running this, we'd be duplicating work on the keys, as each instance would be doing table scans and paging by deploymentId. Do we have documentation on running Orion in a distributed environment? Based on the above, I think we would not want a multi-server/process setup, and to use something like Zookeeper's distributed lock to ensure only one is running at a time.
    Anna Geller

    Anna Geller

    4 months ago
    We currently have documentation showing how you can deploy Prefect 2.0 to a Kubernetes cluster. The command:
    prefect orion kubernetes-manifest
    will give you a Kubernetes Deployment manifest specifying everything you need to deploy Orion components to Kubernetes. You may then use Kubernetes-native features to ensure that all services run reliably in a fault-tolerant way. If you want to use Zookeeper, I'm afraid that at the moment, this is out of scope for Prefect 2.0, but you can give it a try and perhaps even contribute some recipe or blog post once you figure out a good setup? Fwiw, I can reassure you that scaling Orion is much easier than e.g. Prefect 1.0 because, under the hood, Orion is comprised of REST API services which are easier to scale and load balance than, e.g. GraphQL. Does this information help? Our focus is currently on building the most critical Prefect 2.0 features in order for 2.0 to come out of beta, but we may revisit distributed setting at a later time. If you are interested, I could open an issue so that you could keep track.
    Since you asked about the scheduling flow runs from deployments, we use idempotency keys to avoid duplicating work. We rely on idempotency keys to prevent duplicate inserts even with a single scheduler service. Again, just to reassure you: the scheduler is designed to scale horizontally, so it will be possible to deploy it in a way that there is no SPOF in your architecture.
    Michael Adkins

    Michael Adkins

    4 months ago
    I’m pretty sure we scale the scheduler in Prefect Server horizontally in Prefect Cloud without any issues, I can check if we have any custom code for that though. We also intend to scale the scheduler service horizontally in Cloud v2 so if there is a problem here it will be addressed.
    After looking a bit closer, we use idempotency keys to avoid duplicate runs.