The concept of Agent and Work Queues is really use...
# prefect-community
The concept of Agent and Work Queues is really useful. But as far as I understand, Work Queue's hold only flows, not tasks. What queue do tasks go into? Is there a way for a user to tap into that queue?
In my company we have a home brewed workflow orchestrator, which we'd like to ditch in favor of Prefect. But there's a but. Our orchestrator pushes tasks into SQS/RabbitMQ/etc and relies on Task Agents to pull tasks from there. This gives us fine control over task execution and resource utilization, which is important because Tasks, not flows, are the computationally expensive part. For example, we often execute Task Agents on Kubernetes pods and SLURM workers at the same time. As another example, we can make sure that GPU-enabled Task Agents pull GPU tasks, but if there's no GPU tasks available they will pull CPU tasks not to stay idle. Is there a way for us to achieve such fine-grained control with Prefect?
Hey @Sergiy Popovych Another user asked a similar question recently here,, with some links to different articles I think would be relevant to your use case
Thanks Mason! I've seen these resources, and unless I'm mistaken they don't address our use case. The discourse thread suggests starting an Agent on a CPU cluster and an Agent on a GPU cluster, I get that. But each agent will be responsible for the whole flow, right? If that's the case, this wouldn't let me split tasks from a single flow across a SLURM and Kubernetes cluster, or two different Kubernetes clusters, etc.
Hi Sergiy! Would subflows possibly fit your use case? You can call multiple subflows from inside a parent flow, with each subflow having its own task runner. Then, within each subflow, you could break things down into smaller steps using tasks. The docs explain it in a bit more depth:
🙌 1
👍 1
Are you suggesting using subflows to statically split my tasks into N groups, one for each cluster? It would be hard to keep good cluster utilization with such approach. I would have to split my tasks in exactly the right proportion. For example, we have some on premise GPU's and some cloud budget, and it's important to efficiently utilize both. We deal with computationally heavy jobs and can't afford to underutilize resources. Additional complication is that spot GPU availability varies, so it's impossible to split tasks in correct proportion a priori. The problem is that each subflow will be bound to a single TaskRunner, I don't think the currently available TaskRunners will let me utilize my clusters in the way that I want. What would work for me is something like
, which will just push task specs into a queue and wait for status reports on that/another queue. That's why I'm asking about how the individual tasks are queued and how hard it is to tap into that.
Let me take another shot at describing the situation. There are three clusters: a. On-premise GPUs managed by SLURM b. GCP Auto Scaling Group with GPU nodes c. GCP Auto Scaling Group with CPU nodes The job has a high number of both CPU and GPU tasks with complex dependencies. We need to achieve the following two constraints: 1. If there are GPU tasks available from any flows, we need both cluster (a) and cluster (b) to consume them. 2. If and only if there are no GPU tasks available, we need both cluster (a) and cluster (b) to consume any available CPU tasks. Statically splitting my tasks into subflows for cluster (a) and subflows for cluster (b) does not satisfy constraint (1), as it allows situations where cluster (a) has no tasks, and the whole system is waiting for cluster (b) to finish (assuming each subflow has more than 1 task). I also don't see a way to satisfy constraint (2). Or am I misunderstanding something, and there is a way to satisfy both constraints?
I think I could satisfy constraints by having a dedicated subflow for each individual task, but I'm not sure if that will be efficient. Overall, our jobs easily amass hundreds of thousands of tasks. But if that will work efficiently, I think I'll get what I need?
Thanks for the added explanation! I'm going to reach out to some of my colleagues to get you the best answer.
🙏 1
What queue do tasks go into? Is there a way for a user to tap into that queue?
@Sergiy Popovych You should treat tasks more as observability concepts and a way to execute work in various ways using task runners (Dask, Ray, concurrent execution with async) rather than a way to distribute work as distributed message queues do Prefect creates flow runs from deployments, each deployment has a work queue name associated with it - this is how Prefect knows which work queue should pick up runs from that deployment I highly recommend checking this Discourse topic that will clarify confusion If you need to spread the load of a large number of tasks for scale, I recommend trying out Dask and Ray task runners allowing you to leverage distributed compute clusters Similarly, you can use Kubernetes and various cloud vendor services for better resource allocation Generally speaking, in the community, we can point you to the right resources but giving personalized infrastructure would be hard without analyzing your infrastructure requirements in more detail. For such support, you can reach out to to get in touch with infrastructure experts; they can analyze what solution would work best for your use case.
👍 1
🔥 1