Is there a way to set the max_workers and max_depl...
# ask-community
j
Is there a way to set the max_workers and max_deployed_flows for a single Agent? I can see that max_workers is hardcoded here. Would it be possible to add a max_workers argument to the base Agent class? I believe only this one line would need to change. In a number of cases, I'd like to limit the number of threads (workers) the Agent uses. As for max_deployed_flows, it looks like Agents grab all flow runs that are ready by default, shown here. It would be useful to add a GraphQL "limit" argument to this mutation so that one Agent isn't "hogging" flow runs when another (with equivalent labels) is open and ready for them. I'm not sure that's possible in a mutation, though, so it may be trickier to change.
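For illustration, here's roughly the kind of change I have in mind -- a purely hypothetical sketch, not the actual Prefect Agent internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch of the proposed argument; names and structure are
# illustrative only and do not reflect the real Prefect Agent class.
class Agent:
    def __init__(self, labels=None, max_workers: int = 4):
        self.labels = labels or []
        # Today this pool size is hardcoded; the proposal is simply to
        # expose it so a single Agent can be limited to fewer threads.
        self.executor = ThreadPoolExecutor(max_workers=max_workers)
```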
j
Can you expand on why you want to limit the number of threads on the agent? That pool is only used for starting flow runs and making network requests; it won't limit the number of active flow runs started by a single agent.
For the second issue, the limiting is currently handled server side (I believe the limit is 25 at the moment). In general, I recommend that users not rely on Prefect to distribute work evenly among agents (we're not a resource manager) and instead deploy on a backend that handles that for you.
j
My agent architecture is pretty atypical: I have agents spread out over numerous desktops as well as supercomputer nodes (so I'm submitting Agents via Slurm). In the Slurm case, I'd like to have many agents (up to 40 at the moment) that each run only one flow at a time. Ideally, Prefect runs on one thread and all the other threads of the Slurm job are used by tasks (I submit shell commands that use mpirun).
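For context, a rough sketch of the kind of flow these agents run -- the ShellTask wrapping mpirun is illustrative, and the command and thread counts are placeholders:

```python
from prefect import Flow
from prefect.tasks.shell import ShellTask

# Illustrative only: one mpirun-based shell command per flow, sized to
# leave a single thread free for Prefect itself on a 40-thread node.
run_mpi = ShellTask(name="run_mpi")

with Flow("mpi-simulation") as flow:
    run_mpi(command="mpirun -np 39 ./simulation.x input.yaml")
```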
From what you're saying, I should write a custom executor rather than have these special Agents. Is that right?
j
I wouldn't run a Prefect agent inside a Slurm job. Rather, I'd have the Prefect agent deploy flow runs as Slurm jobs: a single agent running on an edge node in your cluster kicks off a Slurm job for each flow run, and you rely on Slurm to handle the job queuing and resource management.
Or batch your tasks into larger flows, run them with a local agent (on an edge node), and use a `DaskExecutor` with `dask-jobqueue` to distribute the tasks throughout the cluster.
We currently don't have an HPC jobqueue agent, but that's not out of scope.
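For reference, a rough sketch of what the `DaskExecutor` + `dask-jobqueue` route could look like -- this assumes a Prefect 1.x-style API, and the cluster settings below are placeholders:

```python
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def run_case(case):
    ...  # placeholder for the per-case work

with Flow("hpc-batch") as flow:
    run_case.map(["case_a", "case_b", "case_c"])

# Spin up a dask-jobqueue SLURMCluster for each flow run; the cores,
# memory, queue, and adaptive-scaling values here are only examples.
flow.executor = DaskExecutor(
    cluster_class="dask_jobqueue.SLURMCluster",
    cluster_kwargs={"cores": 24, "memory": "64GB", "queue": "normal"},
    adapt_kwargs={"minimum": 1, "maximum": 10},
)
```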
j
That setup is actually what I'm trying to avoid haha. If I have two HPC clusters and a bunch of desktops running flows, one HPC cluster may be backed up and its Slurm jobs just sit, while the other HPC cluster is open and getting through jobs quickly --- so my Slurm jobs should be flow-run agnostic. They should instead grab the next available flow run as soon as the Slurm job starts.
j
Hmmm, I don't really have a good response for that right now. Prefect currently isn't designed to be a resource manager, so the distribution of flow runs across equivalent agents isn't guaranteed to be fair.
j
I think you're spot on with using a Dask cluster though -- the issue I have with dask-jobqueue is firewalls. On some university clusters, it's a hassle to get permission to open the required ports. If only their dask workers followed Prefect's hybrid approach 😂
j
If you're running the client on an edge node, you shouldn't have a firewall issue in my experience. You could always go the SSH tunnel route, but admins don't always like that either :/
If you're trying to distribute jobs across a large set of varied machines, you might rely on dask-ssh and Dask to do the resource management 🙂 https://docs.dask.org/en/latest/setup/ssh.html
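A minimal sketch of that idea, assuming Dask's SSHCluster and placeholder hostnames -- the first host acts as the scheduler and the rest as workers:

```python
from dask.distributed import SSHCluster
from prefect.executors import DaskExecutor

# Placeholder hostnames; in practice these would be the desktops and
# edge/login nodes you can reach over SSH.
cluster = SSHCluster(
    ["scheduler-host", "desktop-1", "desktop-2", "hpc-login-node"],
    worker_options={"nthreads": 4},
)

# Point Prefect's executor at the long-lived cluster's scheduler.
executor = DaskExecutor(address=cluster.scheduler_address)
```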
j
Yeah, the admins for each cluster follow different rules. Prefect's approach just bypasses the issue by keeping Agent communication one-directional (outbound only).
Thanks though! I've been trying to avoid Dask clusters but I'll take another stab at it.