# prefect-server
m
Hello, I'm starting to test the Kubernetes Agent on Prefect 1.x. Our requirement is to run up to 1000 concurrent jobs. With the standard executor I'm seeing a memory requirement of about 200MB per job, so we'd struggle to reach 1000 concurrent jobs on our moderately sized K8s cluster. Would the Static Dask Cluster approach be a better option? Are there any trade-offs in terms of reliability compared to using Kubernetes jobs? Our workload is expected to be a mixture of short-running (a minute) and long-running (a few minutes to an hour) jobs, which might influence our approach towards reliability.
k
I've been thinking a lot about this personally. Using something like mapping to create flow runs, each with its own pod, puts a high burst load on Kubernetes. It's very common to see some pods unable to spin up because they can't get resources. If the jobs have the same Python requirements, you can just use Dask and mapping to submit tasks to the cluster. The Dask engine itself will retry your work, you can have multiple jobs on one worker, and you don't waste time on initialization. Dask can handle the auto-scaling on its side and already has fault-tolerance mechanisms to retry work. Kubernetes will just label the pod as failed, and then Prefect would have to re-submit it for you using the Lazarus process. I would suggest just using Dask to orchestrate the jobs.
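For illustration, a minimal sketch assuming Prefect 1.x and a static Dask cluster already running at a placeholder scheduler address; work is submitted as mapped tasks to the existing Dask workers rather than one pod per flow run:

```python
# A minimal sketch, assuming Prefect 1.x and a static Dask cluster already
# running at a hypothetical scheduler address (tcp://dask-scheduler:8786).
from prefect import Flow, task
from prefect.executors import DaskExecutor

@task
def process_item(item):
    # Lightweight unit of work; each mapped child runs as a Dask task
    # on an existing worker, so there is no per-item pod startup cost.
    return item * 2

with Flow("dask-mapped-work") as flow:
    results = process_item.map(list(range(100)))

# Point the flow at the static cluster instead of spawning infrastructure per run.
flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")
```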
m
Thanks Kevin, we'll give that a try
Hi @Kevin Kho, we've had some delays but are now progressing on this again. Local testing was very encouraging. On our K8s cluster we are still running with the KubernetesAgent, but I'd like to switch that to the LocalAgent so that we avoid the pod-per-workflow overhead. Are there any other considerations that we should be aware of when making this change?
k
Not really, just that it doesn’t run in a container
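A minimal sketch of what that switch might look like, assuming Prefect 1.x; the agent label and Dask scheduler address are hypothetical placeholders:

```python
# A minimal sketch of switching from KubernetesRun to LocalRun, assuming
# Prefect 1.x; the label and scheduler address are placeholders.
from prefect import Flow, task
from prefect.executors import DaskExecutor
from prefect.run_configs import LocalRun

@task
def say_hello():
    return "hello"

with Flow("runs-in-the-agent-process") as flow:
    say_hello()

# LocalRun executes the flow run in the local agent's environment, so no
# Kubernetes job/pod is created per flow run; task execution still goes to Dask.
flow.run_config = LocalRun(labels=["local-dask"])
flow.executor = DaskExecutor(address="tcp://dask-scheduler:8786")

# The matching agent is started with:
#   prefect agent local start --label local-dask
```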
m
Perfect, thanks for the quick response!
In terms of our scaling requirement (1000 workflows) we still have a challenge. In our proposed setup we are using Dask as the executor, and the workers appear to be efficient at handling our anticipated workload. The problem area is the agent: it spawns a process per workflow. Ideally we'd want something like the Kubernetes Agent, which doesn't spawn processes; it submits jobs and then checks the status of those jobs. Obviously that agent is delegating execution of the flow run to the K8s job, whereas in our situation we would want an agent to manage the flow run and delegate task-run execution to Dask. Any suggestions on how we can achieve our 1000-workflow goal on a modest amount of resources?
k
Only the local agent will not spin up new resources. I think the more you try to minimize the resources here, the less stable it will be. You could potentially get error messages from Dask that are hard to read. Will the 1000 workflows be Flows or tasks?
m
Yep, there is a trade-off. This is for 1000 workflows. Our workflows are very lightweight in terms of computation: they will typically be relatively short calls to mid-tier services followed by sensor-style operations (again backed by mid-tier calls). So running a large number in a single process is more feasible (on paper).
k
Ah yeah, that makes sense; you should use tasks.
m
Sorry, I didn't explain our use case very well: we'll need heterogeneous parameterised workflows, for example a 3-step process to change permissions, copy data, and revert permissions, and another process to manage an EMR cluster. There will be a few dozen different workflows, but potentially hundreds of concurrent executions, each with different parameters. I can imagine creating a workflow that itself pulls work from a "queue" and then generates 100s of tasks, but that's not the way we envisaged using Prefect.
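A minimal sketch of one such parameterised workflow in Prefect 1.x; the task bodies and parameter names are hypothetical placeholders:

```python
# A minimal sketch of a 3-step parameterised workflow, assuming Prefect 1.x;
# task implementations and parameter names are placeholders.
from prefect import Flow, Parameter, task

@task
def change_permissions(dataset):
    ...

@task
def copy_data(dataset, destination):
    ...

@task
def revert_permissions(dataset):
    ...

with Flow("copy-dataset") as flow:
    dataset = Parameter("dataset")
    destination = Parameter("destination")
    granted = change_permissions(dataset)
    copied = copy_data(dataset, destination, upstream_tasks=[granted])
    revert_permissions(dataset, upstream_tasks=[copied])

# Many concurrent runs of this one flow can be started with different
# parameter values, e.g. flow.run(dataset="sales", destination="s3://bucket/x").
```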
k
1.0 can't dynamically spin up tasks except for mapping. If these run concurrently, will they be different clusters?
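For context, a minimal sketch of what dynamic fan-out via mapping looks like in Prefect 1.x; the queue-reading task is a hypothetical stand-in:

```python
# A minimal sketch of dynamic fan-out with mapping in Prefect 1.x;
# get_requests is a hypothetical stand-in for pulling work from a queue.
from prefect import Flow, task

@task
def get_requests():
    # In practice this might query a queue or API for pending requests.
    return [{"id": i} for i in range(200)]

@task
def handle_request(request):
    return f"handled {request['id']}"

with Flow("fan-out-with-mapping") as flow:
    requests = get_requests()
    handle_request.map(requests)  # one mapped child task per request at runtime
```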
m
OK, our current test workflows are still basic, but the workflow dynamically pulling a "request" and building tasks was only a hypothetical workaround. We are looking to run that many workflows in a single cluster, as indicated the cpu utilisation should be low, our issue is memory overhead (and potentially io contention or python limitations). Even if we achieve 100MB memory per workflow we could only achieve 300 workflows on a 32GB cluster.
k
Ok yeah, this will definitely take some tweaking on your part to stabilize, but at least I'm confident Dask is the way to go here instead of mapping flows with create_flow_run.
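For contrast, a minimal sketch of the "mapping flows with create_flow_run" pattern being advised against, assuming Prefect 1.x; the flow and project names are placeholders:

```python
# A minimal sketch of the flow-of-flows pattern, assuming Prefect 1.x;
# "child-flow" and "my-project" are placeholder names. Each mapped
# create_flow_run produces a separate flow run (and, with a Kubernetes
# agent, a separate pod), which is the burst behavior being avoided.
from prefect import Flow, unmapped
from prefect.tasks.prefect import create_flow_run

with Flow("parent-flow") as flow:
    create_flow_run.map(
        parameters=[{"id": i} for i in range(100)],
        flow_name=unmapped("child-flow"),
        project_name=unmapped("my-project"),
    )
```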