# ask-community
c
Hello all, I just wanted to rule out a possibility with my autoscaling issues I'm facing right now. If we're paying for unlimited concurrency (pay per run), is there anything else in Prefect that would prevent the scheduling of all the pods at once?
k
Hi @Charles Liu! Deleted last message. Do you mean the execution of multiple Flows, or is this just one Flow?
Could you show your RunConfig and Executor setup?
c
So like I've been trying to load test our autoscaling group by running as many pipelines concurrently as possible
and I don't have too much information just yet as to why one pipeline hits an OOMKill (limited metrics on the logs side for the cluster too, because I've only just gotten this set up).
Where would I find those files?
d
They’re defined as part of your flows
Also, can you share some details about your infrastructure?
What’s your cluster setup on K8s?
c
I'm using S3Result as a serializer, CodeCommit as Storage, and KubernetesRun as the Run Config
on AWS I have an EKS cluster with 4 nodes
I did a lot of the setup through eksctl and kubectl
TLDR It runs as it stands, but I'm trying to find a way to definitively prove that it's scaling
EKSAgent is green and running on Prefect frontend
d
How many Flow Runs can you submit at a time?
Would you mind sharing your RunConfig object?
Or basically all of your Flow’s config?
for example:
config = KubernetesRun(image="my-image-url", cpu_request=1, labels=["my-label"])
It sounds like you may also want to enable this flag to look at failed jobs
c
RUN_CONFIG = KubernetesRun(image="ARTIFACTREPO",
                           image_pull_secrets=["AWS_CREDENTIALS"])
d
So one thing you may wanna try is to set a cpu limit and request
to make sure one job isn’t requesting all of the resources on your nodes
and they’re not in contention
But I am not a k8s expert
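For instance, a rough sketch along these lines (the request/limit values here are placeholders to tune against what each flow actually uses; the image and pull secret are just carried over from the config above):
from prefect.run_configs import KubernetesRun

# Placeholder sizing -- adjust to what each flow actually needs.
RUN_CONFIG = KubernetesRun(
    image="ARTIFACTREPO",
    image_pull_secrets=["AWS_CREDENTIALS"],
    cpu_request="500m",     # scheduler reserves half a core for this job's pod
    cpu_limit="1",          # pod is throttled beyond one core
    memory_request="1Gi",   # scheduler reserves 1 GiB for this job's pod
    memory_limit="2Gi",     # pod is OOMKilled beyond 2 GiB
)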
c
so the way I've been submitting flows is actually leveraging the schedule function
so I can just go down the list and run like 12 of them back to back
some are low resource and finish quickly, but one specifically will hit an OOMKill and it happened in a pattern when I ran it overnight
Sometimes it works, mostly it doesn't
So I see the following flags available for KubernetesRun
class prefect.run_configs.kubernetes.KubernetesRun(job_template_path=None, job_template=None, image=None, env=None, cpu_limit=None, cpu_request=None, memory_limit=None, memory_request=None, service_account_name=None, image_pull_secrets=None, labels=None, image_pull_policy=None)
but isn't the point of autoscaling to account for all load without needing to set limits upfront? (At least that's my goal here.)
I've managed to get some kinda signal but didn't actually change anything. Thanks for the insight nonetheless!
d
@Charles Liu i’ll follow up tomorrow
h
Are you running the cluster-autoscaler on your EKS cluster?
If you're not running an autoscaler, k8s has no way to spin up new nodes to deal with increased load
and this isn't something that Prefect can do: Prefect can only schedule pods; it's up to your k8s setup to handle creating additional machines if necessary
d
Hi Charles, that was actually going to be my question as well haha
c
My cluster is linked to an autoscaling group and that scaling group was configured manually
t
Hi Charles!
isn't the point of autoscaling to account for all load without needing to set limits upfront
I think this is a common misconception. The key to autoscaling resiliently is to properly size your workloads. If your workload does not request any resources, then the cluster's autoscaler cannot properly anticipate requiring additional capacity. Your cluster will then overschedule work on your node, and when the node eventually runs out of gas, it will start terminating your pods, introducing latency (sometimes fatal latency) to your workflows
If you set some high memory resource requests, you should be able to get away without setting any hard limits and you should see the cluster scale better
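In KubernetesRun terms, that could be a sketch like this (the 4Gi and 1-core figures are hypothetical; pick a request with enough headroom for the flow that's OOMKilling, and leave cpu_limit/memory_limit unset):
from prefect.run_configs import KubernetesRun

# Placeholder sizing per the advice above: a generous request and no hard limits,
# so the cluster-autoscaler can see the demand and add nodes instead of
# overscheduling pods onto an already-full node.
RUN_CONFIG = KubernetesRun(
    image="ARTIFACTREPO",
    image_pull_secrets=["AWS_CREDENTIALS"],
    memory_request="4Gi",   # hypothetical: size to the flow that hits OOMKill
    cpu_request="1",
)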
c
@Tyler Wanner I'm circling back to this thread now. Thanks for the insight! I'll correct my understanding of autoscaling moving forward, appreciate it! I'm going to try respeccing the memory like you said. TY!
t
great--if you run into any additional snags let me know!
c
Will do! Thanks for taking the time thus far 🙂