Thread
#prefect-community

    Charles Liu

    1 year ago
    Hello all, I just wanted to rule out a possibility with the autoscaling issues I'm facing right now. If we're paying for unlimited concurrency (pay per run), is there anything else in Prefect that would prevent the scheduling of all the pods at once?
    Kevin Kho

    1 year ago
    Hi @Charles Liu! Deleted my last message. Do you mean the execution of multiple Flows, or is this just one Flow?
    Could you show your RunConfig and Executor setup?

    Charles Liu

    1 year ago
    So like I've been trying to load test our autoscaling group by running as many pipelines concurrently as possible
    and I don't have too much information just yet regarding why one pipeline hits an OOMKill (limited metrics on the logs side for the cluster too, because I've only just gotten this set up).
    Where would I find those files?
    Dylan

    1 year ago
    They’re defined as part of your flows
    Also, can you share some details about your infrastructure?
    What’s your cluster setup on K8s?

    Charles Liu

    1 year ago
    I'm using S3Result as a serializer, CodeCommit as Storage, and KubernetesRun as Run Config
    on AWS I have an EKS cluster with 4 nodes
    I did a lot of the setup through eksctl and kubectl
    TLDR It runs as it stands, but I'm trying to find a way to definitively prove that it's scaling
    EKSAgent is green and running on Prefect frontend
    Dylan

    1 year ago
    How many Flow Runs can you submit at a time?
    Would you mind sharing your RunConfig object?
    Or basically all of your Flow’s config?
    for example:
    config = KubernetesRun(image="my-image-url", cpu_request=1, labels=["my-label"])
    It sounds like you may also want to enable this flag to look at failed jobs

    Charles Liu

    1 year ago
    from prefect.run_configs import KubernetesRun

    RUN_CONFIG = KubernetesRun(
        image="ARTIFACTREPO",
        image_pull_secrets=["AWS_CREDENTIALS"],
    )
    Dylan

    1 year ago
    So one thing you may wanna try is to set a cpu limit and request
    to make sure one job isn’t requesting all of the resources on your nodes
    and they’re not in contention
    But I am not a k8s expert
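    For example, something along these lines (values are purely illustrative, reusing the placeholder image/secret names from your snippet):
    from prefect.run_configs import KubernetesRun

    # Illustrative numbers; tune the requests/limits to what the flow actually needs
    RUN_CONFIG = KubernetesRun(
        image="ARTIFACTREPO",
        image_pull_secrets=["AWS_CREDENTIALS"],
        cpu_request="500m",     # reserved for the flow-run pod at scheduling time
        cpu_limit="1",          # ceiling so one job can't take the whole node
        memory_request="2Gi",
        memory_limit="4Gi",
    )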

    Charles Liu

    1 year ago
    so the way I've been submitting flows is actually leveraging the schedule function
    so I can just go down the list and run like 12 of them back to back
    some are low resource and finish quickly, but one specifically will hit an OOMKill and it happened in a pattern when I ran it overnight
    Sometimes it works, mostly it doesn't
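    (Roughly this shape, assuming Prefect 1.x IntervalSchedule; the flow name and interval here are placeholders:)
    from datetime import timedelta
    from prefect import Flow
    from prefect.schedules import IntervalSchedule

    # Placeholder flow name and interval, just to show where the schedule attaches
    schedule = IntervalSchedule(interval=timedelta(hours=1))

    with Flow("load-test-flow", schedule=schedule, run_config=RUN_CONFIG) as flow:
        ...  # tasks go here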
    So I see the following flags available for KubernetesRun
    class prefect.run_configs.kubernetes.KubernetesRun(job_template_path=None, job_template=None, image=None, env=None, cpu_limit=None, cpu_request=None, memory_limit=None, memory_request=None, service_account_name=None, image_pull_secrets=None, labels=None, image_pull_policy=None)
    but isn't the point of autoscaling to account for all load without needing to set limits upfront? (at least that's my goal here)
    I've managed to get some kinda signal but didn't actually change anything. Thanks for the insight nonetheless!
    Dylan

    1 year ago
    @Charles Liu i’ll follow up tomorrow

    Hugo Shi

    1 year ago
    Are you running the cluster-autoscaler on your EKS cluster?
    If you're not running an autoscaler, k8s has no way to spin up new nodes to deal with increased load
    and this isn't something that Prefect can do - Prefect can only schedule pods; it's up to your k8s setup to handle creating additional machines if necessary
    Dylan

    1 year ago
    Hi Charles, that was actually going to be my question as well haha

    Charles Liu

    1 year ago
    My cluster is linked to an autoscaling group and that scaling group was configured manually
    Tyler Wanner

    1 year ago
    Hi Charles!
    "isn't the point of autoscaling to account for all load without needing to set limits upfront"
    I think this is a common misconception. The key to autoscaling resiliently is to properly size your workloads. If your workload does not request any resources, then the cluster's autoscaler cannot properly anticipate requiring additional capacity. Your cluster will then overschedule work on your nodes, and when a node eventually runs out of gas, it will start terminating your pods, introducing latency (sometimes fatal latency) to your workflows.
    If you set some high memory resource requests, you should be able to get away without setting any hard limits and you should see the cluster scale better
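    Something like this as a starting point (the numbers are made up; size the request to what that heavy flow actually peaks at):
    from prefect.run_configs import KubernetesRun

    # Generous memory request so the cluster autoscaler sees the real demand,
    # and no memory_limit so the pod isn't killed at an artificial ceiling
    RUN_CONFIG = KubernetesRun(
        image="ARTIFACTREPO",
        image_pull_secrets=["AWS_CREDENTIALS"],
        memory_request="8Gi",
    )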

    Charles Liu

    1 year ago
    @Tyler Wanner I'm circling back to this thread now. Thanks for the insight! I'll correct my understanding of autoscaling moving forward, appreciate it! I'm going to try respeccing the memory like you said. TY!
    Tyler Wanner

    1 year ago
    Great! If you run into any additional snags, let me know!

    Charles Liu

    1 year ago
    Will do! Thanks for taking the time thus far 🙂