Thread
#prefect-community

    Charles Liu

    1 year ago
    Hello all, I just wanted to rule out a possibility with the autoscaling issues I'm facing right now. If we're paying for unlimited concurrency (pay per run), is there anything else in Prefect that would prevent the scheduling of all the pods at once?
    Kevin Kho

    1 year ago
    Hi @Charles Liu! Deleted my last message. Do you mean the execution of multiple Flows, or is this just one Flow?
    Could you show your RunConfig and Executor setup?

    Charles Liu

    1 year ago
    So like I've been trying to load test our autoscaling group by running as many pipelines concurrently as possible
    and I don't have too much information just yet regarding why one pipeline hits an OOMKill (limited metrics on the logs side for the cluster too, because I've only just gotten this set up).
    Where would I find those files?
    Dylan

    1 year ago
    They’re defined as part of your flows
    Also, can you share some details about your infrastructure?
    What’s your cluster setup on K8s?

    Charles Liu

    1 year ago
    I'm using S3Result as a serializer, CodeCommit as Storage, and KubernetesRun as Run Config
    on AWS I have an EKS cluster with 4 nodes
    I did a lot of the setup through eksctl and kubectl
    TLDR It runs as it stands, but I'm trying to find a way to definitively prove that it's scaling
    EKSAgent is green and running on Prefect frontend
    Dylan

    1 year ago
    How many Flow Runs can you submit at a time?
    Would you mind sharing your RunConfig object?
    Or basically all of your Flow’s config?
    for example:
    config = KubernetesRun(image="my-image-url", cpu_request=1, labels=["my-label"])
    It sounds like you may also want to enable this flag to look at failed jobs

    Charles Liu

    1 year ago
    from prefect.run_configs import KubernetesRun

    RUN_CONFIG = KubernetesRun(
        image="ARTIFACTREPO",
        image_pull_secrets=["AWS_CREDENTIALS"],
    )
    Dylan

    1 year ago
    So one thing you may wanna try is to set a cpu limit and request
    to make sure one job isn’t requesting all of the resources on your nodes
    and they’re not in contention
    But I am not a k8s expert
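    For example, something along these lines (values are purely illustrative, reusing the placeholder image/secret names from your snippet):
    from prefect.run_configs import KubernetesRun

    # Illustrative numbers; tune the requests/limits to what the flow actually needs
    RUN_CONFIG = KubernetesRun(
        image="ARTIFACTREPO",
        image_pull_secrets=["AWS_CREDENTIALS"],
        cpu_request="500m",     # reserved for the flow-run pod at scheduling time
        cpu_limit="1",          # ceiling so one job can't take the whole node
        memory_request="2Gi",
        memory_limit="4Gi",
    )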

    Charles Liu

    1 year ago
    so the way I've been submitting flows is actually leveraging the schedule function
    so I can just go down the list and run like 12 of them back to back
    some are low resource and finish quickly, but one specifically will hit an OOMKill and it happened in a pattern when I ran it overnight
    Sometimes it works, mostly it doesn't
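    (Roughly this shape, assuming Prefect 1.x IntervalSchedule; the flow name and interval here are placeholders:)
    from datetime import timedelta
    from prefect import Flow
    from prefect.schedules import IntervalSchedule

    # Placeholder flow name and interval, just to show where the schedule attaches
    schedule = IntervalSchedule(interval=timedelta(hours=1))

    with Flow("load-test-flow", schedule=schedule, run_config=RUN_CONFIG) as flow:
        ...  # tasks go here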
    So I see the following flags available for KubernetesRun
    class prefect.run_configs.kubernetes.KubernetesRun(job_template_path=None, job_template=None, image=None, env=None, cpu_limit=None, cpu_request=None, memory_limit=None, memory_request=None, service_account_name=None, image_pull_secrets=None, labels=None, image_pull_policy=None)
    but isn't the point of autoscaling to account for all load without needing to set limits upfront? (at least that's my goal here)
    I've managed to get some kinda signal but didn't actually change anything. Thanks for the insight nonetheless!
    Dylan

    1 year ago
    @Charles Liu i’ll follow up tomorrow

    Hugo Shi

    1 year ago
    Are you running the cluster-autoscaler on your EKS cluster?
    If you're not running an autoscaler, k8s has no way to spin up new nodes to deal with increased load
    and this isn't something that Prefect can do - Prefect can only schedule pods; it's up to your k8s setup to handle creating additional machines if necessary
    Dylan

    1 year ago
    Hi Charles, that was actually going to be my question as well haha

    Charles Liu

    1 year ago
    My cluster is linked to an autoscaling group and that scaling group was configured manually
    Tyler Wanner

    1 year ago
    Hi Charles!
    "isn't the point of autoscaling to account for all load without needing to set limits upfront"
    I think this is a common misconception. The key to autoscaling resiliently is to properly size your workloads. If your workload does not request any resources, then the cluster's autoscaler cannot properly anticipate requiring additional capacity. Your cluster will then overschedule work on your nodes, and when a node eventually runs out of gas, it will start terminating your pods, introducing latency (sometimes fatal latency) to your workflows.
    If you set some high memory resource requests, you should be able to get away without setting any hard limits and you should see the cluster scale better
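    Something like this as a starting point (the numbers are made up; size the request to what that heavy flow actually peaks at):
    from prefect.run_configs import KubernetesRun

    # Generous memory request so the cluster autoscaler sees the real demand,
    # and no memory_limit so the pod isn't killed at an artificial ceiling
    RUN_CONFIG = KubernetesRun(
        image="ARTIFACTREPO",
        image_pull_secrets=["AWS_CREDENTIALS"],
        memory_request="8Gi",
    )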

    Charles Liu

    1 year ago
    @Tyler Wanner I'm circling back to this thread now. Thanks for the insight! I'll correct my understanding of autoscaling moving forward, appreciate it! I'm going to try respeccing the memory like you said. TY!
    Tyler Wanner

    1 year ago
    Great! If you run into any additional snags, let me know!

    Charles Liu

    1 year ago
    Will do! Thanks for taking the time thus far 🙂