
    Marwan Sarieddine

    2 years ago
    I am facing an issue autoscaling from 0 nodes, using the AWS autoscaler on EKS - note the issue only arises when autoscaling from 0 nodes and not from 1 … someone recommended using something like this for waiting on dask worker pods:
    from prefect import task
    from distributed import get_client

    @task
    def wait_for_resources():
        # get the dask client from inside the running task
        client = get_client()
        # Wait until we have 10 workers
        client.wait_for_workers(n_workers=10)
    but this doesn’t seem to work for waiting on the first node to be present. Has anyone had a chance to try out auto-scaling from 0?
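    For reference, a variant that polls the scheduler and fails after a deadline instead of hanging forever might look like the sketch below (illustrative only - it assumes the task is running on a dask worker so distributed.get_client() works, and the n_workers/deadline values are arbitrary):
    import time

    from prefect import task
    from distributed import get_client

    @task
    def wait_for_resources(n_workers: int = 1, deadline_s: int = 600):
        """Poll the scheduler until at least n_workers are connected, or raise."""
        client = get_client()
        start = time.time()
        while time.time() - start < deadline_s:
            # scheduler_info() reports the workers currently connected
            if len(client.scheduler_info()["workers"]) >= n_workers:
                return
            time.sleep(5)
        raise RuntimeError(f"Timed out waiting for {n_workers} dask worker(s)")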

    Jenny

    2 years ago
    Thanks for the question @Marwan Sarieddine - I've not tried this but will see if others in the team have any ideas.

    Marwan Sarieddine

    2 years ago
    Thanks @Jenny - still trying to figure out things with prefect - any tips would be appreciated

    Jenny

    2 years ago
    Hi @Marwan Sarieddine - one immediate thought is that it wouldn't be possible for a prefect task to wait for nodes to go from 0 -> 1, since there needs to be at least 1 node in order for the flow to even start.
    If that doesn't help, can you give us a bit more information about the issue it creates for you? If the autoscaler is on and configured then once the k8s agent creates the prefect job it should start scaling up.

    Marwan Sarieddine

    2 years ago
    Hi @Jenny - the scheduler and agent are running on a different node group than the dask workers (i.e. in the dask worker spec I specify a node affinity to another node group)

    josh

    2 years ago
    @Marwan Sarieddine Which environment are you using for your flow?

    Marwan Sarieddine

    2 years ago
    Hi @josh - I am using a DaskKubernetesEnvironment.
    To better explain my thinking: I was hoping that with prefect I could have a separation of concerns, i.e. the agent/scheduler (what creates the work) running on one node group, and the workers (what actually does the work) running on a separate node group. This would let one use the same cluster with multiple worker node groups (with different instance types) to run different flows with potentially different resource requirements, as sketched below.
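    As a rough sketch of that idea (illustrative only - the flow names and worker spec filenames below are made up; each spec would pin its workers to a different node group via affinity/nodeSelector, as in the spec shared later in this thread):
    from prefect import Flow
    from prefect.environments import DaskKubernetesEnvironment

    # CPU-heavy flow: its worker spec targets one node group
    cpu_flow = Flow(
        "cpu-flow",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec_cpu.yaml",  # hypothetical spec file
            min_workers=1,
            max_workers=10,
        ),
    )

    # Memory-heavy flow: its worker spec targets a different node group
    mem_flow = Flow(
        "highmem-flow",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec_highmem.yaml",  # hypothetical spec file
            min_workers=1,
            max_workers=4,
        ),
    )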

    josh

    2 years ago
    Ah I see what you’re saying. I’m not entirely sure how dask-kubernetes handles having the scheduler and workers on different node groups. Looping in @Jim Crist-Harif in case he has an idea 🙂

    Jim Crist-Harif

    2 years ago
    Sorry, I'm having a hard time parsing this issue. dask-kubernetes supports clusters running in different node groups from the client (or schedulers running in node groups different from both the workers and the client). None of this should be causing issues. Can you restate what you're expecting/hoping to happen here with your setup, and how that differs from what you're seeing?

    Marwan Sarieddine

    2 years ago
    Hi @Jim Crist-Harif - sorry if I wasn’t clear communicating the issue. What you are saying is true - when I autoscale from a node group with a minSize of 1, things work fine. The setup here is to have the dask workers select a node group that autoscales from a minSize of 0 instead of 1. What ends up happening is that, since the node is not ready yet, the worker pod scheduling is killed prematurely - please see the event logs below (specifically the last two) for more insight into the problem (note the event logs start right after the flow is run):
    4m20s       Normal    Scheduled                                                                                                    pod/prefect-job-815860a0-gzhl6                                    Successfully assigned default/prefect-job-815860a0-gzhl6 to ip-192-168-35-156.us-west-2.compute.internal
    4m20s       Normal    SuccessfulCreate                                                                                             job/prefect-job-815860a0                                          Created pod: prefect-job-815860a0-gzhl6
    4m20s       Normal    Pulling                                                                                                      pod/prefect-job-815860a0-gzhl6                                    Pulling image "registry.gitlab.com/xxxx"
    4m17s       Normal    Pulled                                                                                                       pod/prefect-job-815860a0-gzhl6                                    Successfully pulled image "registry.gitlab.com/xxxx"
    4m17s       Normal    Created                                                                                                      pod/prefect-job-815860a0-gzhl6                                    Created container flow
    4m16s       Normal    SuccessfulCreate                                                                                             job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de         Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587
    4m16s       Normal    Started                                                                                                      pod/prefect-job-815860a0-gzhl6                                    Started container flow
    4m16s       Normal    Scheduled                                                                                                    pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 to ip-192-168-17-18.us-west-2.compute.internal
    4m15s       Normal    Pulled                                                                                                       pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Container image "registry.gitlab.com/xxxx" already present on machine
    4m15s       Normal    Created                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Created container flow
    4m14s       Normal    Started                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Started container flow
    4m4s        Normal    Scheduled                                                                                                    pod/dask-root-7c45f7c9-art6xn                                     Successfully assigned default/dask-root-7c45f7c9-art6xn to ip-192-168-35-156.us-west-2.compute.internal
    4m3s        Normal    Created                                                                                                      pod/dask-root-7c45f7c9-art6xn                                     Created container dask-worker
    4m3s        Normal    Pulled                                                                                                       pod/dask-root-7c45f7c9-art6xn                                     Container image "registry.gitlab.com/xxxx" already present on machine
    4m3s        Normal    Started                                                                                                      pod/dask-root-7c45f7c9-art6xn                                     Started container dask-worker
    4m          Normal    Killing                                                                                                      pod/dask-root-7c45f7c9-art6xn                                     Stopping container dask-worker
    3m53s       Normal    Scheduled                                                                                                    pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 to ip-192-168-35-156.us-west-2.compute.internal
    3m53s       Normal    SuccessfulCreate                                                                                             job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de         Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9
    3m52s       Normal    Pulled                                                                                                       pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Container image "registry.gitlab.com/xxxx" already present on machine
    3m52s       Normal    Created                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Created container flow
    3m51s       Normal    Started                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Started container flow
    3m3s        Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     0/2 nodes are available: 2 node(s) didn't match node selector.
    3m33s       Normal    TriggeredScaleUp                                                                                             pod/dask-root-0f923f19-62vfb8                                     pod triggered scale-up: [{eksctl-prefect-eks-test-nodegroup-eks-cpu-2-NodeGroup-1PDZRBRX9TE4I 0->1 (max: 10)}]
    3m33s       Normal    Killing                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Stopping container flow
    2m52s       Normal    NodeHasNoDiskPressure                                                                                        node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
    2m52s       Normal    NodeHasSufficientMemory                                                                                      node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientMemory
    2m52s       Normal    NodeAllocatableEnforced                                                                                      node/ip-192-168-81-183.us-west-2.compute.internal                 Updated Node Allocatable limit across pods
    2m52s       Normal    NodeHasSufficientPID                                                                                         node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientPID
    2m49s       Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
    2m53s       Normal    Starting                                                                                                     node/ip-192-168-81-183.us-west-2.compute.internal                 Starting kubelet.
    2m51s       Normal    RegisteredNode                                                                                               node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal event: Registered Node ip-192-168-81-183.us-west-2.compute.internal in Controller
    2m48s       Normal    Starting                                                                                                     node/ip-192-168-81-183.us-west-2.compute.internal                 Starting kube-proxy.
    2m33s       Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     skip schedule deleting pod: default/dask-root-0f923f19-62vfb8
    2m32s       Normal    NodeReady                                                                                                    node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeReady
    If I rerun the flow once that single spawned node is ready, the flow runs fine.
    I think this warning:
    2m49s       Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
    The taint effect is probably NoSchedule because the node is still not ready … hence the dask worker pod is killed prematurely.

    Jim Crist-Harif

    2 years ago
    Can you share your dask-kubernetes configuration? I'm struggling to find what's triggering that pod deletion - afaict we don't have timeouts anywhere for pod startups.

    Marwan Sarieddine

    2 years ago
    Sure - should I share the worker spec?

    Jim Crist-Harif

    2 years ago
    Yeah, and whether you're running the scheduler remote or local.

    Marwan Sarieddine

    2 years ago
    Remote scheduler - please see the worker spec below:
    kind: Pod
    metadata:
      labels:
        app: prefect-dask-worker
    spec:
      replicas: 2
      restartPolicy: Never
      imagePullSecrets:
      - name: gitlab-secret
      # note I tried using both affinity and a selector
      # nodeSelector:
      #   role: supplement
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - supplement
      containers:
        - image: registry.gitlab.com/xxxx
          imagePullPolicy: IfNotPresent
          args: [dask-worker, --nthreads, "1", --no-bokeh, --memory-limit, 4GB]
          name: dask-worker
          env:
            - name: AWS_BUCKET
              value: xxxx
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-secret
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-secret
                  key: AWS_SECRET_ACCESS_KEY
          resources:
            limits:
              cpu: "2000m"
              memory: 4G
            requests:
              cpu: "1000m"
              memory: 2G

    Jim Crist-Harif

    2 years ago
    What happens with your wait_for_resources() task? Does it hang forever? Or does it error? Or return early but you don't have workers?

    Marwan Sarieddine

    2 years ago
    it just hangs forever
    actually none of the flow tasks look like they are triggered (including wait_for_resources)

    Jim Crist-Harif

    2 years ago
    You're saying that wait_for_resources hasn't been called at all? Then you're likely blocking on creating the initial scheduler pod, not any of the workers.

    Marwan Sarieddine

    2 years ago
    22 May 2020,01:54:21 	prefect.CloudFlowRunner	INFO	Beginning Flow run for 'Data Processing'
    22 May 2020,01:54:21 	prefect.CloudFlowRunner	INFO	Starting flow run.
    22 May 2020,01:54:21 	prefect.CloudFlowRunner	DEBUG	Flow 'Data Processing': Handling state change from Scheduled to Running
    these are the logs - yes it hasn’t been called at all
    I see - how is that possible if the logs show a state change? I guess that’s being triggered by the agent - correct?

    Jim Crist-Harif

    2 years ago
    The flow runner has started, but since the executor isn't running, no task runs have been started.
    Can you share your scheduler configuration as well?
    Mainly interested in kwargs you're passing to the KubeCluster.

    Marwan Sarieddine

    2 years ago
    I am not specifying a job.yaml - just using the default

    Jim Crist-Harif

    2 years ago
    Do you set scheduler_service_wait_timeout?

    Marwan Sarieddine

    2 years ago
    i.e. here is my DaskKubernetesEnvironment call:
    import os

    from prefect import Flow
    from prefect.environments import DaskKubernetesEnvironment
    from prefect.environments.storage import Docker

    # s3_result is defined elsewhere
    flow = Flow(
        "Data Processing",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec.yaml",
            min_workers=1,
            max_workers=10,
        ),
        storage=Docker(
            registry_url=os.environ['GITLAB_REGISTRY'],
            image_name="dask-k8s-flow",
            image_tag="0.1.0",
            python_dependencies=[
                'boto3==1.13.14',
                'numpy==1.18.4',
            ],
        ),
        result=s3_result,
    )

    Jim Crist-Harif

    2 years ago
    ok, thanks

    Marwan Sarieddine

    2 years ago
    “Do you set scheduler_service_wait_timeout?”
    As you can see - no, I don’t explicitly set it

    Jim Crist-Harif

    2 years ago
    You said you're using a remote scheduler - did you set that as part of your dask config then? By default, deploy-mode is "local".

    Marwan Sarieddine

    2 years ago
    I am sorry - if by local you mean the scheduler also runs on the cluster, then that is what I intend

    Jim Crist-Harif

    2 years ago
    There are two deploy modes:
    • local means the scheduler runs in the same process as your prefect flow-runner (which is on k8s)
    • remote means the scheduler gets its own pod
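    For reference, a minimal sketch of how the deploy mode (and the related scheduler service timeout) could be set through dask config - the exact config keys here are assumptions based on the dask-kubernetes version in use, so verify them against your installed version:
    import dask.config

    dask.config.set({
        # assumed dask-kubernetes config keys
        "kubernetes.deploy-mode": "remote",                # default is "local"
        "kubernetes.scheduler-service-wait-timeout": 300,  # only relevant in remote mode
    })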

    Marwan Sarieddine

    2 years ago
    I see

    Jim Crist-Harif

    2 years ago
    Are those logs from your flow run the only output you see from the pod running your flow?

    Marwan Sarieddine

    2 years ago
    That’s the output of calling kubectl get events - I also shared the prefect logs, which basically just state that the flow runner has started.
    Not sure what other output to inspect in this case?

    Jim Crist-Harif

    2 years ago
    If the pod is still running, it'd be useful to see kubectl logs that-pod-name. Prefect only stores its own logs to cloud; there may be other things written to stdout that are missed.

    Marwan Sarieddine

    2 years ago
    I see - sure, will try to look into it - the only annoying thing is that the pods are very ephemeral - sounds like the main culprit is the scheduler timing out prematurely…
    (probably should remove the resource manager to get easier access to the pod logs)

    Jim Crist-Harif

    2 years ago
    If you're running with deploy-mode="local" (which it sounds like you are), here's what I think has happened:
    • The pod running your flow runner starts
    • The flow runner starts
    • The flow runner creates a dask-kubernetes cluster. Since you're running with deploy-mode local, this creates a scheduler process in the same pod.
    • The flow runner submits work to the dask scheduler, which happily accepts it but has no workers to run things on. The scheduler requests worker pods from kubernetes.
    • ???
    • Things hang.
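    In rough code terms, the environment is doing something like the sketch below under the hood (a simplification, not prefect's actual implementation - the spec filename comes from this thread and the adapt bounds mirror min_workers/max_workers):
    from dask_kubernetes import KubeCluster
    from dask.distributed import Client

    # deploy-mode "local": the scheduler starts inside this (flow runner) pod
    cluster = KubeCluster.from_yaml("worker_spec.yaml")
    cluster.adapt(minimum=1, maximum=10)  # worker pods are requested from k8s on demand

    client = Client(cluster)
    # Work submitted through this client just sits on the scheduler until a
    # worker pod lands on a ready node in the target node group.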

    Marwan Sarieddine

    2 years ago
    exactly

    Jim Crist-Harif

    2 years ago
    It's a bit hard to track which of the event log entries above are due to this flow vs other flows, but if that last pod that was deleted was the flow runner's, it's not clear why it would have been deleted. We don't impose time limits on spinning up workers.
    How long have you let this run before killing it? Has the node spun up and sat for a few minutes idle with dask not submitting new workers to it?

    Marwan Sarieddine

    2 years ago
    I let it stay “running” for at least 10 minutes

    Jim Crist-Harif

    2 years ago
    That should easily be enough time. Something odd has happened.

    Marwan Sarieddine

    2 years ago
    Question: regarding “The scheduler requests worker pods from kubernetes” - if the kubernetes cluster takes a while to provide the pods, does the scheduler time out and get killed?

    Jim Crist-Harif

    2 years ago
    The scheduler is running in the local pod, and should hang indefinitely when it has work scheduled but nowhere to run it. It will try to get new workers spun up, but shouldn't time out at any point.

    Marwan Sarieddine

    2 years ago
    Oh wow - I just removed the resource-manager and re-ran to get the pod logs (i.e. prevent pods getting killed), and it looks like it ran just fine

    Jim Crist-Harif

    2 years ago
    A new node spun up, workers were allocated, and the flow ran?

    Marwan Sarieddine

    2 years ago
    yep
    looks like the resource-manager is killing the pod prematurely
    removing the resource-manager is the only change I made here - I am out of other ideas, but I think that is what's causing this issue

    Jim Crist-Harif

    2 years ago
    Oh wow, yep, that's it. That code deletes pending pods willy-nilly. That's a bug.

    josh

    2 years ago
    I would buy that the resource manager is causing an issue. It has a very greedy removal of resources

    Jim Crist-Harif

    2 years ago
    Nice find!
    I'll submit a fix asap

    josh

    2 years ago
    Thanks @Jim Crist-Harif

    Marwan Sarieddine

    2 years ago
    @Jim Crist-Harif thank you so much for helping me debug this