
    Marwan Sarieddine

    2 years ago
    I am facing an issue autoscaling from 0 nodes, using the AWS autoscaler on EKS - note the issue only arises when autoscaling from 0 nodes and not from 1 … someone recommended using something like this for waiting on dask worker pods:
    from prefect import task
    from distributed import get_client

    @task
    def wait_for_resources():
        # get the dask client from inside the running task
        client = get_client()
        # Wait until we have 10 workers
        client.wait_for_workers(n_workers=10)
    but this doesn’t seem to work for waiting on the first node to be present. Has anyone had a chance to try out auto-scaling from 0?
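    For reference, a variant that polls the scheduler and fails after a deadline instead of hanging forever might look like the sketch below (illustrative only - it assumes the task is running on a dask worker so distributed.get_client() works, and the n_workers/deadline values are arbitrary):
    import time

    from prefect import task
    from distributed import get_client

    @task
    def wait_for_resources(n_workers: int = 1, deadline_s: int = 600):
        """Poll the scheduler until at least n_workers are connected, or raise."""
        client = get_client()
        start = time.time()
        while time.time() - start < deadline_s:
            # scheduler_info() reports the workers currently connected
            if len(client.scheduler_info()["workers"]) >= n_workers:
                return
            time.sleep(5)
        raise RuntimeError(f"Timed out waiting for {n_workers} dask worker(s)")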

    Jenny

    2 years ago
    Thanks for the question @Marwan Sarieddine - I've not tried this but will see if others in the team have any ideas.

    Marwan Sarieddine

    2 years ago
    Thanks @Jenny - still trying to figure out things with prefect - any tips would be appreciated

    Jenny

    2 years ago
    Hi @Marwan Sarieddine - one immediate thought is that it wouldn't be possible for a prefect task to wait for nodes to go from 0 -> 1, since there needs to be at least 1 node in order for the flow to even start.
    If that doesn't help, can you give us a bit more information about the issue it creates for you? If the autoscaler is on and configured then once the k8s agent creates the prefect job it should start scaling up.

    Marwan Sarieddine

    2 years ago
    Hi @Jenny - the scheduler and agent are running on a different node group than the dask workers (i.e. in the dask worker spec I specify a node affinity to another node group)

    josh

    2 years ago
    @Marwan Sarieddine Which environment are you using for your flow?

    Marwan Sarieddine

    2 years ago
    Hi @josh - I am using a DaskKubernetesEnvironment.
    To better explain my thinking: I was hoping that with prefect I could have a separation of concerns, i.e. the agent/scheduler (what creates the work) running on one node group, and the workers (what actually does the work) running on a separate node group. This would let one use the same cluster with multiple worker node groups (with different instance types) to run different flows with potentially different resource requirements, as sketched below.
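    As a rough sketch of that idea (illustrative only - the flow names and worker spec filenames below are made up; each spec would pin its workers to a different node group via affinity/nodeSelector, as in the spec shared later in this thread):
    from prefect import Flow
    from prefect.environments import DaskKubernetesEnvironment

    # CPU-heavy flow: its worker spec targets one node group
    cpu_flow = Flow(
        "cpu-flow",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec_cpu.yaml",  # hypothetical spec file
            min_workers=1,
            max_workers=10,
        ),
    )

    # Memory-heavy flow: its worker spec targets a different node group
    mem_flow = Flow(
        "highmem-flow",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec_highmem.yaml",  # hypothetical spec file
            min_workers=1,
            max_workers=4,
        ),
    )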

    josh

    2 years ago
    Ah I see what you’re saying. I’m not entirely sure how dask-kubernetes handles having the scheduler and workers on different node groups. Looping in @Jim Crist-Harif in case he has an idea 🙂

    Jim Crist-Harif

    2 years ago
    Sorry, I'm having a hard time parsing this issue. dask-kubernetes supports clusters running in different node groups from the client (or schedulers running in node groups different from both the workers and the client). None of this should be causing issues. Can you restate what you're expecting/hoping to happen here with your setup, and how that differs from what you're seeing?

    Marwan Sarieddine

    2 years ago
    Hi @Jim Crist-Harif - sorry if I wasn’t clear communicating the issue. What you are saying is true - when I autoscale from a node group with a minSize of 1, things work fine. The setup here is to have the dask workers select a node group that autoscales from a minSize of 0 instead of 1. What ends up happening is that, since the node is not ready yet, the worker pod scheduling is killed prematurely - please see the event logs below (specifically the last two) for more insight into the problem (note the event logs start right after the flow is run):
    4m20s       Normal    Scheduled                                                                                                    pod/prefect-job-815860a0-gzhl6                                    Successfully assigned default/prefect-job-815860a0-gzhl6 to ip-192-168-35-156.us-west-2.compute.internal
    4m20s       Normal    SuccessfulCreate                                                                                             job/prefect-job-815860a0                                          Created pod: prefect-job-815860a0-gzhl6
    4m20s       Normal    Pulling                                                                                                      pod/prefect-job-815860a0-gzhl6                                    Pulling image "registry.gitlab.com/xxxx"
    4m17s       Normal    Pulled                                                                                                       pod/prefect-job-815860a0-gzhl6                                    Successfully pulled image "registry.gitlab.com/xxxx"
    4m17s       Normal    Created                                                                                                      pod/prefect-job-815860a0-gzhl6                                    Created container flow
    4m16s       Normal    SuccessfulCreate                                                                                             job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de         Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587
    4m16s       Normal    Started                                                                                                      pod/prefect-job-815860a0-gzhl6                                    Started container flow
    4m16s       Normal    Scheduled                                                                                                    pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 to ip-192-168-17-18.us-west-2.compute.internal
    4m15s       Normal    Pulled                                                                                                       pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Container image "registry.gitlab.com/xxxx" already present on machine
    4m15s       Normal    Created                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Created container flow
    4m14s       Normal    Started                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587   Started container flow
    4m4s        Normal    Scheduled                                                                                                    pod/dask-root-7c45f7c9-art6xn                                     Successfully assigned default/dask-root-7c45f7c9-art6xn to ip-192-168-35-156.us-west-2.compute.internal
    4m3s        Normal    Created                                                                                                      pod/dask-root-7c45f7c9-art6xn                                     Created container dask-worker
    4m3s        Normal    Pulled                                                                                                       pod/dask-root-7c45f7c9-art6xn                                     Container image "registry.gitlab.com/xxxx" already present on machine
    4m3s        Normal    Started                                                                                                      pod/dask-root-7c45f7c9-art6xn                                     Started container dask-worker
    4m          Normal    Killing                                                                                                      pod/dask-root-7c45f7c9-art6xn                                     Stopping container dask-worker
    3m53s       Normal    Scheduled                                                                                                    pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 to ip-192-168-35-156.us-west-2.compute.internal
    3m53s       Normal    SuccessfulCreate                                                                                             job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de         Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9
    3m52s       Normal    Pulled                                                                                                       pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Container image "registry.gitlab.com/xxxx" already present on machine
    3m52s       Normal    Created                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Created container flow
    3m51s       Normal    Started                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Started container flow
    3m3s        Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     0/2 nodes are available: 2 node(s) didn't match node selector.
    3m33s       Normal    TriggeredScaleUp                                                                                             pod/dask-root-0f923f19-62vfb8                                     pod triggered scale-up: [{eksctl-prefect-eks-test-nodegroup-eks-cpu-2-NodeGroup-1PDZRBRX9TE4I 0->1 (max: 10)}]
    3m33s       Normal    Killing                                                                                                      pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9   Stopping container flow
    2m52s       Normal    NodeHasNoDiskPressure                                                                                        node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
    2m52s       Normal    NodeHasSufficientMemory                                                                                      node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientMemory
    2m52s       Normal    NodeAllocatableEnforced                                                                                      node/ip-192-168-81-183.us-west-2.compute.internal                 Updated Node Allocatable limit across pods
    2m52s       Normal    NodeHasSufficientPID                                                                                         node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientPID
    2m49s       Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
    2m53s       Normal    Starting                                                                                                     node/ip-192-168-81-183.us-west-2.compute.internal                 Starting kubelet.
    2m51s       Normal    RegisteredNode                                                                                               node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal event: Registered Node ip-192-168-81-183.us-west-2.compute.internal in Controller
    2m48s       Normal    Starting                                                                                                     node/ip-192-168-81-183.us-west-2.compute.internal                 Starting kube-proxy.
    2m33s       Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     skip schedule deleting pod: default/dask-root-0f923f19-62vfb8
    2m32s       Normal    NodeReady                                                                                                    node/ip-192-168-81-183.us-west-2.compute.internal                 Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeReady
    If I rerun the flow once that single spawned node is ready, the flow runs fine.
    I think this warning:
    2m49s       Warning   FailedScheduling                                                                                             pod/dask-root-0f923f19-62vfb8                                     0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
    The taint effect is probably NoSchedule because the node is still not ready … hence the dask worker pod is killed prematurely.

    Jim Crist-Harif

    2 years ago
    Can you share your dask-kubernetes configuration? I'm struggling to find what's triggering that pod deletion - afaict we don't have timeouts anywhere for pod startups.

    Marwan Sarieddine

    2 years ago
    Sure - should I share the worker spec?

    Jim Crist-Harif

    2 years ago
    Yeah, and whether you're running the scheduler remote or local.

    Marwan Sarieddine

    2 years ago
    Remote scheduler - please see the worker spec below:
    kind: Pod
    metadata:
      labels:
        app: prefect-dask-worker
    spec:
      replicas: 2
      restartPolicy: Never
      imagePullSecrets:
      - name: gitlab-secret
      # note I tried using both affinity and a selector
      # nodeSelector:
      #   role: supplement
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: role
                operator: In
                values:
                - supplement
      containers:
        - image: registry.gitlab.com/xxxx
          imagePullPolicy: IfNotPresent
          args: [dask-worker, --nthreads, "1", --no-bokeh, --memory-limit, 4GB]
          name: dask-worker
          env:
            - name: AWS_BUCKET
              value: xxxx
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-secret
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-secret
                  key: AWS_SECRET_ACCESS_KEY
          resources:
            limits:
              cpu: "2000m"
              memory: 4G
            requests:
              cpu: "1000m"
              memory: 2G

    Jim Crist-Harif

    2 years ago
    What happens with your wait_for_resources() task? Does it hang forever? Or does it error? Or return early but you don't have workers?

    Marwan Sarieddine

    2 years ago
    it just hangs forever
    actually none of the flow tasks look like they are triggered (including wait_for_resources)

    Jim Crist-Harif

    2 years ago
    You're saying that wait_for_resources hasn't been called at all? Then you're likely blocking on creating the initial scheduler pod, not any of the workers.

    Marwan Sarieddine

    2 years ago
    22 May 2020,01:54:21 	prefect.CloudFlowRunner	INFO	Beginning Flow run for 'Data Processing'
    22 May 2020,01:54:21 	prefect.CloudFlowRunner	INFO	Starting flow run.
    22 May 2020,01:54:21 	prefect.CloudFlowRunner	DEBUG	Flow 'Data Processing': Handling state change from Scheduled to Running
    these are the logs - yes it hasn’t been called at all
    I see - how is that possible if the logs show a state change? I guess that’s being triggered by the agent - correct?

    Jim Crist-Harif

    2 years ago
    The flow runner has started, but since the executor isn't running, no task runs have been started.
    Can you share your scheduler configuration as well?
    Mainly interested in kwargs you're passing to the KubeCluster.

    Marwan Sarieddine

    2 years ago
    I am not specifying a job.yaml - just using the default

    Jim Crist-Harif

    2 years ago
    Do you set scheduler_service_wait_timeout?

    Marwan Sarieddine

    2 years ago
    i.e. here is my DaskKubernetesEnvironment call:
    import os

    from prefect import Flow
    from prefect.environments import DaskKubernetesEnvironment
    from prefect.environments.storage import Docker

    # s3_result is defined elsewhere
    flow = Flow(
        "Data Processing",
        environment=DaskKubernetesEnvironment(
            worker_spec_file="worker_spec.yaml",
            min_workers=1,
            max_workers=10,
        ),
        storage=Docker(
            registry_url=os.environ['GITLAB_REGISTRY'],
            image_name="dask-k8s-flow",
            image_tag="0.1.0",
            python_dependencies=[
                'boto3==1.13.14',
                'numpy==1.18.4',
            ],
        ),
        result=s3_result,
    )

    Jim Crist-Harif

    2 years ago
    ok, thanks

    Marwan Sarieddine

    2 years ago
    “Do you set scheduler_service_wait_timeout?”
    As you can see - no, I don’t explicitly set it

    Jim Crist-Harif

    2 years ago
    You said you're using a remote scheduler - did you set that as part of your dask config then? By default, deploy-mode is "local".

    Marwan Sarieddine

    2 years ago
    I am sorry - if by local you mean the scheduler also runs on the cluster, then that is what I intend

    Jim Crist-Harif

    2 years ago
    There are two deploy modes:
    • local means the scheduler runs in the same process as your prefect flow-runner (which is on k8s)
    • remote means the scheduler gets its own pod
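    For reference, a minimal sketch of how the deploy mode (and the related scheduler service timeout) could be set through dask config - the exact config keys here are assumptions based on the dask-kubernetes version in use, so verify them against your installed version:
    import dask.config

    dask.config.set({
        # assumed dask-kubernetes config keys
        "kubernetes.deploy-mode": "remote",                # default is "local"
        "kubernetes.scheduler-service-wait-timeout": 300,  # only relevant in remote mode
    })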

    Marwan Sarieddine

    2 years ago
    I see

    Jim Crist-Harif

    2 years ago
    Are those logs from your flow run the only output you see from the pod running your flow?

    Marwan Sarieddine

    2 years ago
    That’s the output of calling kubectl get events - I also shared the prefect logs, which basically just state that the flow runner has started.
    Not sure what other output to inspect in this case?

    Jim Crist-Harif

    2 years ago
    If the pod is still running, it'd be useful to see kubectl logs that-pod-name. Prefect only stores its own logs to cloud; there may be other things written to stdout that are missed.

    Marwan Sarieddine

    2 years ago
    I see - sure, will try to look into it - the only annoying thing is that the pods are very ephemeral - sounds like the main culprit is the scheduler timing out prematurely…
    (probably should remove the resource manager to get easier access to the pod logs)

    Jim Crist-Harif

    2 years ago
    If you're running with deploy-mode="local" (which it sounds like you are), here's what I think has happened:
    • The pod running your flow runner starts
    • The flow runner starts
    • The flow runner creates a dask-kubernetes cluster. Since you're running with deploy-mode local, this creates a scheduler process in the same pod.
    • The flow runner submits work to the dask scheduler, which happily accepts it but has no workers to run things on. The scheduler requests worker pods from kubernetes.
    • ???
    • Things hang.
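    In rough code terms, the environment is doing something like the sketch below under the hood (a simplification, not prefect's actual implementation - the spec filename comes from this thread and the adapt bounds mirror min_workers/max_workers):
    from dask_kubernetes import KubeCluster
    from dask.distributed import Client

    # deploy-mode "local": the scheduler starts inside this (flow runner) pod
    cluster = KubeCluster.from_yaml("worker_spec.yaml")
    cluster.adapt(minimum=1, maximum=10)  # worker pods are requested from k8s on demand

    client = Client(cluster)
    # Work submitted through this client just sits on the scheduler until a
    # worker pod lands on a ready node in the target node group.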

    Marwan Sarieddine

    2 years ago
    exactly

    Jim Crist-Harif

    2 years ago
    It's a bit hard to track which of the event log entries above are due to this flow vs other flows, but if that last pod that was deleted was the flow runner's, it's not clear why it would have been deleted. We don't impose time limits on spinning up workers.
    How long have you let this run before killing it? Has the node spun up and sat for a few minutes idle with dask not submitting new workers to it?

    Marwan Sarieddine

    2 years ago
    I let it stay “running” for at least 10 minutes

    Jim Crist-Harif

    2 years ago
    That should easily be enough time. Something odd has happened.

    Marwan Sarieddine

    2 years ago
    Question: regarding “The scheduler requests worker pods from kubernetes” - if the kubernetes cluster takes a while to provide the pods, does the scheduler time out and get killed?

    Jim Crist-Harif

    2 years ago
    The scheduler is running in the local pod, and should hang indefinitely when it has work scheduled but nowhere to run it. It will try to get new workers spun up, but shouldn't time out at any point.

    Marwan Sarieddine

    2 years ago
    Oh wow - I just removed the resource-manager and re-ran to get the pod logs (i.e. prevent pods getting killed), and it looks like it ran just fine

    Jim Crist-Harif

    2 years ago
    A new node spun up, workers were allocated, and the flow ran?

    Marwan Sarieddine

    2 years ago
    yep
    looks like the resource-manager is killing the pod prematurely
    removing the resource-manager is the only change I made here - I am out of other ideas, but I think that is what's causing this issue

    Jim Crist-Harif

    2 years ago
    Oh wow, yep, that's it. That code deletes pending pods willy-nilly. That's a bug.

    josh

    2 years ago
    I would buy that the resource manager is causing an issue. It has a very greedy removal of resources

    Jim Crist-Harif

    2 years ago
    Nice find!
    I'll submit a fix asap

    josh

    2 years ago
    Thanks @Jim Crist-Harif

    Marwan Sarieddine

    2 years ago
    @Jim Crist-Harif thank you so much for helping me debug this