I am facing an issue autoscaling from 0 nodes - us...
# prefect-community
m
I am facing an issue autoscaling from 0 nodes, using the AWS auto-scaler on EKS. Note the issue only arises when autoscaling from 0, not from 1 node. Someone recommended using something like this for waiting on dask worker pods:
Copy code
from dask.distributed import get_client  # assumed source of get_client
from prefect import task

@task
def wait_for_resources():
    client = get_client()
    # Wait until we have 10 workers
    client.wait_for_workers(n_workers=10)
but this doesn't seem to work for waiting on the first node to be present. Has anyone had the chance to try out auto-scaling from 0?
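A variant that at least fails loudly instead of hanging forever might look like the sketch below (the task name, timeout, and polling interval are made up; and as the rest of this thread shows, no task can run before the first worker node exists anyway):
Copy code
import time

from dask.distributed import get_client
from prefect import task


@task
def wait_for_first_worker(timeout: int = 600, poll_interval: int = 5):
    # Sketch only: poll the scheduler until at least one worker registers,
    # and raise instead of hanging forever if the node never arrives.
    client = get_client()
    deadline = time.time() + timeout
    while not client.scheduler_info().get("workers"):
        if time.time() > deadline:
            raise TimeoutError("no dask workers appeared before the timeout")
        time.sleep(poll_interval)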
j
Thanks for the question @Marwan Sarieddine - I've not tried this but will see if others in the team have any ideas.
m
Thanks @Jenny - still trying to figure out things with prefect - any tips would be appreciated
j
Hi @Marwan Sarieddine - one immediate thought is that it wouldn't be possible for a prefect task to wait for nodes to go from 0 -> 1, since there needs to be at least 1 node in order for the flow to even start.
m
Hi @Jenny - the scheduler and agent are running on a different node group than the dask workers (i.e. in the dask worker spec I specify a node affinity to another node group).
j
If that doesn't help, can you give us a bit more information about the issue it creates for you? If the autoscaler is on and configured then once the k8s agent creates the prefect job it should start scaling up.
j
@Marwan Sarieddine Which environment are you using for your flow?
m
Hi @josh - I am using a
DaskKubernetesEnvironment
I guess to better explain my thinking: I was hoping that with prefect I could have a separation of concerns, i.e. the agent/scheduler (what creates the work) running on one node group, and the dask workers (what actually does the work) running on a separate node group. This would allow one to use the same cluster with multiple worker node groups (with different instance types) to run different flows with potentially different resource requirements.
j
Ah I see what you’re saying. I’m not entirely sure how dask-kubernetes handles having the scheduler and workers on different node groups. Looping in @Jim Crist-Harif in case he has an idea 🙂
j
Sorry, I'm having a hard time parsing this issue. dask-kubernetes supports clusters running in different node groups from the client (or schedulers running in different node groups than the workers or the client). None of this should cause issues. Can you restate what you're expecting/hoping to happen with your setup, and how that differs from what you're seeing?
m
Hi @Jim Crist-Harif - sorry if I wasn't clear communicating the issue. What you are saying is true - when I am autoscaling from a nodegroup with a minSize of 1, things work fine. The setup here is to have the dask workers select a node group that autoscales from a minSize of 0 instead of 1. What ends up happening is that, since the node is not ready, the worker pod being scheduled is killed prematurely - please see the event logs below for some more insight into the problem (note the event logs start right after the flow is run):
(specifically please see the last two event logs)
Copy code
4m20s  Normal   Scheduled         pod/prefect-job-815860a0-gzhl6  Successfully assigned default/prefect-job-815860a0-gzhl6 to ip-192-168-35-156.us-west-2.compute.internal
4m20s  Normal   SuccessfulCreate  job/prefect-job-815860a0  Created pod: prefect-job-815860a0-gzhl6
4m20s  Normal   Pulling           pod/prefect-job-815860a0-gzhl6  Pulling image "registry.gitlab.com/xxxx"
4m17s  Normal   Pulled            pod/prefect-job-815860a0-gzhl6  Successfully pulled image "registry.gitlab.com/xxxx"
4m17s  Normal   Created           pod/prefect-job-815860a0-gzhl6  Created container flow
4m16s  Normal   SuccessfulCreate  job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de  Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587
4m16s  Normal   Started           pod/prefect-job-815860a0-gzhl6  Started container flow
4m16s  Normal   Scheduled         pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587  Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 to ip-192-168-17-18.us-west-2.compute.internal
4m15s  Normal   Pulled            pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587  Container image "registry.gitlab.com/xxxx" already present on machine
4m15s  Normal   Created           pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587  Created container flow
4m14s  Normal   Started           pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587  Started container flow
4m4s   Normal   Scheduled         pod/dask-root-7c45f7c9-art6xn  Successfully assigned default/dask-root-7c45f7c9-art6xn to ip-192-168-35-156.us-west-2.compute.internal
4m3s   Normal   Created           pod/dask-root-7c45f7c9-art6xn  Created container dask-worker
4m3s   Normal   Pulled            pod/dask-root-7c45f7c9-art6xn  Container image "registry.gitlab.com/xxxx" already present on machine
4m3s   Normal   Started           pod/dask-root-7c45f7c9-art6xn  Started container dask-worker
4m     Normal   Killing           pod/dask-root-7c45f7c9-art6xn  Stopping container dask-worker
3m53s  Normal   Scheduled         pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9  Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 to ip-192-168-35-156.us-west-2.compute.internal
3m53s  Normal   SuccessfulCreate  job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de  Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9
3m52s  Normal   Pulled            pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9  Container image "registry.gitlab.com/xxxx" already present on machine
3m52s  Normal   Created           pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9  Created container flow
3m51s  Normal   Started           pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9  Started container flow
3m3s   Warning  FailedScheduling  pod/dask-root-0f923f19-62vfb8  0/2 nodes are available: 2 node(s) didn't match node selector.
3m33s  Normal   TriggeredScaleUp  pod/dask-root-0f923f19-62vfb8  pod triggered scale-up: [{eksctl-prefect-eks-test-nodegroup-eks-cpu-2-NodeGroup-1PDZRBRX9TE4I 0->1 (max: 10)}]
3m33s  Normal   Killing           pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9  Stopping container flow
2m52s  Normal   NodeHasNoDiskPressure    node/ip-192-168-81-183.us-west-2.compute.internal  Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
2m52s  Normal   NodeHasSufficientMemory  node/ip-192-168-81-183.us-west-2.compute.internal  Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientMemory
2m52s  Normal   NodeAllocatableEnforced  node/ip-192-168-81-183.us-west-2.compute.internal  Updated Node Allocatable limit across pods
2m52s  Normal   NodeHasSufficientPID     node/ip-192-168-81-183.us-west-2.compute.internal  Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientPID
2m49s  Warning  FailedScheduling  pod/dask-root-0f923f19-62vfb8  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
2m53s  Normal   Starting          node/ip-192-168-81-183.us-west-2.compute.internal  Starting kubelet.
2m51s  Normal   RegisteredNode    node/ip-192-168-81-183.us-west-2.compute.internal  Node ip-192-168-81-183.us-west-2.compute.internal event: Registered Node ip-192-168-81-183.us-west-2.compute.internal in Controller
2m48s  Normal   Starting          node/ip-192-168-81-183.us-west-2.compute.internal  Starting kube-proxy.
2m33s  Warning  FailedScheduling  pod/dask-root-0f923f19-62vfb8  skip schedule deleting pod: default/dask-root-0f923f19-62vfb8
2m32s  Normal   NodeReady         node/ip-192-168-81-183.us-west-2.compute.internal  Node ip-192-168-81-183.us-west-2.compute.internal status
if I rerun the flow - and that single spawned node is now ready - the flow runs fine
I think this warning:
Copy code
2m49s  Warning  FailedScheduling  pod/dask-root-0f923f19-62vfb8  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
the taint effect is probably NoSchedule because the node is still not ready … hence killing the dask worker pod prematurely
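One quick way to confirm that hypothesis is sketched below with the kubernetes Python client (an assumption of mine, not something from this setup; it presumes kubeconfig access to the cluster): print each node's taints and check whether the freshly scaled-up node still carries a not-ready taint while it boots.
Copy code
from kubernetes import client, config

# Sketch, assuming local kubeconfig access: list each node's taints to see
# whether the new node still has a not-ready / unreachable taint, which
# would explain the FailedScheduling warning above.
config.load_kube_config()
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    print(node.metadata.name, node.spec.taints)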
j
Can you share your dask-kubernetes configuration? I'm struggling to find what's triggering that pod deletion - afaict we don't have timeouts anywhere for pod startups.
m
sure - should I share the worker spec?
j
Yeah, and whether you're running the scheduler remote or local
m
remote scheduler and please see the worker spec below:
Copy code
kind: Pod
metadata:
  labels:
    app: prefect-dask-worker
spec:
  replicas: 2
  restartPolicy: Never
  imagePullSecrets:
  - name: gitlab-secret
  # note I tried using both affinity and a selector
  # nodeSelector:
  #   role: supplement
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: role
            operator: In
            values:
            - supplement
  containers:
    - image: registry.gitlab.com/xxxx
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, "1", --no-bokeh, --memory-limit, 4GB]
      name: dask-worker
      env:
        - name: AWS_BUCKET
          value: xxxx
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_SECRET_ACCESS_KEY
      resources:
        limits:
          cpu: "2000m"
          memory: 4G
        requests:
          cpu: "1000m"
          memory: 2G
j
What happens with your
wait_for_resources()
task? Does it hang forever? Or does it error? Or return early but you don't have workers?
m
it just hangs forever
actually none of the flow tasks look like they are triggered (including wait_for_resources)
j
You're saying that
wait_for_resources
hasn't been called at all? Then you're likely blocking on creating the initial scheduler pod, not any of the workers.
m
Copy code
22 May 2020,01:54:21 	prefect.CloudFlowRunner	INFO	Beginning Flow run for 'Data Processing'
22 May 2020,01:54:21 	prefect.CloudFlowRunner	INFO	Starting flow run.
22 May 2020,01:54:21 	prefect.CloudFlowRunner	DEBUG	Flow 'Data Processing': Handling state change from Scheduled to Running
these are the logs - yes it hasn’t been called at all
I see - how's that possible if the logs show a state change? So I guess that's being triggered by the agent - correct?
j
The flow runner has started, but since the executor isn't running, no task runs have been started.
Can you share your scheduler configuration as well?
Mainly interested in kwargs you're passing to the KubeCluster.
m
I am not specifying a job.yaml - just using the default
j
Do you set scheduler_service_wait_timeout?
m
i.e. here is my
DaskKubernetesEnvironment
call:
Copy code
import os

from prefect import Flow
from prefect.environments import DaskKubernetesEnvironment
from prefect.environments.storage import Docker

flow = Flow(
    "Data Processing",
    environment=DaskKubernetesEnvironment(
        worker_spec_file="worker_spec.yaml",
        min_workers=1,
        max_workers=10,
    ),
    storage=Docker(
        registry_url=os.environ["GITLAB_REGISTRY"],
        image_name="dask-k8s-flow",
        image_tag="0.1.0",
        python_dependencies=[
            "boto3==1.13.14",
            "numpy==1.18.4",
        ],
    ),
    result=s3_result,  # s3_result is defined elsewhere in the flow module
)
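For completeness, a hypothetical next step for the flow above (the project name is made up): registering it with the backend is what lets the k8s agent pick up runs and create the prefect-job seen in the events.
Copy code
# Hypothetical usage of the flow defined above; the project name is made up.
flow.register(project_name="data-processing")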
j
ok, thanks
m
“Do you set scheduler_service_wait_timeout?” - as you can see, no, I don't explicitly set it
j
You said you're using a remote scheduler, did you set that as part of your dask config then?
By default deploy-mode is "local"
m
I am sorry - if by local you mean the scheduler also runs on the cluster, then that is what I intend
j
There are two deploy modes:
• local means the scheduler runs in the same process as your prefect flow-runner (which is on k8s)
• remote means the scheduler gets its own pod
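For reference, switching that through dask's config might look like the sketch below; the "kubernetes.deploy-mode" key assumes the classic dask-kubernetes config schema and may differ between versions:
Copy code
import dask.config

# Assumption: dask-kubernetes reads "kubernetes.deploy-mode" from dask's
# config. "remote" gives the scheduler its own pod; "local" (the default)
# keeps it in the flow-runner process.
dask.config.set({"kubernetes.deploy-mode": "remote"})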
m
I see
j
Are those logs from your flow run the only output you see from the pod running your flow?
m
that's the output of calling
kubectl get events
- I also shared the prefect logs, which basically just state that the flow runner has started
not sure what other output to inspect in this case?
j
If the pod is still running, it'd be useful to see
kubectl logs that-pod-name
. Prefect only stores its own logs to cloud; there may be other things written to stdout that are missed.
m
I see - sure will try to look into it - only annoying thing is the pods are very ephemeral - sounds like the main culprit is the scheduler timing out prematurely…
(I probably should remove the resource manager to get easier access to the pod logs)
j
If you're running with
deploy-mode="local"
(which it sounds like you are), here's what I think has happened:
• The pod running your flow runner starts
• The flow runner starts
• The flow runner creates a dask-kubernetes cluster. Since you're running with deploy-mode local, this creates a scheduler process in the same pod.
• The flow runner submits work to the dask scheduler, which happily accepts it but has no workers to run things on. The scheduler requests worker pods from kubernetes
• ???
• Things hang.
m
exactly
j
It's a bit hard to tell which of the event log lines above are due to this flow vs other flows, but if that last pod that was deleted was for the flow runner, it's not clear why it would have been deleted. We don't impose time limits on spinning up workers.
How long have you let this run before killing it? Has the node spun up and sat for a few minutes idle with dask not submitting new workers to it?
m
I let it stay “running” for at least 10 minutes
j
That should easily be enough time. Something odd has happened.
m
question: “The scheduler requests worker pods from kubernetes” - if the kubernetes cluster takes a while to provide the pods, does the scheduler time out and get killed?
j
The scheduler is running in the local pod, and should hang indefinitely when it has work scheduled but nowhere to run it. It will try to get new workers spun up, but shouldn't time out at any point.
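A tiny local illustration of that behaviour (a sketch using a plain LocalCluster rather than dask-kubernetes): submitted work just stays pending until a worker appears, with no timeout anywhere.
Copy code
from dask.distributed import Client, LocalCluster

# Start a cluster with no workers at all; the scheduler still accepts work.
cluster = LocalCluster(n_workers=0)
client = Client(cluster)

fut = client.submit(sum, [1, 2, 3])
print(fut.status)    # "pending" - nowhere to run it, but nothing times out

cluster.scale(1)     # once a worker shows up, the task runs
print(fut.result())  # 6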
m
Oh wow - I just removed the resource-manager and re-ran to get the pod logs (i.e. to prevent pods getting killed) and it looks like it ran just fine
j
A new node spun up, workers were allocated, and the flow ran?
m
yep
looks like the resource-manager is killing the pod prematurely
removing the resource-manager is the only change I made here - so I am out of ideas - but I think it is causing this issue
j
Oh wow, yep, that's it. That code deletes pending pods willy-nilly. That's a bug.
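For context, the failure mode is roughly the pattern sketched below - a hypothetical illustration, not the actual Prefect resource-manager code: a cleanup loop that treats every non-Running pod as garbage will also delete Pending worker pods that are merely waiting for a new node.
Copy code
from kubernetes import client, config

# Hypothetical illustration only, NOT the actual resource-manager code:
# deleting every pod that is not yet Running also removes Pending worker
# pods that are simply waiting for a freshly scaled-up node to become Ready.
config.load_incluster_config()
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="default").items:
    if pod.status.phase != "Running":  # healthy "Pending" pods get caught too
        v1.delete_namespaced_pod(name=pod.metadata.name, namespace="default")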
j
I would buy that the resource manager is causing an issue. It has a very greedy removal of resources
j
Nice find!
I'll submit a fix asap
j
Thanks @Jim Crist-Harif
m
@Jim Crist-Harif thank you so much for helping me debug this