Marwan Sarieddine
05/22/2020, 4:36 PM
@task
def wait_for_resources():
    client = get_client()
    # Wait until we have 10 workers
    client.wait_for_workers(n_workers=10)
but this doesn’t seem to work for waiting on the first node to be present
Has anyone had the chance to try out auto-scaling from 0?
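One possible workaround, sketched here on the assumption that distributed's get_client and scheduler_info are available inside the task (the task body, timeout, and poll interval are illustrative, and it still needs at least one worker to come up so the task itself can run):
import time

from distributed import get_client
from prefect import task
from prefect.engine.signals import FAIL


@task
def wait_for_resources(n_workers: int = 10, timeout: int = 600):
    # Poll the scheduler instead of blocking in wait_for_workers, so the
    # task can fail loudly rather than hang if the cluster never scales up.
    client = get_client()
    deadline = time.time() + timeout
    while len(client.scheduler_info().get("workers", {})) < n_workers:
        if time.time() > deadline:
            raise FAIL(f"timed out waiting for {n_workers} workers")
        time.sleep(5)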
Jenny
05/22/2020, 4:59 PM
Marwan Sarieddine
05/22/2020, 5:00 PM
Jenny
05/22/2020, 5:11 PM
Marwan Sarieddine
05/22/2020, 5:12 PM
Jenny
05/22/2020, 5:12 PM
josh
05/22/2020, 5:24 PM
Marwan Sarieddine
05/22/2020, 5:40 PM
DaskKubernetesEnvironment
I guess to better explain my thinking: I was hoping that using Prefect I could have a separation of concerns, i.e. the agent/scheduler (what creates the work) running on one node group, and the workers (what actually does the work) running on a separate node group. This would allow one to use the same cluster with multiple worker node groups (with different instance types) to run different flows (with potentially different resource requirements).
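To make that concrete, a rough sketch of the idea (the flow names, spec file names, and node-group labels are hypothetical; worker_spec_file, min_workers, and max_workers are the same DaskKubernetesEnvironment arguments used later in this thread):
from prefect import Flow
from prefect.environments import DaskKubernetesEnvironment

# Each worker spec would pin its pods to a different node group, e.g. via a
# nodeSelector / node affinity on a label such as role: cpu-workers.
cpu_flow = Flow(
    "cpu-heavy-flow",
    environment=DaskKubernetesEnvironment(
        worker_spec_file="worker_spec_cpu.yaml",
        min_workers=1,
        max_workers=10,
    ),
)

highmem_flow = Flow(
    "memory-heavy-flow",
    environment=DaskKubernetesEnvironment(
        worker_spec_file="worker_spec_highmem.yaml",
        min_workers=1,
        max_workers=4,
    ),
)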
josh
05/22/2020, 5:43 PM
Jim Crist-Harif
05/22/2020, 5:50 PM
Marwan Sarieddine
05/22/2020, 5:58 PM
4m20s Normal Scheduled pod/prefect-job-815860a0-gzhl6 Successfully assigned default/prefect-job-815860a0-gzhl6 to ip-192-168-35-156.us-west-2.compute.internal
4m20s Normal SuccessfulCreate job/prefect-job-815860a0 Created pod: prefect-job-815860a0-gzhl6
4m20s Normal Pulling pod/prefect-job-815860a0-gzhl6 Pulling image "registry.gitlab.com/xxxx"
4m17s Normal Pulled pod/prefect-job-815860a0-gzhl6 Successfully pulled image "registry.gitlab.com/xxxx"
4m17s Normal Created pod/prefect-job-815860a0-gzhl6 Created container flow
4m16s Normal SuccessfulCreate job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587
4m16s Normal Started pod/prefect-job-815860a0-gzhl6 Started container flow
4m16s Normal Scheduled pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 to ip-192-168-17-18.us-west-2.compute.internal
4m15s Normal Pulled pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 Container image "registry.gitlab.com/xxxx" already present on machine
4m15s Normal Created pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 Created container flow
4m14s Normal Started pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-vs587 Started container flow
4m4s Normal Scheduled pod/dask-root-7c45f7c9-art6xn Successfully assigned default/dask-root-7c45f7c9-art6xn to ip-192-168-35-156.us-west-2.compute.internal
4m3s Normal Created pod/dask-root-7c45f7c9-art6xn Created container dask-worker
4m3s Normal Pulled pod/dask-root-7c45f7c9-art6xn Container image "registry.gitlab.com/xxxx" already present on machine
4m3s Normal Started pod/dask-root-7c45f7c9-art6xn Started container dask-worker
4m Normal Killing pod/dask-root-7c45f7c9-art6xn Stopping container dask-worker
3m53s Normal Scheduled pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 Successfully assigned default/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 to ip-192-168-35-156.us-west-2.compute.internal
3m53s Normal SuccessfulCreate job/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de Created pod: prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9
3m52s Normal Pulled pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 Container image "registry.gitlab.com/xxxx" already present on machine
3m52s Normal Created pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 Created container flow
3m51s Normal Started pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 Started container flow
3m3s Warning FailedScheduling pod/dask-root-0f923f19-62vfb8 0/2 nodes are available: 2 node(s) didn't match node selector.
3m33s Normal TriggeredScaleUp pod/dask-root-0f923f19-62vfb8 pod triggered scale-up: [{eksctl-prefect-eks-test-nodegroup-eks-cpu-2-NodeGroup-1PDZRBRX9TE4I 0->1 (max: 10)}]
3m33s Normal Killing pod/prefect-dask-job-fb5921cb-d719-4962-b1ea-f0fe5539c8de-f4wc9 Stopping container flow
2m52s Normal NodeHasNoDiskPressure node/ip-192-168-81-183.us-west-2.compute.internal Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasNoDiskPressure
2m52s Normal NodeHasSufficientMemory node/ip-192-168-81-183.us-west-2.compute.internal Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientMemory
2m52s Normal NodeAllocatableEnforced node/ip-192-168-81-183.us-west-2.compute.internal Updated Node Allocatable limit across pods
2m52s Normal NodeHasSufficientPID node/ip-192-168-81-183.us-west-2.compute.internal Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeHasSufficientPID
2m49s Warning FailedScheduling pod/dask-root-0f923f19-62vfb8 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
2m53s Normal Starting node/ip-192-168-81-183.us-west-2.compute.internal Starting kubelet.
2m51s Normal RegisteredNode node/ip-192-168-81-183.us-west-2.compute.internal Node ip-192-168-81-183.us-west-2.compute.internal event: Registered Node ip-192-168-81-183.us-west-2.compute.internal in Controller
2m48s Normal Starting node/ip-192-168-81-183.us-west-2.compute.internal Starting kube-proxy.
2m33s Warning FailedScheduling pod/dask-root-0f923f19-62vfb8 skip schedule deleting pod: default/dask-root-0f923f19-62vfb8
2m32s Normal NodeReady node/ip-192-168-81-183.us-west-2.compute.internal Node ip-192-168-81-183.us-west-2.compute.internal status is now: NodeReady
2m49s Warning FailedScheduling pod/dask-root-0f923f19-62vfb8 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match node selector.
the taint effect is probably NoSchedule because the node is still not ready … hence killing the dask worker pod prematurely
Jim Crist-Harif
05/22/2020, 6:09 PM
Marwan Sarieddine
05/22/2020, 6:09 PM
Jim Crist-Harif
05/22/2020, 6:09 PM
Marwan Sarieddine
05/22/2020, 6:11 PM
kind: Pod
metadata:
  labels:
    app: prefect-dask-worker
spec:
  replicas: 2
  restartPolicy: Never
  imagePullSecrets:
    - name: gitlab-secret
  # note I tried using both affinity and a selector
  # nodeSelector:
  #   role: supplement
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: role
                operator: In
                values:
                  - supplement
  containers:
    - image: registry.gitlab.com/xxxx
      imagePullPolicy: IfNotPresent
      args: [dask-worker, --nthreads, "1", --no-bokeh, --memory-limit, 4GB]
      name: dask-worker
      env:
        - name: AWS_BUCKET
          value: xxxx
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_ACCESS_KEY_ID
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-secret
              key: AWS_SECRET_ACCESS_KEY
      resources:
        limits:
          cpu: "2000m"
          memory: 4G
        requests:
          cpu: "1000m"
          memory: 2G
Jim Crist-Harif
05/22/2020, 6:13 PM
What happens with the wait_for_resources() task? Does it hang forever? Or does it error? Or return early but you don't have workers?
Marwan Sarieddine
05/22/2020, 6:13 PM
Jim Crist-Harif
05/22/2020, 6:16 PM
So wait_for_resources hasn't been called at all? Then you're likely blocking on creating the initial scheduler pod, not any of the workers.
Marwan Sarieddine
05/22/2020, 6:16 PM
22 May 2020,01:54:21 prefect.CloudFlowRunner INFO Beginning Flow run for 'Data Processing'
22 May 2020,01:54:21 prefect.CloudFlowRunner INFO Starting flow run.
22 May 2020,01:54:21 prefect.CloudFlowRunner DEBUG Flow 'Data Processing': Handling state change from Scheduled to Running
Jim Crist-Harif
05/22/2020, 6:21 PM
Marwan Sarieddine
05/22/2020, 6:22 PM
Jim Crist-Harif
05/22/2020, 6:22 PM
Marwan Sarieddine
05/22/2020, 6:22 PM
Here is my DaskKubernetesEnvironment call:
Flow(
    "Data Processing",
    environment=DaskKubernetesEnvironment(
        worker_spec_file="worker_spec.yaml",
        min_workers=1,
        max_workers=10,
    ),
    storage=Docker(
        registry_url=os.environ['GITLAB_REGISTRY'],
        image_name="dask-k8s-flow",
        image_tag="0.1.0",
        python_dependencies=[
            'boto3==1.13.14',
            'numpy==1.18.4'
        ]
    ),
    result=s3_result,
)
Jim Crist-Harif
05/22/2020, 6:23 PM
Marwan Sarieddine
05/22/2020, 6:23 PM
> Do you set scheduler_service_wait_timeout?
As you can see - no - I don’t explicitly set it
Jim Crist-Harif
05/22/2020, 6:25 PM
Marwan Sarieddine
05/22/2020, 6:27 PM
Jim Crist-Harif
05/22/2020, 6:28 PM
Marwan Sarieddine
05/22/2020, 6:29 PM
Jim Crist-Harif
05/22/2020, 6:30 PM
Marwan Sarieddine
05/22/2020, 6:31 PM
I shared the kubectl get events output above - I also shared the prefect logs, which basically just state that the flow runner has started
Jim Crist-Harif
05/22/2020, 6:33 PM
Can you check the pod logs with kubectl logs that-pod-name? Prefect only stores its own logs to cloud, there may be other things written to stdout that are missed.
Marwan Sarieddine
05/22/2020, 6:35 PM
Jim Crist-Harif
05/22/2020, 6:38 PM
If you're running with deploy-mode="local" (which it sounds like you are), here's what I think has happened:
• The pod running your flow runner starts
• The flow runner starts
• The flow runner creates a dask-kubernetes cluster. Since you're running with deploy-mode local, this creates a scheduler process in the same pod.
• The flow runner submits work to the dask scheduler, which happily accepts it but has no workers to run things on. The scheduler requests worker pods from kubernetes
• ???
• Things hang.
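Roughly, that sequence in terms of the classic dask-kubernetes KubeCluster API (an illustration, not Prefect's actual source; the spec file name, adapt bounds, and the deploy_mode keyword are assumptions about the versions in use):
from dask_kubernetes import KubeCluster
from distributed import Client

# deploy-mode local: the scheduler runs inside this same pod,
# i.e. the flow-runner pod created by the agent.
cluster = KubeCluster.from_yaml("worker_spec.yaml", deploy_mode="local")
cluster.adapt(minimum=1, maximum=10)  # asks Kubernetes for worker pods

client = Client(cluster)
# The scheduler accepts submitted work even with zero workers connected;
# if the worker pods never become schedulable, this blocks indefinitely.
future = client.submit(lambda x: x + 1, 1)
print(future.result())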
Marwan Sarieddine
05/22/2020, 6:39 PM
Jim Crist-Harif
05/22/2020, 6:40 PM
Marwan Sarieddine
05/22/2020, 6:41 PM
Jim Crist-Harif
05/22/2020, 6:42 PM
Marwan Sarieddine
05/22/2020, 6:43 PM
Jim Crist-Harif
05/22/2020, 6:44 PM
Marwan Sarieddine
05/22/2020, 6:44 PM
Jim Crist-Harif
05/22/2020, 6:45 PM
Marwan Sarieddine
05/22/2020, 6:45 PM
Jim Crist-Harif
05/22/2020, 6:50 PM
josh
05/22/2020, 6:50 PM
Jim Crist-Harif
05/22/2020, 6:50 PM
josh
05/22/2020, 6:50 PM
Marwan Sarieddine
05/22/2020, 6:52 PM