# ask-community
h
I've managed to get around most of the problems I had with retries and stability on Dask, but this one eludes me. I'm getting the
KilledWorker
error which seemingly fails the whole flow. Despite this, the workers are alive and fine (more in thread)
dask-worker-d4cdcb698-n9w86             1/1     Running   0          11m
dask-worker-d4cdcb698-p7f4c             1/1     Running   0          13m
dask-worker-d4cdcb698-qljjm             1/1     Running   0          5m30s
dask-worker-d4cdcb698-qpbpp             1/1     Running   0          8m9s
dask-worker-d4cdcb698-rrsf2             1/1     Running   0          14m
I've set GCSResult on most things, and the flow continues to run as planned, so I'm not sure what the killed worker was, why it fails the whole flow because of it, or what I can do about it.
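For reference, a minimal sketch of the GCSResult setup being referred to (Prefect 1.x); the bucket name and task body are hypothetical, not from the thread:

```python
from prefect import task
from prefect.engine.results import GCSResult

# Checkpoint the task's return value to GCS so it can be restored on retry.
@task(result=GCSResult(bucket="my-flow-results"), checkpoint=True)
def fit_spend2txs_model():
    ...  # long-running model fit
```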
Previously I ran preemptible nodes, but I'm happy to wait for this to be resolved before going back to them.
What I do have, however, is a HorizontalPodAutoscaler that scales the workers up and down depending on how much CPU they are consuming. This makes the prefect job (the Dask client) log workers being added and removed, which in turn might be hitting a buggy code path (?) that causes the "KilledWorker" message?
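And roughly how the prefect job acts as the Dask client here: a sketch, not the actual flow, with the DaskExecutor pointed at the external scheduler Service (the address matches the scheduler setup shared later in the thread). HPA-driven pod churn then shows up as the "Worker ... added/removed" lines below.

```python
from prefect import Flow
from prefect.executors import DaskExecutor

with Flow(
    "run_mmm",
    executor=DaskExecutor(address="tcp://dask-scheduler.flows.svc:8786"),
) as flow:
    fit_spend2txs_model()  # the task sketched above
```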
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:19:29+0000] DEBUG - prefect.DaskExecutor | Worker <tcp://10.4.6.3:42033> removed
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:19:33+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:19:35+0000] DEBUG - prefect.DaskExecutor | Worker <tcp://10.4.10.4:44069> removed
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:19:48+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:20:04+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:20:12+0000] DEBUG - prefect.DaskExecutor | Worker <tcp://10.4.10.4:44007> added
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:20:19+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:20:25+0000] DEBUG - prefect.DaskExecutor | Worker <tcp://10.4.6.3:41491> added
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:20:34+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:20:49+0000] DEBUG - prefect.CloudFlowRunner | Checking flow run state...

later

prefect-job-9c43910a-v6pqj flow [2021-11-27 12:31:52+0000] ERROR - prefect.CloudFlowRunner | Unexpected error: KilledWorker('fit_spend2txs_model-4-2e39f830c95848798566c200821d6e9a', <WorkerState '<tcp://10.4.6.3:38793>', name: <tcp://10.4.6.3:38793>, status: closed, memory: 0, processing: 11>)
prefect-job-9c43910a-v6pqj flow Traceback (most recent call last):
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/prefect/engine/runner.py", line 48, in inner
prefect-job-9c43910a-v6pqj flow     new_state = method(self, state, *args, **kwargs)
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/prefect/engine/flow_runner.py", line 643, in get_flow_run_state
prefect-job-9c43910a-v6pqj flow     final_states = executor.wait(
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/prefect/executors/dask.py", line 440, in wait
prefect-job-9c43910a-v6pqj flow     return self.client.gather(futures)
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1969, in gather
prefect-job-9c43910a-v6pqj flow     return self.sync(
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 865, in sync
prefect-job-9c43910a-v6pqj flow     return sync(
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/distributed/utils.py", line 327, in sync
prefect-job-9c43910a-v6pqj flow     raise exc.with_traceback(tb)
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/distributed/utils.py", line 310, in f
prefect-job-9c43910a-v6pqj flow     result[0] = yield future
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/tornado/gen.py", line 762, in run
prefect-job-9c43910a-v6pqj flow     value = future.result()
prefect-job-9c43910a-v6pqj flow   File "/usr/local/lib/python3.9/site-packages/distributed/client.py", line 1834, in _gather
prefect-job-9c43910a-v6pqj flow     raise exception.with_traceback(traceback)
prefect-job-9c43910a-v6pqj flow distributed.scheduler.KilledWorker: ('fit_spend2txs_model-4-2e39f830c95848798566c200821d6e9a', <WorkerState '<tcp://10.4.6.3:38793>', name: <tcp://10.4.6.3:38793>, status: closed, memory: 0, processing: 11>)
prefect-job-9c43910a-v6pqj flow [2021-11-27 12:31:52+0000] DEBUG - prefect.CloudFlowRunner | Flow 'run_mmm': Handling state change from Running to Failed
a
How did you define the HorizontalPodAutoscaler? The memory: 0 in the error from client gather still looks to me like not enough memory being allocated on the worker, or did I misunderstand? If you believe it’s some error on the Dask executor side, would you mind creating an issue for it and giving us info on the circumstances under which it appeared (Prefect & distributed versions, Dask setup, etc.)? But if you think it’s still a dask distributed error similar to the previous one you had, perhaps it makes sense to add it to your issue in the Dask repo?
@haf I think this documentation page provides a solution to your problem, or at least gives hints about what you should check: http://distributed.dask.org/en/stable/killed.html
h
@Anna Geller Thank you for your replies. It was defined with
minReplicas: 3, maxReplicas: 9
but I've removed it and made the cluster a permanent 5 replicas for now. I honestly don't know where this problem might be: whether it's Dask or Prefect or something else. Perhaps when Orion is finished and it can ship traces (and possibly Dask traces too?) this will be easier to debug. Bumping the thread count on the Dask worker (nthreads) and making the workers permanent has made things more stable for now, and I got a couple of successful runs (2h 30m)
a
thanks for the update, nice work!
h
Thank you. I hope I can spend more time in the future debugging why scaling the servers in and out and using preemptibles fails.
Update on the KilledWorker; it still happens but not as frequently. This is the best log entry I've found of it:
distributed.scheduler - INFO - Task infer_quantities-8-56a955e4e6224f5f9822da85821c36f5 marked as failed because 3 workers died while trying to run it
The only problem is that I can't find any logs from the worker about this, and most runs succeed. Hmm
I know it was not killed by the Nanny (sounds like a board game!), because the "Worker exceeded X memory budget" message is not present in the logs. I also can't find "End worker", from the "Worker chose to exit" section.
Seeing a stack trace would be nice, but I'm not seeing one
I don't think it OOMed, because I've given them 30 GiB each and the workload should run in less than 3 GiB.
a
interesting! Were you able to see in the logs if this task/job “infer_quantities-8-56a955e4e6224f5f9822da85821c36f5” was retried on another worker? A retry on another worker seems to be what Dask should do in this case.
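A possibly relevant knob here, an assumption on my part rather than something stated in the thread: the scheduler gives up on a task once it has been on distributed.scheduler.allowed-failures dying workers (default 3), which lines up with the "3 workers died while trying to run it" message above. A minimal sketch of raising it:

```python
# Hedged sketch: allowed-failures is how many times a task may be on a dying
# worker before the scheduler raises KilledWorker. It has to be set in the
# process that starts the scheduler, not in the flow; the value is illustrative.
import dask

dask.config.set({"distributed.scheduler.allowed-failures": 10})

# On the scheduler pod the same thing can likely be done via an environment
# variable, something like DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=10.
```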
h
No, I didn't find any retry in this case
I didn't dig that deep for it, as there are a great many log lines per second, but a cursory search for
ies-8
yielded no results
👍 1
OK, so back to debugging here: now it's happening once a day. I've gotten metrics up and running, and it correlates with all 16 cores being fully used across all nodes in the cluster, plus a whole lot of context switching and network traffic
Any ideas?
k
If it’s memory that’s the issue, I have ideas, but not for CPU. It looks like it’s the CPU killing the worker. This is the first time I’ve seen this kind of thing
h
Exciting 🙂
It's literally thinking itself to death?
However, the algorithm is not written so well that it should be consuming all 16 cores at 100% for such long periods, so it's more likely we're actually stuck in a tight loop in library code (most likely in Dask)
k
I think so, right? Sorry, but I don’t really have any ideas on this. Are you using a mapped task that shares common inputs? Like, do you have a small dataframe that all the mapped tasks use?
h
For the task that's running when it crashes (the one that runs for 2 hours non-stop on four cores), there are two or three dataframes in play, unique to that task index
k
Do you load them inside the task or are they passed in?
h
They are passed in from other tasks.
The aim was to make that loading cacheable but I don't think I've succeeded in doing that despite using Results.
k
Are they pandas DataFrames or Dask DataFrames?
h
Pandas for now
k
If network usage is high along with CPU, I would explore how to reduce that. Can the task read them in maybe? This is all just a guess though
I thought you potentially had one dataframe that kept moving around or being copied for tasks. For that, you could try using Dask
scatter
to move it to the workers ahead of time
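A generic (non-Prefect) sketch of the scatter pattern being described, with stand-in data and a stand-in fit function; the scheduler address is the Service used elsewhere in the thread:

```python
from dask.distributed import Client
import pandas as pd

client = Client("tcp://dask-scheduler.flows.svc:8786")

small_df = pd.DataFrame({"spend": [1.0, 2.0], "txs": [3, 4]})  # stand-in data

def fit_chain(df, chain_id):  # stand-in for the real per-chain model fit
    return chain_id, len(df)

# Place one copy of the dataframe on every worker up front, then pass the
# resulting future to submit() so the data is not re-serialised for each task.
df_on_workers = client.scatter(small_df, broadcast=True)
futures = [client.submit(fit_chain, df_on_workers, i) for i in range(4)]
results = client.gather(futures)
```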
h
Looks like this [screenshot]: so I think only CPU is high
But I mean: 5 MiB
But there are no real errors in the logs from the worker, and while it takes a pod-wide lock (which fails once in a while), the task should be retried (and then it mostly works)
And Dask only reports nominal CPU usage
k
That’s even weirder (the Dask CPU Utilization)
h
Yes
g
Hey all! Catching up on the conversation
d
Which versions of dask and distributed are you using? Could it be that dask is rebalancing its cached objects? How many dask workers do you have per pod?
h
We're using dask 2021.10.0 (py3.9) and distributed 2021.10.0.
FROM daskdev/dask:2021.10.0-py3.9

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt/app

ENV POETRY_VERSION=1.1.11

RUN pip install "poetry==$POETRY_VERSION"

COPY poetry.lock pyproject.toml postinstall.py ./
COPY --chown=1000:100 infer ./infer

RUN POETRY_VIRTUALENVS_CREATE=false poetry install --no-interaction --no-ansi

RUN python postinstall.py
We run with
nprocs=2
per pod, with
nthreads=16
and we have five pods, sticky to five VMs, as the only workload pods running (monitoring pods for extracting system metrics run side-by-side).
dask-worker-85784599b8-2z79p dask-worker distributed.nanny - INFO - Worker process 2772 was killed by signal 11
dask-worker-85784599b8-2z79p dask-worker distributed.nanny - WARNING - Restarting worker
Is there any way to trace what's going on?
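One hedged idea for tracing the signal 11 (segfault) restarts: enable Python's faulthandler in every worker, so a Python traceback is written to stderr (and hence the pod logs) when the process dies on a fatal signal. The file path and wiring below are assumptions, not something from the thread; the simplest variant is just setting PYTHONFAULTHANDLER=1 on the worker pods.

```python
# faulthandler_preload.py (hypothetical path), loaded with:
#   dask-worker ... --preload /opt/app/faulthandler_preload.py
import faulthandler

def dask_setup(worker):
    # Dump a Python traceback to stderr on SIGSEGV/SIGABRT/SIGBUS/SIGILL/SIGFPE,
    # so the crash at least leaves a trace in the worker pod logs.
    faulthandler.enable()
```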
k
We’ll see if davzucky has ideas. This is a bit beyond us on the Prefect side
h
Any progress on this?
d
@haf Sorry for the delay, and sorry for what I'm about to ask. It will be a lot of questions...
• When the KilledWorker happens, what do you see on the dask scheduler side?
• Do you see the problem as a dask worker being killed, or as a heartbeat problem?
• Why are you not running more pods with 1 CPU each rather than pods with a higher CPU limit?
🙏 1
The first thing we need is a stable run without the HPA; the HPA is another beast we can look at later.
• Can you share the kubernetes conf of your dask worker and dask scheduler setup?
• Is your prefect agent running on the same kubernetes cluster as the dask cluster?
One more comment: I've found running with the Nanny on kubernetes to be unreliable. I prefer having pods with a single one-CPU worker each and letting kubernetes do the management. The Nanny can be noisy on the node.
h
When the KilledWorker happens, what do you see on the dask scheduler side?
distributed.scheduler - INFO - Task infer_quantities-8-56a955e4e6224f5f9822da85821c36f5 marked as failed because 3 workers died while trying to run it
Do you see the problem as a dask worker being killed, or as a heartbeat problem?
I don't know. My guess is that somewhere in the library code there's a code path that gets hit and causes CPU to spike. Since it's using up all 16 cores, it's probably a re-entrant / async bit of code. That effectively makes comms stop and the heartbeat time out.
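If that heartbeat hypothesis is right, one knob worth checking (an assumption, not something discussed in the thread) is the scheduler's worker-ttl, which bounds how long a worker may go without heartbeating before the scheduler declares it dead. A sketch with an illustrative value, to be set where the scheduler process starts:

```python
import dask

# Illustrative value only; give CPU-saturated workers more slack before
# the scheduler removes them for missed heartbeats.
dask.config.set({"distributed.scheduler.worker-ttl": "15 minutes"})
```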
Why are you not running more pods with 1 CPU each rather than pods with a higher CPU limit?
My thinking was that we'd do parallelism both in the small (data level) and in the large (process/task level). When this thread was started we only had it in the large; now we've rebuilt all of this in plain python on four CPUs (we have four sampling chains), and we're in the process of merging xarray support that also parallelises at the data level, getting us to about 90% of 16 cores for one computation.
The first thing we need is a stable run without the HPA; the HPA is another beast we can look at later.
Yes, it's gone.
worker conf
---
# Source: dask/templates/dask-worker-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-worker

  labels:
    app: dask
    component: worker

spec:
  replicas: 5

  selector:
    matchLabels:
      app: dask
      component: worker

  strategy:
    type: RollingUpdate

  template:
    metadata:
      labels:
        app: dask
        component: worker

      # https://github.com/dask/dask-kubernetes/issues/197
      annotations:
        sidecar.istio.io/inject: "false"

    spec:
      serviceAccountName: dask-worker

      tolerations:
      - key: dedicated
        operator: Equal
        value: dask
        effect: NoSchedule

      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: dedicated
                operator: In
                values:
                - dask

      containers:
      - name: dask-worker
        image: europe-docker.pkg.dev/logary-delivery/cd/dask
        # image: daskdev/dask:2021.10.0

        args:
        - dask-worker
        - dask-scheduler.flows.svc:8786
        - --no-dashboard
        - --nthreads
        - '20'
        - --nprocs
        - '2'
        - --dashboard-address
        - "8790"
        - --memory-limit
        - 30GB
        - --death-timeout
        - '60'

        env:
        - name: EXTRA_PIP_PACKAGES
          value: fastparquet murmurhash distributed gcsfs

        - name: PREFECT__LOGGING__LEVEL
          value: DEBUG

        - name: PREFECT__CONTEXT__SECRETS__LOGARY_PG_USER
          valueFrom:
            secretKeyRef:
              name: analytics-pguser-modelruns
              key: user

        - name: PREFECT__CONTEXT__SECRETS__LOGARY_PG_PASSWORD
          valueFrom:
            secretKeyRef:
              name: analytics-pguser-modelruns
              key: password

        - name: K8S_NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName

        ports:
        - name: http-dashboard
          containerPort: 8790

        resources:
          requests:
            cpu: "15000m"
            memory: 60G
          limits:
            cpu: "16000m"
            memory: 60G
scheduler conf
---
# Source: dask/templates/dask-scheduler-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: dask-scheduler

  labels:
    app: dask
    component: scheduler

spec:
  replicas: 1
  selector:
    matchLabels:
      app: dask
      component: scheduler

  strategy:
    type: RollingUpdate

  template:
    metadata:
      labels:
        app: dask
        component: scheduler

      # https://github.com/dask/dask-kubernetes/issues/197
      annotations:
        sidecar.istio.io/inject: "false"

    spec:
      containers:
      - name: dask-scheduler
        image: europe-docker.pkg.dev/logary-delivery/cd/dask
        # image: daskdev/dask:2021.10.0
        
        args:
        - dask-scheduler
        - --port
        - "8786"
        - --bokeh-port
        - "8787"

        ports:
        - name: tcp-scheduler
          containerPort: 8786

        - name: http-webui
          containerPort: 8787

        resources:
          requests:
            memory: 512Mi
            cpu: 500m
          limits:
            memory: 4Gi
            cpu: 1000m
Is your prefect agent running on the same kubernetes cluster as the dask cluster ?
Yes
Dockerfile
# https://github.com/dask/dask-docker/blob/main/base/Dockerfile
# https://docs.dask.org/en/latest/how-to/deploy-dask/docker.html
# https://stackoverflow.com/questions/53835198/integrating-python-poetry-with-docker
#
FROM daskdev/dask:2021.10.0-py3.9

ARG COMMIT_SHA
ARG COMMIT_REF

RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends tzdata build-essential \
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /opt/app

ENV POETRY_VERSION=1.1.11

RUN pip install "poetry==$POETRY_VERSION"

COPY poetry.lock pyproject.toml postinstall.py ./
COPY --chown=1000:100 infer ./infer

RUN POETRY_VIRTUALENVS_CREATE=false poetry install --no-interaction --no-ansi

RUN python postinstall.py

ENV COMMIT_SHA=${COMMIT_SHA} COMMIT_REF=${COMMIT_REF}

# test 1: import mmm
RUN python -c 'import mmm'

# test 2: import further in
RUN python -c 'from mmm.data.fetching import METRICS_COL_NAMES'
d
Ok, thank you for sharing all of that. Because you are creating workers with 20 threads, the GIL can end up locked and the process can stop responding, depending on your tasks: https://docs.dask.org/en/latest/how-to/deploy-dask/single-machine.html
This is why I'm running all my workers with one thread and one CPU; it maxes everything out at 100% during the run, because the dask scheduler tries to send each task to where its data is.
Could you try running a lot more, smaller workers?
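A rough sketch of the layout being suggested, using a LocalCluster purely as an illustration (the worker count is made up); on Kubernetes the equivalent knobs are the dask-worker --nprocs/--nthreads arguments in the Deployment above:

```python
from dask.distributed import Client, LocalCluster

# The current layout is roughly 2 processes x 20 threads per pod. For GIL-bound,
# pure-Python tasks those 20 threads mostly take turns; many small
# single-threaded worker processes sidestep that:
cluster = LocalCluster(n_workers=16, threads_per_worker=1)
client = Client(cluster)
```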
In our case we have some workers which require high memory, and we use tags to address them for those tasks.
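A sketch of that tagging approach, assuming (if I recall correctly) that Prefect 1.x translates task tags of the form dask-resource:KEY=N into Dask resource requests, matched by workers started with a --resources flag; the HIGHMEM name and the task are hypothetical:

```python
# Workers that should take these tasks would be started with something like:
#   dask-worker ... --resources "HIGHMEM=1"
from prefect import task

@task(tags=["dask-resource:HIGHMEM=1"])
def fit_big_model(df):
    ...  # a memory-hungry model fit, pinned to the HIGHMEM workers
```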