Good day! I can see
KilledWorker: ('blahblahblah', <Worker 'tcp://10.10.10.qp:38828', name: 367, memory: 0, processing: 1) in my flow logs. My dask-scheduler worker timeouts are left at the defaults. The task became 'Failed' 20+ minutes after it started. Which timeout option should I change?
Wilson Bilkovich
08/30/2021, 7:13 PM
The Dask docs I just read say that the default is to never time out due to idleness. There's an option for timing out after a process dies, in the form of
--death-timeout
but that doesn't sound like what is happening to you. Hmm.
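For reference, a minimal sketch of what setting that flag looks like when starting a worker by hand (the scheduler address and the 60-second value are placeholders):
```
# Hypothetical example: tell the worker to shut itself down if it cannot
# reach the scheduler within 60 seconds. Address and value are placeholders.
dask-worker tcp://scheduler-address:8786 --death-timeout 60
```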
bral
08/30/2021, 7:18 PM
As far as I know, this option describes how long the worker will wait for the scheduler before it shuts itself down.
Zach Angell
08/30/2021, 7:18 PM
Hi @bral what's the exact error message associated with the Task's 'Failed' state?
Is it Failed with that error message?
bral
Hi! Yes, I see this error in the Prefect UI when I click on the task.
Wilson Bilkovich
08/30/2021, 7:24 PM
What I've been doing is
kubectl get pods -w
to see the new dask pod when it comes up, and then I do
kubectl logs -f $POD_NAME
on that pod.
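A consolidated sketch of that sequence (the pod name is a placeholder; the --previous flag is an extra worth knowing, since a killed worker container may already have been restarted):
```
# Watch for the new Dask worker pod to come up
kubectl get pods -w

# Tail the live logs of that pod (replace $POD_NAME with the real pod name)
kubectl logs -f $POD_NAME

# If the worker container already crashed and restarted, the previous
# container's logs usually hold the actual error
kubectl logs --previous $POD_NAME
```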
Wilson Bilkovich
08/30/2021, 7:24 PM
Does that show a more-detailed error, when you look at the worker pod logs?
Zach Angell
08/30/2021, 7:29 PM
You may have already come across this, but I'd give a quick read through this guide from distributed https://distributed.dask.org/en/latest/killed.html
Like Wilson suggested, looking at the worker logs might give you a better error message
Note the special case of `KilledWorker`: this means that a particular task was tried on a worker, and it died, and then the same task was sent to another worker, which also died.
Of the possible issues in the guide, the most common one we see with Prefect + Dask is mismatched versions between the client, scheduler, and worker Python libraries.
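One quick way to check for such a mismatch (a sketch, assuming kubectl access to a worker pod; $POD_NAME is a placeholder):
```
# Library versions inside a worker pod
kubectl exec $POD_NAME -- python -c "import dask, distributed, prefect; print(dask.__version__, distributed.__version__, prefect.__version__)"

# Same check on the client machine, for comparison
python -c "import dask, distributed, prefect; print(dask.__version__, distributed.__version__, prefect.__version__)"
```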
bral
08/30/2021, 7:32 PM
Thanks!
Kevin Kho
08/30/2021, 8:37 PM
Hey @bral, as Zach linked, this happens most frequently when the workers run out of memory. Prefect doesn’t exit gracefully because it loses communication with the infrastructure as the worker is killed. I would say that the UI does not indicate the moment the worker failed. It just marks the task as failed eventually, when it can’t detect the heartbeats.
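If memory does turn out to be the culprit, two things worth trying (values are placeholders, and kubectl top requires metrics-server in the cluster):
```
# Watch memory usage of the worker pods while the flow runs
kubectl top pods

# When starting workers by hand, raise the per-worker memory limit
# (4GB is a placeholder; size it to what the tasks actually need)
dask-worker tcp://scheduler-address:8786 --memory-limit 4GB
```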