Good day! I can see KilledWorker: ('blahblahblah'...
# ask-community
b
Good day! I can see KilledWorker: ('blahblahblah', <Worker 'tcp://10.10.10.qp:38828', name: 367, memory: 0, processing: 1) in the flow logs. My dask-scheduler worker timeouts are at their defaults. The task became 'Failed' 20+ minutes after it started. Which timeout option should I change?
w
The Dask docs I just read say that the default is to never time out due to idleness. There's an option for timing out after a process dies, in the form of
--death-timeout
but that doesn't sound like what is happening to you. Hmm.
b
As far as I know, this option describes how long a worker will wait for the scheduler before shutting down.
z
Hi @bral, what's the exact error message associated with the Task's 'Failed' state? Is it Failed with the error message
KilledWorker: ('blahblahblah', <Worker 'tcp://10.10.10.qp:38828', name: 367, memory: 0, processing ...
b
Hi! Yes, I see this error in the Prefect UI when I click on the task.
w
What I've been doing is
kubectl get pods -w
to see the new dask pod when it comes up, and then I do
kubectl logs -f $POD_NAME
on that pod.
Does that show a more-detailed error, when you look at the worker pod logs?
z
You may have already come across this, but I'd give a quick read through this guide from distributed: https://distributed.dask.org/en/latest/killed.html. Like Wilson suggested, looking at the worker logs might give you a better error message.
Note the special case of `KilledWorker`: this means that a particular task was tried on a worker, and it died, and then the same task was sent to another worker, which also died.
Of the possible issues in the guide, the most common one we see with Prefect + Dask is mismatched versions between the client, scheduler, and worker Python libraries
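A rough way to check for that kind of mismatch, as a sketch rather than a definitive recipe: `Client.get_versions()` in distributed reports package versions for the client, the scheduler, and each worker. The helper names `find_mismatches` and `check_cluster`, and the sample versions and worker address below, are made up for illustration; the sample dict only mimics the shape that call returns.

```python
def find_mismatches(versions):
    """Return {package: {role: version}} for packages whose versions differ.

    `versions` is assumed to have the shape returned by
    distributed.Client.get_versions(): 'client' and 'scheduler' entries plus a
    'workers' mapping, each with a 'packages' dict.
    """
    roles = {"client": versions["client"], "scheduler": versions["scheduler"]}
    roles.update(versions.get("workers", {}))

    seen = {}
    for role, info in roles.items():
        for pkg, ver in info["packages"].items():
            seen.setdefault(pkg, {})[role] = ver

    # Keep only packages that report more than one distinct version.
    return {
        pkg: by_role
        for pkg, by_role in seen.items()
        if len(set(by_role.values())) > 1
    }


def check_cluster(address):
    """Hypothetical wrapper: connect to a scheduler and report mismatches."""
    from distributed import Client  # imported here so the helper above runs without it

    with Client(address) as client:
        return find_mismatches(client.get_versions())


# Illustrative data only; real clusters will report more packages.
sample = {
    "client": {"packages": {"python": "3.8.10", "dask": "2021.6.0"}},
    "scheduler": {"packages": {"python": "3.8.10", "dask": "2021.6.0"}},
    "workers": {
        "tcp://10.0.0.1:1234": {"packages": {"python": "3.8.10", "dask": "2021.4.0"}},
    },
}

mismatches = find_mismatches(sample)
for pkg, by_role in mismatches.items():
    print(pkg, by_role)
```

Against a live cluster, `check_cluster("tcp://<scheduler-address>")` would run the same comparison on the real versions, assuming the scheduler is reachable from where the flow is registered.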
b
Thanks!
k
Hey @bral, as Zach linked, this happens most frequently when the workers run out of memory. Prefect doesn’t exit gracefully because it loses communication with the infrastructure when the worker is killed. I would say that the UI does not indicate the moment the worker failed. It just marks the task as failed eventually, when it can’t detect the heartbeats.