Good day! I can see
KilledWorker: ('blahblahblah', <Worker 'tcp://10.10.10.qp:38828', name: 367, memory: 0, processing: 1) in my flow logs. My dask-scheduler worker timeouts are left at the defaults. The task became 'Failed' 20+ minutes after it started. Which timeout option should I change?
Wilson Bilkovich
08/30/2021, 7:13 PM
The Dask docs I just read say that the default is to never time out due to idleness. There's an option for timing out after a process dies, in the form of
--death-timeout
but that doesn't sound like what is happening to you. Hmm.
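For reference, a minimal sketch of what setting that flag looks like when starting a worker by hand (the scheduler address and the 60-second value are placeholders):
```
# Hypothetical example: tell the worker to shut itself down if it cannot
# reach the scheduler within 60 seconds. Address and value are placeholders.
dask-worker tcp://scheduler-address:8786 --death-timeout 60
```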
bral
08/30/2021, 7:18 PM
As far as I know, this option describes how long the worker will wait for the scheduler before it shuts itself down.
Zach Angell
08/30/2021, 7:18 PM
Hi @bral what's the exact error message associated with the Task's 'Failed' state?
Is it Failed with that error message?
bral
Hi! Yes, I see this error in the Prefect UI when I click on the task.
Wilson Bilkovich
08/30/2021, 7:24 PM
What I've been doing is
kubectl get pods -w
to see the new dask pod when it comes up, and then I do
kubectl logs -f $POD_NAME
on that pod.
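A consolidated sketch of that sequence (the pod name is a placeholder; the --previous flag is an extra worth knowing, since a killed worker container may already have been restarted):
```
# Watch for the new Dask worker pod to come up
kubectl get pods -w

# Tail the live logs of that pod (replace $POD_NAME with the real pod name)
kubectl logs -f $POD_NAME

# If the worker container already crashed and restarted, the previous
# container's logs usually hold the actual error
kubectl logs --previous $POD_NAME
```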
Wilson Bilkovich
08/30/2021, 7:24 PM
Does that show a more-detailed error, when you look at the worker pod logs?
Zach Angell
08/30/2021, 7:29 PM
You may have already come across this, but I'd give a quick read through this guide from distributed https://distributed.dask.org/en/latest/killed.html
Like Wilson suggested, looking at the worker logs might give you a better error message
Note the special case of `KilledWorker`: this means that a particular task was tried on a worker, and it died, and then the same task was sent to another worker, which also died.
Of the possible issues in the guide, the most common one we see with Prefect + Dask is mismatched versions between the client, scheduler, and worker Python libraries.
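One quick way to check for such a mismatch (a sketch, assuming kubectl access to a worker pod; $POD_NAME is a placeholder):
```
# Library versions inside a worker pod
kubectl exec $POD_NAME -- python -c "import dask, distributed, prefect; print(dask.__version__, distributed.__version__, prefect.__version__)"

# Same check on the client machine, for comparison
python -c "import dask, distributed, prefect; print(dask.__version__, distributed.__version__, prefect.__version__)"
```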
bral
08/30/2021, 7:32 PM
Thanks!
Kevin Kho
08/30/2021, 8:37 PM
Hey @bral, as Zach linked, this happens most frequently when the workers run out of memory. Prefect doesn’t exit gracefully because it loses communication with the infrastructure as the worker is killed. I would say that the UI does not indicate the moment the worker failed. It just marks the task as failed eventually, when it can’t detect the heartbeats.
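If memory does turn out to be the culprit, two things worth trying (values are placeholders, and kubectl top requires metrics-server in the cluster):
```
# Watch memory usage of the worker pods while the flow runs
kubectl top pods

# When starting workers by hand, raise the per-worker memory limit
# (4GB is a placeholder; size it to what the tasks actually need)
dask-worker tcp://scheduler-address:8786 --memory-limit 4GB
```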