Hi everyone,
There is an issue where the first task in the flow re-runs before finishing. This happens 3-4 times and then the run fails with a KilledWorker error. It keeps restarting the first task. Do you know what the issue could be?
Riley Hun
08/06/2021, 9:04 PM
Screenshots:
The first task starts to run - I can see the logs. But halfway through, it starts again. No errors there. This happens 4 times and then I see "KilledWorker".
Kevin Kho
08/06/2021, 9:06 PM
Hey @Riley Hun, from experience, it’s a Dask-specific issue and a vague one at that. But I have experienced it the most when I have mismatched versions of the Python libraries between client, scheduler, and workers (but mainly scheduler and workers).
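For example, distributed can compare environments for you. A minimal sketch, assuming a Client is already connected to the cluster (the scheduler address below is hypothetical):
```python
from dask.distributed import Client

# Hypothetical address; with Dask Gateway you would typically
# get the client from the cluster object instead.
client = Client("tcp://scheduler:8786")

# Compares Python and library versions across the client, the
# scheduler, and every worker; check=True raises on any mismatch.
client.get_versions(check=True)
```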
Riley Hun
08/06/2021, 9:07 PM
Thanks @Kevin Kho, this is helpful! Will check to see if the versions are consistent.
Kevin Kho
08/06/2021, 9:07 PM
But here I guess memory might be an issue? It looks like this cluster is already running, right? How is your memory utilization in the UI?
Riley Hun
08/06/2021, 9:09 PM
I believe a private Dask cluster through Dask Gateway is being used, with 12 to 20 workers. It's on a private VPC too, which prevents us from viewing the UI.
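One thing that might help: Dask Gateway proxies the dashboard through the gateway server itself, so it can sometimes be reached even when the workers' network is not. A minimal sketch, assuming the gateway address is reachable (the URL is hypothetical):
```python
from dask_gateway import Gateway

# Hypothetical gateway URL; substitute your deployment's address.
gateway = Gateway("https://dask-gateway.example.com")

# Attach to the already-running cluster and print the proxied
# dashboard URL, which routes through the gateway server.
cluster = gateway.connect(gateway.list_clusters()[0].name)
print(cluster.dashboard_link)
```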
Kevin Kho
08/06/2021, 9:10 PM
In general though, KilledWorker literally means a worker died for some reason and couldn’t be recovered (so it could be memory, hanging, or something else).
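The retry pattern also lines up with distributed's default of distributed.scheduler.allowed-failures = 3: a task whose worker dies is rescheduled, and once the limit is exhausted the run fails with KilledWorker, which would match seeing the task restart three or four times. A minimal sketch for checking worker memory without the dashboard, plus raising the failure limit while debugging (the scheduler address is hypothetical):
```python
import dask
from dask.distributed import Client

client = Client("tcp://scheduler:8786")  # hypothetical; e.g. cluster.get_client()

# Per-worker memory usage vs. limit, pulled through the scheduler
# (handy when the dashboard UI is unreachable).
for addr, info in client.scheduler_info()["workers"].items():
    used_gb = info["metrics"]["memory"] / 1e9
    limit_gb = info["memory_limit"] / 1e9
    print(f"{addr}: {used_gb:.1f} GB used / {limit_gb:.1f} GB limit")

# Allow more worker deaths per task before the scheduler gives up.
# Note: this is read by the scheduler, so it must be set in the
# scheduler's environment before the cluster starts, not just here.
dask.config.set({"distributed.scheduler.allowed-failures": 10})
```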