Emil Ordoñez
04/17/2023, 3:00 PM
2023-04-16 08:09:12,613 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 5
2023-04-16 08:09:49,807 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 6
2023-04-16 08:10:29,194 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 7
2023-04-16 08:21:04,785 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 43
2023-04-16 08:21:08,311 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 12
2023-04-16 08:21:10,879 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 14
I think the warning above is the most explanatory error, since it suggests that prefect-dask may be reusing worker names. That could be why the Dask workers fail to register with the scheduler, and those failed workers kept running until I noticed them and stopped them manually.
I'm getting these messages in the workers:
2023-04-16 08:09:12,614 - distributed.worker - ERROR - Unable to connect to scheduler: name taken, 5
2023-04-16 08:09:12,614 - distributed.worker - INFO - Stopping worker at <tcp://172.31.39.118:34983>. Reason: worker-close
2023-04-16 08:10:11,023 - distributed.nanny - INFO - Closing Nanny at '<tcp://172.31.39.118:44953>'. Reason: nanny-close
but the workers never actually exit; I have to stop them manually.
I've just discovered those warnings on the scheduler, so they may give us a good hint about the actual cause of the issue.
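To make the duplicate-name hypothesis concrete, here is a toy model of what the logs suggest is happening. This is an assumption for illustration only, not Dask's actual scheduler code: `ToyScheduler` and `unique_name` are hypothetical names. The model rejects a worker whose name is already registered, which matches the "duplicate name" warning and "name taken" error above. It also shows why giving each worker a unique name (for example, a random suffix) would avoid the collision.

```python
import uuid


class ToyScheduler:
    """Hypothetical stand-in for a scheduler's worker-name registry (not Dask's real code)."""

    def __init__(self):
        self.aliases = {}  # worker name -> worker address

    def add_worker(self, name, address):
        # Refuse a second worker that reuses an already registered name,
        # mirroring the "name taken, <name>" error seen in the worker logs.
        if name in self.aliases:
            return {"status": "error", "message": f"name taken, {name}"}
        self.aliases[name] = address
        return {"status": "OK"}


def unique_name(prefix):
    # Hypothetical helper: append a random suffix so a replacement worker
    # never collides with a stale registration under the same base name.
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


scheduler = ToyScheduler()
print(scheduler.add_worker("5", "tcp://10.0.0.1:34983"))  # {'status': 'OK'}
print(scheduler.add_worker("5", "tcp://10.0.0.2:34983"))  # {'status': 'error', 'message': 'name taken, 5'}
print(scheduler.add_worker(unique_name("5"), "tcp://10.0.0.2:34983"))  # {'status': 'OK'}
```

In this model, the rejected worker simply gets an error back. Nothing in it forces the worker process to exit, which is consistent with (but does not explain) the workers lingering until they are stopped by hand.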
I'm using:
prefect-dask==0.2.3
dask-cloudprovider[aws]==2022.10.0
prefect version is 2.8.7
Jeff Hale
04/29/2023, 1:12 PM
Emil Ordoñez
05/04/2023, 3:11 AM