Emil Ordoñez

04/17/2023, 3:00 PM
Hello people, hope everyone is having a prefect time!! I'm currently using the prefect-dask collection with a Fargate cluster. Everything is running OK except for some random cases in which I end up with a bunch of Dask Workers still running for a long time after the Prefect Tasks have already finished; I have seen workers running more than a day after the Tasks finish, and I have to stop the workers' ECS Tasks manually. Whenever this happens, the Dask Scheduler is not running anymore, so it seems like the scheduler never had the chance to shut the workers down. For more context: Dask is creating the Scheduler and the Workers for me; I'm only creating the Task Definitions and Roles. So I'm not launching anything myself, Dask just manages creating and running the Scheduler and Workers. I'm getting these errors on the Scheduler:
2023-04-16 08:09:12,613 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 5
2023-04-16 08:09:49,807 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 6
2023-04-16 08:10:29,194 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 7
2023-04-16 08:21:04,785 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 43
2023-04-16 08:21:08,311 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 12
2023-04-16 08:21:10,879 - distributed.scheduler - WARNING - Worker tried to connect with a duplicate name: 14
I think the previous one is the most explanatory error, as it signals that prefect-dask may be repeating worker names. That could be why those Dask Workers fail to register with the Scheduler, and the workers that failed to register never stopped until I noticed them and shut them down manually. I'm getting these messages in the workers:
2023-04-16 08:09:12,614 - distributed.worker - ERROR - Unable to connect to scheduler: name taken, 5
2023-04-16 08:09:12,614 - distributed.worker - INFO - Stopping worker at <tcp://172.31.39.118:34983>. Reason: worker-close
2023-04-16 08:10:11,023 - distributed.nanny - INFO - Closing Nanny at '<tcp://172.31.39.118:44953>'. Reason: nanny-close
but they don't shut down; I have to stop them manually. I've only just discovered those warnings on the Scheduler, so that may give us a pretty good hint about the actual cause of the issue. I'm using:
prefect-dask==0.2.3
dask-cloudprovider[aws]==2022.10.0
prefect==2.8.7
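Roughly, the setup looks like this (a simplified sketch of what I described above; the image, ARNs, worker count, and other kwargs are placeholders, not my actual values):
from prefect import flow, task
from prefect_dask import DaskTaskRunner

@task
def process(x):
    return x * 2

@flow(
    task_runner=DaskTaskRunner(
        # DaskTaskRunner creates the FargateCluster for the flow run and is
        # expected to tear it down again when the flow finishes.
        cluster_class="dask_cloudprovider.aws.FargateCluster",
        cluster_kwargs={
            "image": "daskdev/dask:latest",  # placeholder image
            "n_workers": 4,
            "task_role_arn": "arn:aws:iam::123456789012:role/dask-task-role",  # placeholder
            "execution_role_arn": "arn:aws:iam::123456789012:role/dask-execution-role",  # placeholder
        },
    )
)
def my_flow():
    futures = process.map(list(range(10)))
    return [f.result() for f in futures]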

Jeff Hale

04/29/2023, 1:12 PM
Hi Emil. It looks like there’s a similar open issue here. I believe there has been some work in this area. If you upgrade to the latest Prefect version, 2.10.6, does your issue resolve?
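For reference, a minimal way to try that upgrade in a pip-managed environment (adjust the pins to whatever fits your setup) would be something like:
pip install --upgrade "prefect==2.10.6" prefect-dask "dask-cloudprovider[aws]"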

Emil Ordoñez

05/04/2023, 3:11 AM
Jeff, thanks for the response. Yep, that open issue does seem similar to ours. I'll check whether it's already fixed in the new version. Again, thanks for the reply.