simone
12/08/2020, 12:41 PM
I am using HTCondor as the workload manager and a DaskExecutor connected to a cluster started with dask-jobqueue.
I am mapping a function over a large number of images (20000). I started by running a subset of the data (5000) with a predefined number of workers, and the function is mapped and runs correctly using all the available workers.
If the cluster is started in adaptive mode using dask-jobqueue, Prefect is able to increase the number of processes as expected by the adaptive mode (and as monitored in the workload manager). However, the size of the dask cluster doesn't change and the function isn't mapped, not even to the minimum number of workers predefined for the cluster. Interestingly, HTCondor allocates new running processes, but they seem to be independent of the dask cluster. It seems that the cluster initialised with dask-jobqueue cannot communicate with the processes started by Prefect. After a few minutes the processes started by Prefect die, printing out this error:
distributed.scheduler - ERROR - Workers don't have promised key: ['<tcp://192.168.0.2:40093>']
# the tcp address changes
Any help will be greatly appreciated! Thanks!
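For reference, a minimal sketch of this kind of setup, assuming the DaskExecutor itself creates the HTCondorCluster; the resource values and adaptive bounds are placeholders, not the actual configuration:

# Illustrative only: Prefect 0.x DaskExecutor backed by a dask-jobqueue
# HTCondorCluster running in adaptive mode.
from dask_jobqueue import HTCondorCluster
from prefect.engine.executors import DaskExecutor  # exposed as prefect.executors in later 0.x releases

executor = DaskExecutor(
    cluster_class=HTCondorCluster,
    cluster_kwargs={"cores": 1, "memory": "2GB", "disk": "2GB"},  # placeholder resources
    adapt_kwargs={"minimum": 5, "maximum": 50},                   # placeholder adaptive bounds
)
# flow.run(executor=executor)  # attach the executor when running the flow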
Jim Crist-Harif
12/08/2020, 3:02 PM
Using a distributed.Client to connect to the dask-jobqueue cluster directly (without Prefect), and then calling client.map(time.sleep, [30] * 1000) (or something like that), should be sufficient to test. This would help rule out bugs upstream, or issues with your setup.
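A sketch of that standalone test, with Prefect out of the loop entirely; the resources and adaptive bounds are placeholders:

# Illustrative only: exercise the dask-jobqueue cluster directly with a distributed.Client.
import time

from dask_jobqueue import HTCondorCluster
from distributed import Client

cluster = HTCondorCluster(cores=1, memory="2GB", disk="2GB")
cluster.adapt(minimum=5, maximum=50)  # same adaptive bounds as the real run

client = Client(cluster)
futures = client.map(time.sleep, [30] * 1000, pure=False)  # pure=False so each call gets its own key
client.gather(futures)  # should complete once the adaptive workers come up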
If that all works appropriately, I'd want to know:
• What versions of dask, distributed, and dask-jobqueue are you using?
• What is your DaskExecutor configuration?
The log about workers not having a promised key is a bit odd - that appears to be conflating a scheduler address with a key, which isn't a codepath that should ever occur.

simone
12/08/2020, 3:30 PM
Jim Crist-Harif
12/08/2020, 3:31 PM

simone
12/08/2020, 3:32 PM

Jim Crist-Harif
12/08/2020, 3:55 PM
Can you try a simple function (like time.sleep) instead?
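A sketch of what such a trivial mapped task could look like; the task and flow names here are made up, and the executor would be configured as in the real flow:

# Illustrative only: a minimal Prefect 0.x flow mapping a sleep task.
import time

from prefect import Flow, task

@task
def sleep_task(seconds):
    time.sleep(seconds)

with Flow("sleep-test") as flow:
    sleep_task.map([30] * 100)

# flow.run(executor=executor)  # same DaskExecutor configuration as the real flow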
simone
12/08/2020, 8:05 PM
With a simple time.sleep task, things seem to work: I get more workers, the function gets mapped to the number of workers, and everything is fine. I am just puzzled because I have run these jobs before in another setup and had no issues with memory. I ran a test doubling the memory per worker to check; this doesn't fully fix the issue. The function gets mapped, but only to a subset of workers; after processing around 1/10 of the data it stops and the number of workers is scaled down (no error). In addition, there is still a mismatch between the dask cluster size and the number of jobs reported in HTCondor. I will make sure my settings are correct, and if they are, either I am missing something or the issue may be with dask-jobqueue.
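For context, doubling the per-worker memory in a dask-jobqueue setup is just a change to the cluster's resource request; the values below are placeholders rather than the actual settings:

# Illustrative only: requesting more memory per worker from HTCondor.
from dask_jobqueue import HTCondorCluster

cluster = HTCondorCluster(
    cores=1,
    memory="4GB",  # e.g. doubled from 2GB per worker
    disk="2GB",
)
cluster.adapt(minimum=5, maximum=50)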
I really appreciate the time you spent on this. Thank you so much for helping me out. Both dask and prefect have made a big impact on my daily routine of analysing microscopy images, and I am really thankful for all the work you are putting into them. Thanks a lot!