Vincent
01/11/2021, 2:24 PM
distributed.utils_perf - WARNING - full garbage collections took 10% CPU time recently (threshold: 10%)
...
distributed.core - INFO - Event loop was unresponsive in Scheduler for 7.07s. This is often caused by long-running GIL-holding functions or moving large chunks of data. This can cause timeouts and instability.
...
distributed.comm.tcp - INFO - Connection closed before handshake completed
...
distributed.scheduler - INFO - Close client connection: Client-worker-3c6d8642-53b5-11eb-800e-32b98c347770
When I scale the job down to 250 nodes and 3 threads per worker, I still get 100% utilization, but it is slightly more stable; the only warning message is
distributed.utils_perf - WARNING - full garbage collections took 11% CPU time recently (threshold: 10%)
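For context on the garbage-collection warning above, here is a minimal sketch of how the share of time spent in full collections can be measured with the standard-library `gc.callbacks` hook. It only illustrates the idea: the class name `GCTimeTracker` is hypothetical, it measures wall-clock rather than CPU time, and it is not meant to reproduce what `distributed` does internally.

```python
import gc
import time


class GCTimeTracker:
    """Accumulate wall-clock time spent in full (generation 2) collections.

    Hypothetical helper for illustration; not part of distributed.
    """

    def __init__(self):
        self.gc_seconds = 0.0
        self.full_collections = 0
        self._gc_start = None
        self._created = time.perf_counter()

    def _callback(self, phase, info):
        # CPython calls each entry in gc.callbacks with phase "start" or
        # "stop" and an info dict that includes the generation collected.
        if info["generation"] != 2:
            return
        if phase == "start":
            self._gc_start = time.perf_counter()
        elif phase == "stop" and self._gc_start is not None:
            self.gc_seconds += time.perf_counter() - self._gc_start
            self.full_collections += 1
            self._gc_start = None

    def install(self):
        gc.callbacks.append(self._callback)

    def uninstall(self):
        gc.callbacks.remove(self._callback)

    def fraction(self):
        """Share of elapsed time since creation spent in full collections."""
        elapsed = time.perf_counter() - self._created
        return self.gc_seconds / elapsed if elapsed > 0 else 0.0


tracker = GCTimeTracker()
tracker.install()

# Build self-referencing lists so the cycle collector has work to do.
for _ in range(1000):
    cycle = []
    cycle.append(cycle)

gc.collect()  # force a full (generation 2) collection
tracker.uninstall()

print(f"full collections: {tracker.full_collections}, "
      f"GC share: {tracker.fraction():.1%}")
```

A persistently high share usually points at many long-lived objects or reference cycles being created, which is one thing to look at when this warning shows up on a busy scheduler.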
Thanks for any advice!
Marwan Sarieddine
01/11/2021, 2:27 PM
Vincent
01/11/2021, 2:32 PM
Dylan
Vincent
01/11/2021, 3:15 PM
Dylan
Marwan Sarieddine
01/11/2021, 3:21 PM
scheduler_spec_file to the DaskKubernetesEnvironment
Vincent
01/11/2021, 3:21 PM
Marwan Sarieddine
01/11/2021, 3:22 PM
Dylan
Vincent
01/11/2021, 3:23 PM
Marwan Sarieddine
01/11/2021, 3:24 PM
Vincent
01/11/2021, 11:40 PM
Marwan Sarieddine
01/12/2021, 1:45 AM
Vincent
01/12/2021, 2:12 PM
Marwan Sarieddine
01/12/2021, 7:49 PM
Vincent
01/12/2021, 8:21 PM
Marwan Sarieddine
01/12/2021, 8:25 PM