Lukasz Pakula
10/25/2022, 9:32 AM
INFO - Retiring workers [154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185]
INFO - Adaptive stop
INFO - Adaptive stop
ERROR - prefect.CloudFlowRunner | Unexpected error: KilledWorker('<name>', <WorkerState 'tcp://<ip>', name: 47, status: closed, memory: 0, processing: <number>>, 3)
Restarting the flow resolves the issue.
Is there any sensible explanation of why upgrading the Kubernetes cluster could cause this? Or am I missing something elsewhere?
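A minimal sketch of the two scaling modes in question, assuming the flow runs on Prefect 1.x with a DaskExecutor backed by a temporary dask-kubernetes KubeCluster (the image, worker counts, and flow name below are placeholders). Pinning a fixed worker pool instead of passing adapt_kwargs is one way to rule out adaptive worker retirement as the trigger for the KilledWorker error:

from prefect import Flow, task
from prefect.executors import DaskExecutor
from dask_kubernetes import KubeCluster, make_pod_spec

@task
def example_task():
    print("hello")

executor = DaskExecutor(
    cluster_class=KubeCluster,
    cluster_kwargs={
        "pod_template": make_pod_spec(image="daskdev/dask:latest"),  # placeholder image
        "n_workers": 8,  # fixed worker pool, no adaptive retirement
    },
    # adapt_kwargs={"minimum": 2, "maximum": 32},  # adaptive mode, which produces the "Retiring workers" log above
)

with Flow("example-flow", executor=executor) as flow:
    example_task()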
Mason Menges
10/25/2022, 9:34 PM
Lukasz Pakula
10/26/2022, 12:58 PM
Attempted to run task <task-name> on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was <ip>. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see <https://distributed.dask.org/en/stable/killed.html>.
We have a retry delay set to 1 min, but I can see the above failure ~10 sec after the internal worker error.
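For reference, a minimal sketch of the retry configuration being described, using Prefect 1.x task-level retries (the task name and retry count are illustrative):

from datetime import timedelta
from prefect import task

@task(max_retries=3, retry_delay=timedelta(minutes=1))
def my_task():
    ...  # task body omitted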
Lukasz Pakula
10/27/2022, 7:55 AM
INFO:prefect.CloudTaskRunner:Task '<name>': Finished task run for task with final state: 'Retrying'
distributed._signals - INFO - Received signal SIGTERM (15)
It's supposed to retry the failed task, but a SIGTERM is sent to the container at the same time. Not sure if that's expected or not.
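One hedged avenue to explore here (an assumption, not something confirmed in the thread): if workers are dying rather than retiring cleanly while a task is assigned to them, each death counts toward Dask's distributed.scheduler.allowed-failures threshold (default 3, the trailing 3 in the KilledWorker error and the "3 different workers" in the message above), after which the scheduler gives up with KilledWorker. Raising that threshold is a possible mitigation; the value below is illustrative, and it needs to be in effect in the scheduler's environment (e.g. via Dask's config file or environment variables) before the cluster starts:

import dask

# Raise the number of times a task may be rescheduled after a worker dies
# before the scheduler marks it as KilledWorker (the default is 3).
dask.config.set({"distributed.scheduler.allowed-failures": 10})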
Andrew Pruchinski
11/08/2022, 8:13 PM
Lukasz Pakula
11/09/2022, 8:03 AM
Mason Menges
11/09/2022, 4:51 PM
Andrew Pruchinski
11/09/2022, 8:54 PM
Andrew Pruchinski
11/09/2022, 8:54 PM
Lukasz Pakula
11/10/2022, 7:51 AM
Andrew Pruchinski
11/11/2022, 9:30 PM
Lukasz Pakula
11/15/2022, 8:01 AM
Andrew Pruchinski
11/15/2022, 2:54 PM
Lukasz Pakula
11/16/2022, 10:02 AM
Andrew Pruchinski
11/16/2022, 1:53 PM