EKS Prefect v1
# ask-community
Lukasz Pakula
Hi, I'm running Prefect 1.2.2. It was all running smoothly until I upgraded the Kubernetes (EKS) version from 1.21 to the latest, 1.23. Now I randomly get the following error:
INFO - Retiring workers [154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185]
INFO - Adaptive stop
INFO - Adaptive stop
ERROR - prefect.CloudFlowRunner | Unexpected error: KilledWorker('<name>', <WorkerState 'tcp://<ip>', name: 47, status: closed, memory: 0, processing: <number>, 3)
Restarting the flow resolves the issue. Is there any sensible explanation for why upgrading the Kubernetes cluster could cause this? Or am I missing something elsewhere?
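For context, the "Retiring workers" / "Adaptive stop" lines come from Dask's adaptive scaling, and KilledWorker is raised by the Dask scheduler after a task's workers die. A minimal sketch of the kind of Prefect 1.x flow configuration that produces this setup, assuming a DaskExecutor backed by dask_kubernetes (the image name and worker bounds are placeholders):

from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

# Hypothetical flow: runs as a Kubernetes job and spins up an adaptive
# Dask cluster of pods; workers are retired when the adaptive scaler
# decides they are idle.
with Flow(
    "example-flow",
    run_config=KubernetesRun(image="my-registry/my-flow:latest"),
    executor=DaskExecutor(
        cluster_class="dask_kubernetes.KubeCluster",
        adapt_kwargs={"minimum": 2, "maximum": 32},
    ),
) as flow:
    ...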
Mason Menges
Hey @Lukasz Pakula, would you mind outlining the steps you worked through while updating the EKS cluster? Specifically, were any Prefect flows/tasks running when you updated the cluster, or was the cluster shut off prior to updating?
Lukasz Pakula
@Mason Menges The cluster was up and running. After the upgrade, the prefect-agent was rescheduled to another node (with the newest Kubernetes version). After a whole day of debugging, I can see that it all starts with an internal worker error (e.g. a timeout reaching some internal service). Prefect should retry this task 5 times, but it fails immediately:
Attempted to run task <task-name> on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was <ip>. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see <https://distributed.dask.org/en/stable/killed.html>.
We have a retry delay set to 1 min, but I can see the above failure ~10 sec after the internal worker error.
Also, this is what I can see in the logs after the failure:
INFO:prefect.CloudTaskRunner:Task '<name>': Finished task run for task with final state: 'Retrying'
distributed._signals - INFO - Received signal SIGTERM (15)
It's supposed to retry the failed task, but the SIGTERM is sent to the container at the same time. Not sure if that's expected or not.
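For reference, the retry behaviour described here is normally configured on the task itself in Prefect 1.x. A minimal sketch, assuming 5 retries with a 1-minute delay as mentioned above (the task name and body are hypothetical):

from datetime import timedelta
from prefect import task

# Hypothetical task; the settings mirror the "retry 5 times, 1 min delay"
# configuration described in the message above.
@task(max_retries=5, retry_delay=timedelta(minutes=1))
def fetch_data():
    ...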
Andrew Pruchinski
We have run into this same issue and still can't track it down. We reduced the number of workers running at one time and it still fails, pretty quickly. Any update on this?
Lukasz Pakula
@Andrew Pruchinski I couldn't track down the issue either. It happened only occasionally for me, but it made the whole Prefect setup unreliable. What I checked:
• pods were not killed (OOM)
• nodes were not terminated
• the prefect-agent was reachable
• restarting the flow manually usually fixed the issue
I eventually rolled back to EKS 1.21 and will upgrade again after the Prefect 2.0 migration.
Mason Menges
Something that might be related here: I spoke to one of our K8s experts and he suggested it might be an issue with how the clusters were upgraded. Specifically, Kubernetes doesn't support skipping minor versions, so you'd need to upgrade the cluster from 1.21 to 1.22 to 1.23. We'd also recommend making sure you don't have any active flows/deployments running on the cluster when it's updated.
Relevant docs:
https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
https://kubernetes.io/releases/version-skew-policy/
Andrew Pruchinski
Thank you @Lukasz Pakula for that explanation! We are rolling back too. I am a rookie with k8s, so I appreciate the detailed response.
@Mason Menges thank you!
Lukasz Pakula
@Mason Menges You are not allowed to skip minor versions while upgrading EKS; you need to go one by one, and that is what we did. It's a good idea to provision a fresh 1.23 cluster and see if that makes any difference, though (unless @Andrew Pruchinski you have already done that?)
Andrew Pruchinski
@Lukasz Pakula Provisioning a new one seemed to do the trick! I didn't want to jinx it, so I waited a few days, but all good now!
Lukasz Pakula
@Andrew Pruchinski thank you for confirming!
Andrew Pruchinski
And it failed again this morning with the same error as before. Looking into it again. How has yours been holding up @Lukasz Pakula?
Lukasz Pakula
@Andrew Pruchinski Once we migrated back to EKS 1.21, the issue was gone. We will upgrade to Prefect 2 before we upgrade EKS again.
Andrew Pruchinski
@Lukasz Pakula The fresh 1.23 cluster lasted a few days and then failed. We migrated back to 1.22 and there seem to be no issues right now. Thank you for the update.