EKS Prefect v1
# ask-community
Lukasz Pakula
Hi, I'm running Prefect 1.2.2. It was all running smoothly until I upgraded the Kubernetes (EKS) version from 1.21 to the latest, 1.23. Now I randomly get the following error:
INFO - Retiring workers [154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185]
INFO - Adaptive stop
INFO - Adaptive stop
ERROR - prefect.CloudFlowRunner | Unexpected error: KilledWorker('<name>', <WorkerState 'tcp://<ip>', name: 47, status: closed, memory: 0, processing: <number>, 3)
Restarting the flow resolves the issue. Is there any sensible explanation for why upgrading the Kubernetes cluster could cause this? Or am I missing something elsewhere?
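For context, the "Retiring workers" / "Adaptive stop" lines come from Dask's adaptive scaling, and KilledWorker is raised by the Dask scheduler after a task's workers die. A minimal sketch of the kind of Prefect 1.x flow configuration that produces this setup, assuming a DaskExecutor backed by dask_kubernetes (the image name and worker bounds are placeholders):

from prefect import Flow
from prefect.executors import DaskExecutor
from prefect.run_configs import KubernetesRun

# Hypothetical flow: runs as a Kubernetes job and spins up an adaptive
# Dask cluster of pods; workers are retired when the adaptive scaler
# decides they are idle.
with Flow(
    "example-flow",
    run_config=KubernetesRun(image="my-registry/my-flow:latest"),
    executor=DaskExecutor(
        cluster_class="dask_kubernetes.KubeCluster",
        adapt_kwargs={"minimum": 2, "maximum": 32},
    ),
) as flow:
    ...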
Mason Menges
Hey @Lukasz Pakula, would you mind outlining the steps you worked through while updating the EKS cluster? Specifically, were any Prefect flows/tasks running when you updated the cluster, or was the cluster shut off prior to updating?
Lukasz Pakula
@Mason Menges The cluster was up and running. After the upgrade, the prefect-agent was rescheduled to another node (with the newest Kubernetes version). After a whole day of debugging, I can see that it all starts with an internal worker error (e.g. a timeout reaching some internal service). Prefect should retry this task 5 times, but it fails immediately:
Attempted to run task <task-name> on 3 different workers, but all those workers died while running it. The last worker that attempt to run the task was <ip>. Inspecting worker logs is often a good next step to diagnose what went wrong. For more information see <https://distributed.dask.org/en/stable/killed.html>.
We have a retry delay set to 1 min, but I can see the above failure ~10 sec after the internal worker error.
Also, this is what I can see in the logs after the failure:
INFO:prefect.CloudTaskRunner:Task '<name>': Finished task run for task with final state: 'Retrying'
distributed._signals - INFO - Received signal SIGTERM (15)
It's supposed to retry the failed task, but the SIGTERM is sent to the container at the same time. Not sure if that's expected or not.
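For reference, the retry behaviour described here is normally configured on the task itself in Prefect 1.x. A minimal sketch, assuming 5 retries with a 1-minute delay as mentioned above (the task name and body are hypothetical):

from datetime import timedelta
from prefect import task

# Hypothetical task; the settings mirror the "retry 5 times, 1 min delay"
# configuration described in the message above.
@task(max_retries=5, retry_delay=timedelta(minutes=1))
def fetch_data():
    ...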
Andrew Pruchinski
We have run into this same issue and still can't track it down. We reduced the number of workers running at one time and it still fails, pretty quickly. Any update on this?
Lukasz Pakula
@Andrew Pruchinski I couldn't track down the issue either. It happened only occasionally for me, but it made the whole Prefect setup unreliable. What I checked:
• pods were not killed (OOM)
• nodes were not terminated
• the prefect-agent was reachable
• restarting the flow manually usually fixed the issue
I eventually rolled back to EKS 1.21 and will upgrade again after the Prefect 2.0 migration.
Mason Menges
Something that might be related here: I spoke to one of our K8s experts and he suggested it might be an issue with how the clusters were upgraded. Specifically, Kubernetes doesn't support skipping minor versions, so you'd need to upgrade the cluster from 1.21 to 1.22 to 1.23. We'd also recommend making sure you don't have any active flows/deployments running on the cluster when it's updated.
Relevant docs:
https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/
https://kubernetes.io/releases/version-skew-policy/
Andrew Pruchinski
Thank you @Lukasz Pakula for that explanation! We are rolling back too. I am a rookie with k8s, so I appreciate the detailed response.
@Mason Menges thank you!
Lukasz Pakula
@Mason Menges You are not allowed to skip minor versions while upgrading EKS; you need to go one by one, and that is what we did. It's a good idea to provision a fresh 1.23 cluster and see if that makes any difference, though (unless @Andrew Pruchinski you have already done that?)
Andrew Pruchinski
@Lukasz Pakula Provisioning a new one seemed to do the trick! I didn't want to jinx it, so I waited a few days, but all good now!
Lukasz Pakula
@Andrew Pruchinski thank you for confirming!
Andrew Pruchinski
And it failed again this morning with the same error as before. Looking into it again. How has yours been holding up @Lukasz Pakula?
Lukasz Pakula
@Andrew Pruchinski Once we migrated back to EKS 1.21, the issue was gone. We will upgrade to Prefect 2 before we upgrade EKS again.
Andrew Pruchinski
@Lukasz Pakula The fresh 1.23 cluster lasted a few days and then failed. We migrated back to 1.22 and there seem to be no issues right now. Thank you for the update.