
Keith

02/03/2023, 8:12 PM
Hi Prefect Community! We are running our infrastructure on GKE Autopilot and have been seeing an increase in the number of Crashed jobs recently. I am trying to do root cause analysis, so I started by digging through the logs in Prefect Cloud. What I see is that at some point (it is random) the logs stop and nothing further is output to the UI. Digging through the logs in Google Logs Explorer I see the same behavior: the Prefect container logs stop at the same specific point in time. In Google Logs I can also see a lot of Kubernetes-related logs and am starting to see a pattern, but it is not clear how to fix it.
• Roughly 5-10 seconds after the last log, this shows up:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.nodeName: gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc] marked the node as toBeDeleted/unschedulable
• Quickly followed by:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: removing node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc, utilization: {0.5538631957906397 0.1841863664058054 0 cpu 0.5538631957906397}, pods to reschedule: adorable-axolotl-d8k8c-6dx5c
◦ INFO 2023-02-03T19:18:38Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc removed with drain
• GKE tries to reschedule the job, but it fails with the following, which is when Prefect alerts for the Crashed state:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.podName: adorable-axolotl-d8k8c-6dx5c] deleting pod for node scale down
◦ ERROR 2023-02-03T19:18:19.215934101Z [resource.labels.containerName: prefect-job] 19:18:19.214 | INFO | prefect.engine - Engine execution of flow run '8ca83100-dcc3-46d5-91be-f342b19b45a9' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.
This appears to happen on random jobs, which leads me to believe that GKE thinks the cluster is overprovisioned, so it tries to reduce the cluster size and move jobs around; but jobs can't be moved in the middle of execution, so they Crash/Fail. I am also curious whether this is due to resource sizing, but the jobs I have been troubleshooting don't show any insufficient resource problems. They all typically report the following in the containerStatuses leaf of the pod's JSON:
state: {
  terminated: {
    containerID: "containerd://aac705"
    exitCode: 143
    finishedAt: "2023-02-03T19:18:19Z"
    reason: "Error"
    startedAt: "2023-02-03T19:16:52Z"
  }
}
Any insight would be greatly appreciated!
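(For anyone who wants to reproduce that check, a quick Kubernetes Python client sketch like the one below should pull the same containerStatuses block; the namespace is a placeholder, not something from our setup.)
from kubernetes import client, config

config.load_kube_config()

# Read the flow-run pod and print any terminated container states.
pod = client.CoreV1Api().read_namespaced_pod(
    name="adorable-axolotl-d8k8c-6dx5c", namespace="default"  # placeholder namespace
)
for status in pod.status.container_statuses or []:
    term = status.state.terminated
    if term:
        # exitCode 143 is 128 + SIGTERM(15), i.e. the container was killed
        # externally (here, by the node drain) rather than failing on its own.
        print(status.name, term.exit_code, term.reason, term.finished_at)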

Walter Cavinaw

02/03/2023, 10:05 PM
I have the same setup and a similar problem. It's hard to debug as well because it happens randomly!
This Stack Overflow post helped shine some light on it: https://stackoverflow.com/questions/71509986/with-gke-autopilot-banning-the-cluster-autoscaler-kubernetes-io-safe-to-evict-fa
Out of curiosity, how many flows are you running when this happens? I only notice this error when we have many things running at the same time.

Keith

02/04/2023, 3:12 AM
out of curiosity how many flows are you running when this happens?
We had no issues with this until ~3 weeks ago, when the number of our jobs went up significantly. It happens more often when there are more jobs running at the same time, but I have also observed it happening when only 2 jobs are running.
Thanks for the SO post; I had found that this morning as well and am working on a post to this thread because I'm not sure that setting will help. But maybe the setting in your issue tracker would be helpful.

Walter Cavinaw

02/04/2023, 3:15 AM
It might also be worth trying to add emptyDir local storage. These docs say that a pod will not get scaled down if it has local storage attached: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
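Roughly, as a sketch of the KubernetesJob customizations (the volume name, mount path, and container index are assumptions, so adjust them to your job template):
from prefect.infrastructure import KubernetesJob

# Sketch: attach an emptyDir volume so the autoscaler sees the pod as having
# local storage. Names and mount path are made up for illustration.
job = KubernetesJob(
    customizations=[
        {
            "op": "add",
            "path": "/spec/template/spec/volumes",
            "value": [{"name": "scratch", "emptyDir": {}}],
        },
        {
            "op": "add",
            "path": "/spec/template/spec/containers/0/volumeMounts",
            "value": [{"name": "scratch", "mountPath": "/scratch"}],
        },
    ],
)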
Just FYI, I've paid to escalate the issue with GCP support. It was 500. This is quite disruptive for us, so if you can, do you mind also making a comment on that issue tracker and clicking the +1 in the top right? The more attention this gets, the better. Maybe they can fix it soon.

Keith

02/06/2023, 9:39 PM
do you mind also making a comment on that issue tracker and clicking the +1 in the top right.
Done, thank you for leading this!
I am also curious if anyone on the Prefect team has a comment on this?
it might also be worth trying to add emptyDir local storage.
Have you tried this?

Walter Cavinaw

02/07/2023, 12:04 AM
Yes, I tried this. It didn't work. The only thing I didn't try is adding a pod disruption budget, because it seems like a lot of work and isn't guaranteed to work given the limits on what a PDB can prevent.
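For reference only (since I didn't actually try it), a PDB would look roughly like this with the Kubernetes Python client; the namespace and label selector are assumptions, and the flow-run pods would need a matching label for it to apply:
from kubernetes import client, config

config.load_kube_config()

# Sketch: a PodDisruptionBudget that disallows voluntary evictions of pods
# carrying a (hypothetical) app=prefect-flow-run label.
pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="prefect-flow-runs"),
    spec=client.V1PodDisruptionBudgetSpec(
        max_unavailable=0,
        selector=client.V1LabelSelector(match_labels={"app": "prefect-flow-run"}),
    ),
)
client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="prefect", body=pdb  # placeholder namespace
)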
I resorted to using a GKE standard cluster.
Otherwise Autopilot would be the perfect solution.

Keith

02/07/2023, 12:10 AM
Is it working as expected on a standard cluster? Are you passing the safe-to-evict annotation as part of the job config?

Walter Cavinaw

02/07/2023, 12:11 AM
Yes, and the crashing error is gone 🙏 I am adding this to the KubernetesJob customizations:
{
    "op": "add",
    "path": "/spec/template/metadata",
    "value": {"annotations": {"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}},
},
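For completeness, attached to a Prefect 2 KubernetesJob block it looks roughly like this (the block name here is just an example):
from prefect.infrastructure import KubernetesJob

# Sketch: patch the job's pod template so the cluster autoscaler treats the
# flow-run pod as not safe to evict.
k8s_job = KubernetesJob(
    customizations=[
        {
            "op": "add",
            "path": "/spec/template/metadata",
            "value": {
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            },
        },
    ],
)
k8s_job.save("gke-standard-no-evict", overwrite=True)  # example block name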

Keith

02/07/2023, 12:15 AM
Amazing! 😎

Walter Cavinaw

02/07/2023, 12:18 AM
Tbh, I never actually tried this specific thing on Autopilot because I assumed it wouldn't work, but you might as well try it. I see some other Autopilot issues in the issue tracker that seem to be resolved already but are not marked as such. There is a small chance this is already solved but not closed?