
Keith

02/03/2023, 8:12 PM
Hi Prefect Community! We are running our infrastructure on GKE Autopilot and have been seeing an increase in the number of Crashed jobs recently. I am trying to do root cause analysis, so I started by digging through the logs in Prefect Cloud. What I see is that at some point (it is random) the logs stop and nothing further is output to the UI. Digging through the logs in Google Logs Explorer I see the same behavior: the Prefect container logs stop at the same specific point in time. In Google Logs I can also see a lot of Kubernetes-related logs and am starting to see a pattern, but it is not clear how to fix it.
• Roughly 5-10 seconds after the last log, this shows up:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.nodeName: gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc] marked the node as toBeDeleted/unschedulable
• Quickly followed by:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: removing node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc, utilization: {0.5538631957906397 0.1841863664058054 0 cpu 0.5538631957906397}, pods to reschedule: adorable-axolotl-d8k8c-6dx5c
◦ INFO 2023-02-03T19:18:38Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc removed with drain
• GKE tries to reschedule the job, but it fails with the following, which is when Prefect alerts for the Crashed state:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.podName: adorable-axolotl-d8k8c-6dx5c] deleting pod for node scale down
◦ ERROR 2023-02-03T19:18:19.215934101Z [resource.labels.containerName: prefect-job] 19:18:19.214 | INFO | prefect.engine - Engine execution of flow run '8ca83100-dcc3-46d5-91be-f342b19b45a9' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.
This appears to happen on random jobs, which leads me to believe that GKE thinks the cluster is overprovisioned, so it tries to reduce the cluster size and move jobs around; but jobs can't be moved in the middle of execution, so they Crash/Fail. I am also curious whether this is due to resource sizing, but the jobs I have been troubleshooting don't show any insufficient resource problems. They all typically report the following in the containerStatuses leaf of the pod's JSON:
state: {
  terminated: {
    containerID: "containerd://aac705"
    exitCode: 143
    finishedAt: "2023-02-03T19:18:19Z"
    reason: "Error"
    startedAt: "2023-02-03T19:16:52Z"
  }
}
Any insight would be greatly appreciated!
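(For anyone who wants to reproduce that check, a quick Kubernetes Python client sketch like the one below should pull the same containerStatuses block; the namespace is a placeholder, not something from our setup.)
from kubernetes import client, config

config.load_kube_config()

# Read the flow-run pod and print any terminated container states.
pod = client.CoreV1Api().read_namespaced_pod(
    name="adorable-axolotl-d8k8c-6dx5c", namespace="default"  # placeholder namespace
)
for status in pod.status.container_statuses or []:
    term = status.state.terminated
    if term:
        # exitCode 143 is 128 + SIGTERM(15), i.e. the container was killed
        # externally (here, by the node drain) rather than failing on its own.
        print(status.name, term.exit_code, term.reason, term.finished_at)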

Walter Cavinaw

02/03/2023, 10:05 PM
I have the same setup and a similar problem. It's hard to debug as well because it happens randomly!
This Stack Overflow post helped shine some light on it: https://stackoverflow.com/questions/71509986/with-gke-autopilot-banning-the-cluster-autoscaler-kubernetes-io-safe-to-evict-fa
Out of curiosity, how many flows are you running when this happens? I only notice this error when we have many things running at the same time.

Keith

02/04/2023, 3:12 AM
out of curiosity how many flows are you running when this happens?
We had no issues with this until ~3 weeks ago, when the number of our jobs went up significantly. It happens more often when there are more jobs running at the same time, but I have also observed it happening when only 2 jobs are running.
Thanks for the SO post; I had found that this morning as well and am working on a post to this thread because I'm not sure that setting will help. But maybe the setting in your issue tracker would be helpful.

Walter Cavinaw

02/04/2023, 3:15 AM
It might also be worth trying to add emptyDir local storage. These docs say that a pod will not get scaled down if it has local storage attached: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-types-of-pods-can-prevent-ca-from-removing-a-node
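Roughly, as a sketch of the KubernetesJob customizations (the volume name, mount path, and container index are assumptions, so adjust them to your job template):
from prefect.infrastructure import KubernetesJob

# Sketch: attach an emptyDir volume so the autoscaler sees the pod as having
# local storage. Names and mount path are made up for illustration.
job = KubernetesJob(
    customizations=[
        {
            "op": "add",
            "path": "/spec/template/spec/volumes",
            "value": [{"name": "scratch", "emptyDir": {}}],
        },
        {
            "op": "add",
            "path": "/spec/template/spec/containers/0/volumeMounts",
            "value": [{"name": "scratch", "mountPath": "/scratch"}],
        },
    ],
)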
Just FYI, I've paid to escalate the issue with GCP support. It was 500. This is quite disruptive for us, so if you can, do you mind also making a comment on that issue tracker and clicking the +1 in the top right? The more attention this gets, the better. Maybe they can fix it soon.

Keith

02/06/2023, 9:39 PM
do you mind also making a comment on that issue tracker and clicking the +1 in the top right.
Done, thank you for leading this!
I am also curious if anyone on the Prefect team has a comment on this?
it might also be worth trying to add emptyDir local storage.
Have you tried this?

Walter Cavinaw

02/07/2023, 12:04 AM
Yes, I tried this. It didn't work. The only thing I didn't try is adding a pod disruption budget, because it seems like a lot of work and isn't guaranteed to work given the limits on what a PDB can prevent.
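For reference only (since I didn't actually try it), a PDB would look roughly like this with the Kubernetes Python client; the namespace and label selector are assumptions, and the flow-run pods would need a matching label for it to apply:
from kubernetes import client, config

config.load_kube_config()

# Sketch: a PodDisruptionBudget that disallows voluntary evictions of pods
# carrying a (hypothetical) app=prefect-flow-run label.
pdb = client.V1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="prefect-flow-runs"),
    spec=client.V1PodDisruptionBudgetSpec(
        max_unavailable=0,
        selector=client.V1LabelSelector(match_labels={"app": "prefect-flow-run"}),
    ),
)
client.PolicyV1Api().create_namespaced_pod_disruption_budget(
    namespace="prefect", body=pdb  # placeholder namespace
)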
I resorted to using a GKE standard cluster.
Otherwise Autopilot would be the perfect solution.

Keith

02/07/2023, 12:10 AM
Is it working as expected on a standard cluster? Are you passing the safe-to-evict annotation as part of the job config?

Walter Cavinaw

02/07/2023, 12:11 AM
Yes, and the crashing error is gone 🙏 I am adding this to the KubernetesJob customizations:
{
    "op": "add",
    "path": "/spec/template/metadata",
    "value": {"annotations": {"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"}},
},
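For completeness, attached to a Prefect 2 KubernetesJob block it looks roughly like this (the block name here is just an example):
from prefect.infrastructure import KubernetesJob

# Sketch: patch the job's pod template so the cluster autoscaler treats the
# flow-run pod as not safe to evict.
k8s_job = KubernetesJob(
    customizations=[
        {
            "op": "add",
            "path": "/spec/template/metadata",
            "value": {
                "annotations": {
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            },
        },
    ],
)
k8s_job.save("gke-standard-no-evict", overwrite=True)  # example block name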

Keith

02/07/2023, 12:15 AM
Amazing! 😎

Walter Cavinaw

02/07/2023, 12:18 AM
Tbh, I never actually tried this specific thing on Autopilot because I assumed it wouldn't work, but you might as well try it. I see some other Autopilot issues in the issue tracker that seem to be resolved already but are not marked as such. There is a small chance this is already solved but not closed?