Keith

over 2 years ago
Hi Prefect Community! We are running our infrastructure on GKE Autopilot and have been seeing an increase in the number of Crashed jobs recently. I am trying to do root cause analysis, so I started by digging through the logs in Prefect Cloud. What I see is that at some point (it appears random) the logs stop and nothing further is output to the UI. Digging through the logs in Google Logs Explorer I see the same behavior: the Prefect container logs stop at the same specific point in time. In Google Logs I am also able to see a lot of Kubernetes-related logs and am starting to see a pattern, but it is not clear how to fix it.
• Roughly 5-10 seconds after the last log, this shows up:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.nodeName: gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc] marked the node as toBeDeleted/unschedulable
• Quickly followed by:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: removing node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc, utilization: {0.5538631957906397 0.1841863664058054 0 cpu 0.5538631957906397}, pods to reschedule: adorable-axolotl-d8k8c-6dx5c
◦ INFO 2023-02-03T19:18:38Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc removed with drain
• GKE tries to reschedule the job but it fails with the following, which is when Prefect alerts for the Crashed state:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.podName: adorable-axolotl-d8k8c-6dx5c] deleting pod for node scale down
◦ ERROR 2023-02-03T19:18:19.215934101Z [resource.labels.containerName: prefect-job] 19:18:19.214 | INFO | prefect.engine - Engine execution of flow run '8ca83100-dcc3-46d5-91be-f342b19b45a9' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.
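If the root cause really is the autoscaler draining the node out from under a running flow, would pinning the job pods be the right fix? As a sketch only (assuming the KubernetesJob block's customizations field takes a JSON 6902 patch list the way the docs describe, and assuming GKE Autopilot even honors this annotation, which I have not verified; the block name is just a placeholder), I was thinking of something like:

from prefect.infrastructure import KubernetesJob

# Ask the cluster autoscaler not to evict the flow-run pod during scale-down.
# The annotation below is the standard cluster-autoscaler mechanism; whether
# Autopilot respects it is something I still need to confirm.
customizations = [
    {
        "op": "add",
        "path": "/spec/template/metadata",
        "value": {
            "annotations": {
                "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
            }
        },
    }
]

k8s_job = KubernetesJob(customizations=customizations)
k8s_job.save("autopilot-no-evict", overwrite=True)  # hypothetical block name

Is that the right direction, or is there a more idiomatic way to keep Prefect job pods from being rescheduled mid-run on Autopilot?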
This appears to be happening on jobs at random, which leads me to believe that GKE thinks the cluster is overprovisioned, so it tries to reduce the cluster size and move jobs around, but the jobs can't be moved in the middle of execution and they Crash/Fail. I am also curious whether this is due to resource sizing, but none of the jobs I have been troubleshooting show insufficient-resource problems. They all typically report the following in the containerStatuses leaf of the pod's JSON:
state: {
  terminated: {
    containerID: "containerd://aac705"
    exitCode: 143
    finishedAt: "2023-02-03T19:18:19Z"
    reason: "Error"
    startedAt: "2023-02-03T19:16:52Z"
  }
}
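For what it is worth, exit code 143 is 128 + 15, i.e. the container exited because it received SIGTERM, which lines up with the kubelet draining the node rather than with any resource pressure. A quick sanity check (plain Python, nothing Prefect-specific):

import signal

exit_code = 143
# Kubernetes reports 128 + N when a container is killed by signal N.
print(signal.Signals(exit_code - 128).name)  # -> SIGTERM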
Any insight would be greatly appreciated!