Keith
02/03/2023, 8:12 PM
We have been seeing a number of Crashed jobs recently. I am trying to do root cause analysis on them, so I started by digging through the logs in Prefect Cloud. What I see is that at some point (it is random) the logs stop and nothing further is output to the UI.
Digging through the logs in Google Logs Explorer I see the same behavior: the Prefect container logs stop at that same specific point in time. In Google Logs I am also able to see a lot of Kubernetes-related logs, and I am starting to see a pattern, but it is not clear how to fix it.
• Roughly 5-10 seconds after the last log line, this shows up:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.nodeName: gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc] marked the node as toBeDeleted/unschedulable
• Quickly followed by:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: removing node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc, utilization: {0.5538631957906397 0.1841863664058054 0 cpu 0.5538631957906397}, pods to reschedule: adorable-axolotl-d8k8c-6dx5c
◦ INFO 2023-02-03T19:18:38Z [resource.labels.clusterName: prefect-autopilot-cluster-1] Scale-down: node gk3-prefect-autopilot-cl-nap-ji2s72nv-db29cac6-hxzc removed with drain
• GKE tries to reschedule the job, but it fails with the following, which is when Prefect alerts for the Crashed state:
◦ INFO 2023-02-03T19:18:11Z [resource.labels.podName: adorable-axolotl-d8k8c-6dx5c] deleting pod for node scale down
◦ ERROR 2023-02-03T19:18:19.215934101Z [resource.labels.containerName: prefect-job] 19:18:19.214 | INFO | prefect.engine - Engine execution of flow run '8ca83100-dcc3-46d5-91be-f342b19b45a9' aborted by orchestrator: This run cannot transition to the RUNNING state from the RUNNING state.
This appears to be happening to jobs at random, which leads me to believe that GKE thinks the cluster is overprovisioned, so it tries to reduce the cluster size and move jobs around; but jobs can't be moved in the middle of execution, so they Crash/Fail. I am also curious whether this is due to resource sizing, but the jobs I have been troubleshooting don't show any insufficient-resource problems. They all typically report the following in the containerStatuses leaf of the pod JSON:
state: {
  terminated: {
    containerID: "<containerd://aac705>"
    exitCode: 143
    finishedAt: "2023-02-03T19:18:19Z"
    reason: "Error"
    startedAt: "2023-02-03T19:16:52Z"
  }
}
Any insight would be greatly appreciated!
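Worth noting: exitCode 143 is 128 + SIGTERM (15), i.e. the container was killed by the drain's termination signal rather than running out of resources (an out-of-memory kill would typically show exitCode 137 with reason OOMKilled). The usual mitigation for the cluster autoscaler evicting batch pods during scale-down is the safe-to-evict annotation on the pod template. Below is a minimal sketch of where it would sit on the flow-run Job, assuming the manifest the agent submits can be customized; whether GKE Autopilot honors the annotation depends on the cluster version, so verify against the GKE docs. The Job name and image are illustrative.

# Sketch only: annotate the flow-run pod so the cluster autoscaler will not
# drain its node during scale-down. Name and image below are placeholders.
apiVersion: batch/v1
kind: Job
metadata:
  name: example-flow-run-job              # the agent generates the real name
spec:
  template:
    metadata:
      annotations:
        cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
    spec:
      restartPolicy: Never
      containers:
        - name: prefect-job                # matches containerName in the logs above
          image: prefecthq/prefect:2-latest   # placeholder image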
Walter Cavinaw
02/03/2023, 10:05 PM
Keith
02/04/2023, 3:12 AM
> out of curiosity how many flows are you running when this happens?
We had no issues with this up until ~3 weeks ago, when the number of our jobs went up much higher. It happens more often when there are more jobs running at the same time, but I have also observed it happening when only 2 jobs are running.
Walter Cavinaw
02/04/2023, 3:15 AM
Keith
02/06/2023, 9:39 PM
> do you mind also making a comment on that issue tracker and clicking the +1 in the top right.
Done, thank you for leading this!
> it might also be worth trying to add emptyDir local storage.
Have you tried this?
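For context on the emptyDir suggestion: by default the upstream cluster autoscaler will not scale down a node that is running pods with local storage (emptyDir or hostPath volumes), which is presumably the rationale here; whether Autopilot's managed autoscaler behaves the same way is worth confirming. A minimal sketch of what that could look like on the flow-run pod template, with a made-up volume name and mount path:

# Sketch only: give the flow-run pod an emptyDir volume so the autoscaler
# treats the node as having local storage. Volume name and path are illustrative.
spec:
  template:
    spec:
      containers:
        - name: prefect-job
          volumeMounts:
            - name: scratch
              mountPath: /scratch
      volumes:
        - name: scratch
          emptyDir: {}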
Walter Cavinaw
02/07/2023, 12:04 AM
Keith
02/07/2023, 12:10 AM
Did you set the safe-to-evict flag as part of the job config?
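If the flow runs use Prefect 2's KubernetesJob infrastructure block, one way the annotation is commonly supplied as part of the job config is the block's customizations field, which takes a JSON 6902 patch applied to the base Job manifest; treat this as a sketch and verify the mechanism for your Prefect version. Note that the add operation overwrites any annotations map already present on the pod template.

# Sketch only: a JSON 6902 patch (written as YAML) adding the annotation to the
# Job's pod template. Assumes no other annotations exist at that path.
- op: add
  path: /spec/template/metadata/annotations
  value:
    cluster-autoscaler.kubernetes.io/safe-to-evict: "false"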
Walter Cavinaw
02/07/2023, 12:11 AM
Keith
02/07/2023, 12:15 AM
Walter Cavinaw
02/07/2023, 12:18 AM