Joshua Massover
09/07/2022, 7:55 PM
• the agent log:
[2022-09-07 19:14:16,051] DEBUG - agent | Deleting job prefect-job-c10e0512
• the killing event in my k8s cluster
apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2022-09-07T19:13:23Z"
involvedObject:
  apiVersion: v1
  ...
  kind: Pod
  name: prefect-job-c10e0512-9wm4j
  ...
kind: Event
lastTimestamp: "2022-09-07T19:13:23Z"
message: Stopping container prefect-container-prepare
...
reason: Killing
...
type: Normal
• i can see via metrics that i am not oom'ing or doing anything that seems like it should trigger the job being killed (a sketch for double-checking this from the pod's termination state follows this list)
• a single flow is running on its own node controlled via the kubernetes cluster autoscaler
• i don't see any reason why the cluster autoscaler would be killing this node, and safe-to-evict is set to false.
• my application logs always just end, there's nothing suspicious in the logs
• there aren't obvious patterns to me. it's not the same job, it's not happening after x amount of minutes.
• i've switched to threaded heartbeats, and then most recently turned off heartbeats entirely, and it hasn't fixed it
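
A minimal sketch for cross-checking what actually terminated the pod, assuming kubeconfig access via the official Python kubernetes client; the "default" namespace is an assumption, and the pod name is the one from the event above:

# Hedged sketch: cross-check what terminated the flow pod.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# All "Killing" events cluster-wide, to correlate with the agent's delete log line.
for ev in v1.list_event_for_all_namespaces(field_selector="reason=Killing").items:
    print(ev.last_timestamp, ev.involved_object.name, ev.message)

# If the pod object still exists, its last terminated state distinguishes an
# OOM kill (reason "OOMKilled", exit code 137) from an external SIGTERM (143).
pod = v1.read_namespaced_pod("prefect-job-c10e0512-9wm4j", "default")
for cs in pod.status.container_statuses or []:
    term = cs.last_state.terminated or cs.state.terminated
    if term:
        print(cs.name, term.reason, term.exit_code, term.finished_at)

If the pod has already been garbage-collected along with the job, the same termination reason is usually still visible in the events for roughly an hour (the default event retention).
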
1. there's a chicken/egg problem i'm not sure about: in the agent log, is the agent issuing the request to the k8s cluster to kill the job, or is it only deleting the job after kubernetes has already killed it for some reason?
2. Any suggestions for how to debug a killed flow in a kubernetes cluster that uses cluster autoscaling? I can see from the event that it's being killed, but it's a herculean task to figure out why.
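
For question 2, one hedged starting point (assuming the autoscaler publishes its default status ConfigMap, "cluster-autoscaler-status" in kube-system, and reusing the placeholder namespace and pod name from the sketch above) is to check node-level events, the autoscaler's own view of scale-downs, and that the safe-to-evict annotation is really on the pod:

# Hedged sketch: is the autoscaler (or the node) behind the kill?
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Confirm the pod actually carries the safe-to-evict annotation.
pod = v1.read_namespaced_pod("prefect-job-c10e0512-9wm4j", "default")
print((pod.metadata.annotations or {}).get(
    "cluster-autoscaler.kubernetes.io/safe-to-evict"))

# Node events (scale-downs, preemptions, node problems) land in the "default" namespace.
for ev in v1.list_namespaced_event(
        "default", field_selector="involvedObject.kind=Node").items:
    print(ev.last_timestamp, ev.reason, ev.involved_object.name, ev.message)

# The autoscaler's status ConfigMap lists recent scale-down candidates and activity.
cm = v1.read_namespaced_config_map("cluster-autoscaler-status", "kube-system")
print(cm.data)

If none of this shows a scale-down around the deletion timestamp, the agent's own delete (question 1) becomes the more likely suspect.
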
Rob Freedy
09/09/2022, 2:56 PM

Joshua Massover
09/09/2022, 3:08 PM

Rob Freedy
09/09/2022, 7:16 PM

Joshua Massover
09/09/2022, 9:03 PM