
Joshua Massover

09/07/2022, 7:55 PM
While running with a Kubernetes agent, I see my flows being killed "randomly". I see various things:
• In the agent:
[2022-09-07 19:14:16,051] DEBUG - agent | Deleting job prefect-job-c10e0512
• The killing event in my k8s cluster:
apiVersion: v1
count: 1
eventTime: null
firstTimestamp: "2022-09-07T19:13:23Z"
involvedObject:
  apiVersion: v1
  ...
  kind: Pod
  name: prefect-job-c10e0512-9wm4j
  ...
kind: Event
lastTimestamp: "2022-09-07T19:13:23Z"
message: Stopping container prefect-container-prepare
...
reason: Killing
....
type: Normal
• I can see via metrics that I am not OOMing or doing anything that seems like it should trigger the job being killed.
• A single flow is running on its own node controlled via the Kubernetes cluster autoscaler.
• I don't see any reason why the cluster autoscaler would be killing this node, and safe-to-evict is set to false.
• My application logs always just end; there's nothing suspicious in them.
• There aren't obvious patterns that I can see. It's not the same job, and it's not happening after x amount of minutes.
• I've switched to threaded heartbeats, and most recently turned off heartbeats entirely, and it hasn't fixed it.
1. There's a chicken/egg problem I'm not sure about: in the agent log, is the agent issuing a request to the k8s cluster to kill the job, or does it delete the job after Kubernetes kills it for some reason?
2. Any suggestions for how to debug a killed flow in a Kubernetes cluster that uses cluster autoscaling? I can see from the event that it's being killed, but it's a herculean task to figure out why.
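For anyone hitting something similar, a minimal sketch of pulling the Kubernetes events for a killed flow-run pod with the kubernetes Python client, so the "Killing" event above can be lined up with agent and autoscaler activity (namespace and pod name are placeholders for your own values):

# Sketch only: list events attached to a specific pod and print when/why it was killed.
from kubernetes import client, config

config.load_kube_config()          # or config.load_incluster_config() when running inside the cluster
v1 = client.CoreV1Api()

events = v1.list_namespaced_event(
    namespace="default",           # placeholder namespace
    field_selector="involvedObject.name=prefect-job-c10e0512-9wm4j",
)
for event in events.items:
    print(event.last_timestamp, event.reason, event.message)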

Rob Freedy

09/09/2022, 2:56 PM
Hey Joshua! I believe this article could be helpful. Prefect deletes the jobs after their flows successfully complete; if that is not the behavior you want, it may be worth setting the delete_finished_jobs variable: https://discourse.prefect.io/t/how-and-when-does-a-kubernetes-agent-clean-up-kubernetes-jobs-after-flow-run-completion/951
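If it helps, a rough sketch of what that looks like when starting the agent from Python (the delete_finished_jobs name comes from the post above; exact constructor arguments may differ between Prefect 1.x versions):

# Sketch only: keep finished/killed jobs around for post-mortem debugging.
from prefect.agent.kubernetes import KubernetesAgent

agent = KubernetesAgent(
    namespace="default",           # placeholder: the namespace your flow-run jobs use
    delete_finished_jobs=False,    # leave completed jobs in the cluster instead of deleting them
)
agent.start()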

Joshua Massover

09/09/2022, 3:08 PM
Thanks! The issue I'm running into is that the jobs are not successfully completing.
They are being killed by something, and I cannot seem to figure out why.

Rob Freedy

09/09/2022, 7:16 PM
Without seeing any logs it is tough to say, but my guess is that it is heartbeat related. This Discourse post could also be helpful, specifically this point:
• Heartbeats allow us to track if the flow/task runs are still in progress, but again, we don't have inbound access to your pod to check the state, and if something goes wrong within the pod (e.g., someone deletes the pod, or the pod crashes or faces network issues), heartbeats are the only way for Prefect to detect such an infrastructure issue without having access to your infrastructure.
https://discourse.prefect.io/t/flow-is-failing-with-an-error-message-no-heartbeat-detected-from-the-remote-task/79
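If it does turn out to be heartbeat related, the post above suggests changing the heartbeat mode; a sketch of setting it through the flow's run config (env variable name per the Prefect 1.x config docs, worth double-checking against your version):

# Sketch only: run heartbeats in a thread, or set "off" to disable them entirely.
from prefect import Flow
from prefect.run_configs import KubernetesRun

with Flow("my-flow") as flow:      # placeholder flow
    ...

flow.run_config = KubernetesRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
)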

Joshua Massover

09/09/2022, 9:03 PM
It wasn't heartbeat related. It was just an esoteric k8s/cluster-autoscaling issue, nothing to do with Prefect. In case anyone searches for this later: I found the log of the terminating instance inside the AWS ASG (Auto Scaling group). It was trying to rebalance nodes across availability zones, which isn't important for these jobs. The "AZRebalance" process can be disabled, and since disabling it the jobs have stopped being killed.
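For anyone who needs to do the same, roughly how that looks with boto3 (untested sketch; the group name is a placeholder for your own ASG):

# Sketch only: suspend the AZRebalance process on the ASG backing the flow-run nodes.
import boto3

asg = boto3.client("autoscaling")
asg.suspend_processes(
    AutoScalingGroupName="my-flow-runner-asg",   # placeholder ASG name
    ScalingProcesses=["AZRebalance"],
)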
👍 1