Johnny

09/29/2020, 8:34 PM
Hello! Having an issue with Kubernetes cluster autoscaler for long running (> 21 min) flows similar to issue 3058. I noticed the issue has been marked "closed". What was the solution?
👀 1

Jim Crist-Harif

09/29/2020, 8:51 PM
Hi Johnny, the issue here is that the autoscaler will sometimes kill active jobs, while Prefect (currently) doesn't always cope well with having its jobs killed mid-run. The fix for you would be to modify the job template running on your agent to add the `"cluster-autoscaler.kubernetes.io/safe-to-evict": "false"` annotation to the job's pod template (in `job.spec.template.metadata.annotations`; the autoscaler reads this key as an annotation, not a label). This will prevent the autoscaler from evicting the pods of active jobs.
To update the agent template, you can copy the default template (https://github.com/PrefectHQ/prefect/blob/master/src/prefect/agent/kubernetes/job_spec.yaml), add the `safe-to-evict` setting (and whatever else you want), save it in the same environment that the agent is running in, then point the agent to it by setting the `YAML_TEMPLATE` environment variable to the template's path. I recognize that this is a bit complicated. We're currently working to simplify deployment configuration to make customizing deployments a lot simpler (see https://github.com/PrefectHQ/prefect/pull/3333).
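A rough sketch of the steps above, in case it helps. It patches a job manifest so the pod template carries the `safe-to-evict` setting (placed under `metadata.annotations`, since that is where the cluster autoscaler looks for it), writes the result to a file, and builds an environment with `YAML_TEMPLATE` pointing at that file. The trimmed `job_spec` dict, the `mark_not_evictable` helper, and the temporary output path are all made up for illustration; in practice you would start from the default `job_spec.yaml` linked above and save the file wherever the agent can read it.

```python
import copy
import json
import os
import tempfile

SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def mark_not_evictable(job_spec: dict) -> dict:
    """Return a copy of a Job manifest whose pod template opts out of autoscaler eviction."""
    patched = copy.deepcopy(job_spec)
    # Walk down to job.spec.template.metadata, creating levels that are missing.
    pod_metadata = (
        patched.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("metadata", {})
    )
    # The value must be the string "false", not a boolean.
    pod_metadata.setdefault("annotations", {})[SAFE_TO_EVICT] = "false"
    return patched


# Hypothetical, heavily trimmed stand-in for the agent's default job_spec.yaml.
job_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {"template": {"spec": {"restartPolicy": "Never"}}},
}

patched = mark_not_evictable(job_spec)

# Written as JSON to stay within the standard library; YAML parsers generally
# accept JSON documents. The temporary directory is an assumption for the demo.
template_path = os.path.join(tempfile.mkdtemp(), "job_spec.yaml")
with open(template_path, "w") as f:
    json.dump(patched, f, indent=2)

# Environment to hand to the agent process so it picks up the custom template.
agent_env = {**os.environ, "YAML_TEMPLATE": template_path}
```

Working on a copy keeps the default template untouched, so it is easy to diff the two before restarting the agent with the new environment.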

Johnny

09/29/2020, 8:56 PM
thank you! very very helpful info. I've been stuck on this for 2 days 🙂
👍 1

Jim Crist-Harif

09/29/2020, 8:56 PM
Feel free to reach out sooner next time, we're always happy to help. Hope the above tips work for you :).

Johnny

09/29/2020, 9:08 PM
will do!