# ask-community
b
Hi prefecters! I’m having an issue where tasks are getting killed by the ZombieKiller after a short period of time, 30m on yesterday’s flow run.
I’m using Prefect Cloud and running the agent and jobs in a Kubernetes cluster.
Here’s the error on the task run page
From the logs:
Also, I’m seeing errors like this in the agent logs
ERROR:prism-acuity-aks-agent:Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
[2022-01-12 08:44:23,504] ERROR - prism-acuity-aks-agent | Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
[2022-01-12 08:45:23,605] ERROR - prism-acuity-aks-agent | Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
ERROR:prism-acuity-aks-agent:Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
ERROR:prism-acuity-aks-agent:Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
Agent is running this image
prefecthq/prefect:0.14.21-python3.8
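For context on the agent traceback: this kind of TypeError is what Python raises when `sorted` has to compare a `datetime` with `None`, for example when some items in the list have no timestamp. The snippet below is only a minimal sketch of that failure mode and one common workaround (falling back to "now" for missing values). The `events` list, the `last_timestamp` attribute, and both helper functions are hypothetical stand-ins, not the code in `prefect/agent/kubernetes/agent.py`.

```python
from datetime import datetime, timezone
from types import SimpleNamespace

# Hypothetical stand-ins for pod events; one of them has no timestamp.
events = [
    SimpleNamespace(
        reason="Scheduled",
        last_timestamp=datetime(2022, 1, 12, 8, 44, tzinfo=timezone.utc),
    ),
    SimpleNamespace(reason="Pulled", last_timestamp=None),
    SimpleNamespace(
        reason="Started",
        last_timestamp=datetime(2022, 1, 12, 8, 43, tzinfo=timezone.utc),
    ),
]


def sort_naively(evts):
    # Raises TypeError as soon as a None key is compared with a datetime key.
    return sorted(evts, key=lambda e: e.last_timestamp)


def sort_with_fallback(evts):
    # Treat a missing timestamp as "now" so every key is a comparable datetime.
    now = datetime.now(timezone.utc)
    return sorted(evts, key=lambda e: e.last_timestamp or now)


try:
    sort_naively(events)
except TypeError as exc:
    print(f"naive sort failed: {exc}")

print([e.reason for e in sort_with_fallback(events)])
```

If the installed agent version sorts events without such a fallback, upgrading the agent image is the first thing to check.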
a
I think it should be Prefectionists 😄 This is a flow heartbeat issue. Prefect uses heartbeats to check whether your flow is still alive. Without heartbeats, flows that lose communication and die would be shown as Running in the UI forever.
There are two possible reasons why flow runs deployed as Kubernetes jobs may lose their heartbeat and eventually get killed by the Zombie Killer process:
1. You have a long-running job
2. You run out of memory
What you can try:
1. Check if upgrading to a more recent Prefect version helps mitigate the issue
2. Change the flow heartbeat mode to thread (default is process) by adding this env variable
from prefect.run_configs import KubernetesRun
flow.run_config = KubernetesRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
Option 3. Try allocating more memory in your KubernetesRun:
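from prefect import Flow
from prefect.run_configs import KubernetesRun

# FLOW_NAME and STORAGE are placeholders for your own flow name and storage
with Flow(
    FLOW_NAME,
    storage=STORAGE,
    run_config=KubernetesRun(
        labels=["k8s"],
        cpu_request=0.5,
        memory_request="2Gi",  # increase whatever you have now
    ),
) as flow:
    ...  # your tasks go here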
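To make the heartbeat idea concrete, here is a rough conceptual sketch of how a zombie check can work: each run records the time of its last heartbeat, and a reaper marks any Running run as Failed once that heartbeat is older than a timeout. This is not Prefect's implementation. `RunRecord`, `reap_zombies`, and the 120-second timeout are all made up for illustration.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RunRecord:
    """Hypothetical record of a run and the time of its last heartbeat."""

    name: str
    state: str = "Running"
    last_heartbeat: float = field(default_factory=time.monotonic)

    def beat(self, now=None):
        # Called periodically by the running flow to signal it is alive.
        self.last_heartbeat = time.monotonic() if now is None else now


def reap_zombies(runs, timeout_s, now=None):
    """Mark Running runs with a stale heartbeat as Failed; return their names."""
    now = time.monotonic() if now is None else now
    reaped = []
    for run in runs:
        if run.state == "Running" and now - run.last_heartbeat > timeout_s:
            run.state = "Failed"
            reaped.append(run.name)
    return reaped


# Explicit clock values keep the example deterministic.
healthy = RunRecord("healthy", last_heartbeat=100.0)
silent = RunRecord("silent", last_heartbeat=0.0)
healthy.beat(now=590.0)  # heartbeat arrived 10 seconds before the check
print(reap_zombies([healthy, silent], timeout_s=120, now=600.0))
```

The point of the sketch is that the reaper only sees heartbeats, not the actual work. A flow run whose heartbeat stops arriving can be marked Failed even though the underlying job is still running and later succeeds.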
b
Thanks for the reply! These are long-running Databricks jobs. The job actually succeeded in Databricks after 12h. I’ll try the things you recommended.