Prefect Community #ask-community

brian

01/13/2022, 3:48 PM

Hi prefecters! I’m having an issue where tasks are getting killed by the ZombieKiller after a short period of time, 30m on yesterday’s flow run.

brian

01/13/2022, 3:48 PM

I’m using prefect cloud and running the agent and jobs in a kubernetes cluster

brian

01/13/2022, 3:49 PM

Here’s the error on the task run page

brian

01/13/2022, 3:49 PM

From the logs:

brian

01/13/2022, 3:50 PM

Also, I’m seeing errors like this in the agent logs

ERROR:prism-acuity-aks-agent:Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
[2022-01-12 08:44:23,504] ERROR - prism-acuity-aks-agent | Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
[2022-01-12 08:45:23,605] ERROR - prism-acuity-aks-agent | Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
ERROR:prism-acuity-aks-agent:Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
ERROR:prism-acuity-aks-agent:Error while managing existing k8s jobs
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 384, in heartbeat
    self.manage_jobs()
  File "/usr/local/lib/python3.8/site-packages/prefect/agent/kubernetes/agent.py", line 229, in manage_jobs
    for event in sorted(
TypeError: '<' not supported between instances of 'datetime.datetime' and 'NoneType'
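
The `TypeError` in the traceback above is what Python raises when `sorted()` has to compare a `datetime` with `None`. Here is a minimal reproduction with a None-safe sort key. This is a sketch, not the agent's actual code: the event dictionaries, the `last_timestamp` field name, and the `sort_events` helper are all hypothetical.

```python
from datetime import datetime, timezone

# Hypothetical event records standing in for whatever the agent sorts.
events = [
    {"reason": "Scheduled", "last_timestamp": datetime(2022, 1, 12, 8, 44, tzinfo=timezone.utc)},
    {"reason": "Pulled", "last_timestamp": None},  # missing timestamp
    {"reason": "Started", "last_timestamp": datetime(2022, 1, 12, 8, 45, tzinfo=timezone.utc)},
]


def sort_events(records):
    """Sort by timestamp, placing records without a timestamp first."""
    earliest = datetime.min.replace(tzinfo=timezone.utc)
    return sorted(records, key=lambda e: e["last_timestamp"] or earliest)


# Sorting on the raw field fails as soon as a None is compared with a datetime.
try:
    sorted(events, key=lambda e: e["last_timestamp"])
except TypeError as exc:
    error_message = str(exc)

# The None-safe key avoids the comparison error.
ordered = [e["reason"] for e in sort_events(events)]
```

Substituting a fallback value in the sort key is only one option. Filtering out records without a timestamp before sorting would work as well.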

brian

01/13/2022, 3:57 PM

Agent is running this image

prefecthq/prefect:0.14.21-python3.8

Anna Geller

01/13/2022, 4:22 PM

I think it should be Prefectionists 😄

This is a flow heartbeat issue. Prefect uses heartbeats to check whether your flow is still alive. Without heartbeats, flows that lose communication and die would be shown as Running in the UI permanently.

There are two possible reasons why flow runs deployed as Kubernetes jobs may lose their heartbeat and eventually get killed by the Zombie Killer process:
1. You have a long-running job
2. You run out of memory

What you can try:
1. Check whether upgrading to a more recent Prefect version helps mitigate the issue
2. Change the flow heartbeat mode to thread (the default is process) by adding this env variable:

from prefect.run_configs import KubernetesRun
flow.run_config = KubernetesRun(env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"})
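
To make the heartbeat idea concrete, here is a conceptual sketch of how a zombie check can work: any run whose last heartbeat is older than a timeout is flagged. This is not Prefect's implementation. The `find_zombies` function, the run IDs, and the 10-minute timeout are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta


def find_zombies(last_heartbeats, now, timeout):
    """Return IDs of runs whose most recent heartbeat is older than `timeout`."""
    return sorted(
        run_id
        for run_id, last_seen in last_heartbeats.items()
        if now - last_seen > timeout
    )


now = datetime(2022, 1, 13, 16, 0)

# Hypothetical runs: one reported 30 seconds ago, one has been silent for 30 minutes.
last_heartbeats = {
    "run-a": now - timedelta(seconds=30),
    "run-b": now - timedelta(minutes=30),
}

# The timeout value is an assumption, not Prefect's actual setting.
zombies = find_zombies(last_heartbeats, now, timeout=timedelta(minutes=10))
```

A run that is still doing useful work but cannot send its heartbeat looks the same to this check as a run that has died, which is why a healthy long-running job can still get flagged.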

Anna Geller

01/13/2022, 4:24 PM

Option 3. Try allocating more memory in your KubernetesRun:

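from prefect import Flow
from prefect.run_configs import KubernetesRun

# FLOW_NAME and STORAGE are placeholders for your own flow name and storage object
with Flow(
    FLOW_NAME,
    storage=STORAGE,
    run_config=KubernetesRun(
        labels=["k8s"],
        cpu_request=0.5,
        memory_request="2Gi",  # increase from whatever you have now
    ),
) as flow:
    ...  # define your tasks here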

brian

01/13/2022, 4:38 PM

Thanks for the reply! These are long-running Databricks jobs. The job actually succeeded in Databricks after 12h. I’ll try the things you recommended.

👍 1
