# prefect-community
s
Prefect Version: 2.7.0. Agent: Azure Kubernetes. Problem: We just experienced an issue where our Prefect agent crashed. Two flow runs were submitted to the same work queue, which has a concurrency limit of 5; these flow runs trigger every 5 minutes. About 15 seconds later, the k8s pod running the Prefect agent crashed. K8s automatically created a new agent pod and our flow runs resumed as normal. The k8s resource consumption looks normal: CPU usage 5%, memory usage 10%. This issue has occurred about once every two weeks for the last month or so. Between crashes, everything on the k8s cluster (the Prefect agent and the Prefect flow runs as k8s jobs) runs normally. Traceback from the agent pod:
```
13:39:09.646 | INFO    | prefect.agent - Submitting flow run '64c00c09-07a6-495b-ada4-a4088e122434'
13:39:09.647 | INFO    | prefect.agent - Submitting flow run 'beb26192-33fb-4988-985b-b497c35573f0'
13:39:10.205 | INFO    | prefect.infrastructure.kubernetes-job - Job 'bright-leech-f77cl': Pod has status 'Pending'.
13:39:10.208 | INFO    | prefect.infrastructure.kubernetes-job - Job 'gleaming-griffin-w55qm': Pod has status 'Pending'.
13:39:10.240 | INFO    | prefect.agent - Completed submission of flow run '64c00c09-07a6-495b-ada4-a4088e122434'
13:39:10.245 | INFO    | prefect.agent - Completed submission of flow run 'beb26192-33fb-4988-985b-b497c35573f0'
13:39:11.736 | INFO    | prefect.infrastructure.kubernetes-job - Job 'bright-leech-f77cl': Pod has status 'Running'.
13:39:11.912 | INFO    | prefect.infrastructure.kubernetes-job - Job 'gleaming-griffin-w55qm': Pod has status 'Running'.
13:39:24.462 | ERROR   | prefect.agent - An error occured while monitoring flow run '92139cd4-23df-4396-9150-9804a17968d6'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 761, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 444, in _error_catcher
    yield
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 828, in read_chunked
    self._update_chunk_length()
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 765, in _update_chunk_length
    raise InvalidChunkLength(self, line)
urllib3.exceptions.InvalidChunkLength: InvalidChunkLength(got length b'', 0 bytes read)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 417, in _submit_run_and_capture_errors
    result = await infrastructure.run(task_status=task_status)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 277, in run
    return await run_sync_in_worker_thread(self._watch_job, job_name)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 69, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 527, in _watch_job
    for log in logs.stream():
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 624, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 816, in read_chunked
    with self._error_catcher():
  File "/usr/local/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/usr/local/lib/python3.10/site-packages/urllib3/response.py", line 461, in _error_catcher
    raise ProtocolError("Connection broken: %r" % e, e)

urllib3.exceptions.ProtocolError: ("Connection broken: InvalidChunkLength(got length b'', 0 bytes read)", InvalidChunkLength(got length b'', 0 bytes read))
13:39:29.507 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'gleaming-griffin-w55qm': Job did not complete.
13:39:29.511 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'bright-leech-f77cl': Job did not complete.
```
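For reference, the setup described above amounts to a deployment scheduled every 5 minutes on a work queue with a concurrency limit. A minimal sketch using the Prefect 2.7 Python API; the flow, deployment, and queue names are illustrative, not from the thread:

```python
from datetime import timedelta

from prefect import flow
from prefect.deployments import Deployment
from prefect.orion.schemas.schedules import IntervalSchedule


@flow
def my_flow():
    ...  # stand-in for the real flow body


# Deployment that triggers every 5 minutes on the "k8s-queue" work queue.
# The queue's concurrency limit (5) is set separately, e.g. with:
#   prefect work-queue set-concurrency-limit "k8s-queue" 5
deployment = Deployment.build_from_flow(
    flow=my_flow,
    name="every-5-minutes",
    work_queue_name="k8s-queue",
    schedule=IntervalSchedule(interval=timedelta(minutes=5)),
)
deployment.apply()
```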
z
Hi! This looks like it may be a bug in the `kubernetes` package, the `urllib3` package, or an issue in your networking.
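For context, the failing frame in the traceback (`for log in logs.stream():`) is the kubernetes client reading pod logs as a chunked urllib3 response, which surfaces as `ProtocolError`/`InvalidChunkLength` when the connection is cut mid-stream. A minimal sketch of that call path, with a placeholder pod name and namespace, catching the error the agent hit:

```python
import urllib3
from kubernetes import client, config

# Assumes the code runs inside the cluster, like the agent pod does.
config.load_incluster_config()
core_v1 = client.CoreV1Api()

# _preload_content=False returns the raw urllib3 response so the logs
# can be streamed chunk by chunk instead of read all at once.
logs = core_v1.read_namespaced_pod_log(
    name="example-pod",    # placeholder pod name
    namespace="default",   # placeholder namespace
    follow=True,
    _preload_content=False,
)
try:
    for line in logs.stream():
        print(line.decode(errors="replace"), end="")
except urllib3.exceptions.ProtocolError as exc:
    # The chunked stream was severed (network hiccup, API-server
    # connection reset); the job itself may still be running fine.
    print(f"log stream interrupted: {exc}")
```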
s
OK, we've used Prefect for over a year (v1 originally) and hadn't run into this issue until about a month ago, which is why I thought it might be a Prefect issue. I'll dig into it more and upgrade to 2.7.8 in the meantime.
z
Prefect 1's agent doesn't monitor the jobs it creates. You can disable `stream_output` on your Kubernetes job block, which will avoid trying to read logs like this and may help.
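A minimal sketch of flipping that setting on an existing block, assuming it was saved under the (hypothetical) name "k8s-job":

```python
from prefect.infrastructure import KubernetesJob

# Load the saved block, turn off log streaming, and save it back.
# "k8s-job" is a placeholder for your actual block name.
k8s_job = KubernetesJob.load("k8s-job")
k8s_job.stream_output = False
k8s_job.save("k8s-job", overwrite=True)
```

With `stream_output` off, the agent should still watch the job for completion; it just stops reading pod logs over the connection that was breaking here.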
s
I'll give that a try. Thanks!