
Shane Satterfield

05/04/2023, 4:36 PM
Our work queues are getting backed up because the Prefect agent is failing silently and not restarting in the Kubernetes Helm deployment. This is the error that we're seeing.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/_internal/concurrency/services.py", line 120, in _run
    async with self._lifespan():
  File "/usr/local/lib/python3.10/contextlib.py", line 199, in __aenter__
    return await anext(self.gen)
  File "/usr/local/lib/python3.10/site-packages/prefect/events/worker.py", line 30, in _lifespan
    async with self._client:
  File "/usr/local/lib/python3.10/site-packages/prefect/events/clients.py", line 118, in __aenter__
    await self._reconnect()
  File "/usr/local/lib/python3.10/site-packages/prefect/events/clients.py", line 136, in _reconnect
    self._websocket = await self._connect.__aenter__()
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 637, in __aenter__
    return await self
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 655, in __await_impl_timeout__
    return await self.__await_impl__()
  File "/usr/local/lib/python3.10/site-packages/websockets/legacy/client.py", line 659, in __await_impl__
    _transport, _protocol = await self._create_connection()
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1036, in create_connection
    infos = await self._ensure_resolved(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 1418, in _ensure_resolved
    return await loop.getaddrinfo(host, port, family=family, type=type,
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 863, in getaddrinfo
    return await self.run_in_executor(
  File "/usr/local/lib/python3.10/asyncio/base_events.py", line 821, in run_in_executor
    executor.submit(func, *args), loop=self)
  File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 167, in submit
    raise RuntimeError('cannot schedule new futures after shutdown')
RuntimeError: cannot schedule new futures after shutdown
Not sure where this error is coming from, but I would expect any error that terminates the Prefect agent to trigger a pod restart. I checked the Helm chart and I don't see a liveness probe there. Is this something that the Prefect team can add? I assume the blocker is that the Prefect agent may not have a good way to check that it's still alive.
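(For reference, the liveness probe being asked about would look roughly like the sketch below on the agent container. This is illustrative only: the httpGet path and port assume a health endpoint that the agent did not expose at the time of this thread, which is exactly the blocker described above.)

containers:
  - name: prefect-agent
    livenessProbe:
      httpGet:
        path: /health   # hypothetical endpoint, not exposed by the agent at this time
        port: 8080      # hypothetical port
      initialDelaySeconds: 30
      periodSeconds: 30
      failureThreshold: 3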

Zanie

05/04/2023, 4:39 PM
Is this error happening repeatedly? cc @Chris Pickett
This is one of our background workers, which should never prevent the process from exiting, but there are some bugs we’ve been addressing as we see them.
Regarding a healthcheck, we should definitely add one but have not yet.

Shane Satterfield

05/04/2023, 4:43 PM
This only happens once per pod and there are no additional logs after this error message. Looks like it killed the agent process, but the pod is still running since there's no healthcheck. We've run into this issue a few times over the course of the week. We can temporarily resolve it by restarting the pods.

Chris Pickett

05/04/2023, 4:47 PM
We had an issue where the thread we spin up to replicate Kubernetes events to Prefect Cloud would not exit properly. This could be related to that, so you might try updating to prefect-kubernetes 0.2.7, which was just released (~30m ago), and see if that resolves the issue. If not, it might be, as Zanie points out, an issue with our background workers.

Zanie

05/04/2023, 4:50 PM
cc @jawnsy when the agent process is killed, shouldn’t the pod exit? If the pod isn’t exiting when the process exits, that would explain some reports I’ve gotten.

Shane Satterfield

05/04/2023, 4:52 PM
@Chris Pickett Should I install prefect-kubernetes into our flow code, or is this packaged within the Prefect agent Docker image?

Chris Pickett

05/04/2023, 4:53 PM
It’s part of the Helm chart, but you’ll need to set an env var:
extraEnvVars:
  - name: EXTRA_PIP_PACKAGES
    value: "prefect-kubernetes==0.2.7"

Shane Satterfield

05/04/2023, 4:58 PM
Just deployed this change, but now the agents are all logging this error:
16:56:27.361 | INFO    | prefect.agent - Found 1 flow runs awaiting cancellation.
16:56:27.460 | ERROR   | prefect.agent - Failed to get infrastructure for flow run '3f7416c8-18b7-47c9-a28c-5e485f4e0dcc'. Flow run cannot be cancelled.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 321, in cancel_run
    infrastructure = await self.get_infrastructure(flow_run)
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 380, in get_infrastructure
    deployment = await self.client.read_deployment(flow_run.deployment_id)
  File "/usr/local/lib/python3.10/site-packages/prefect/client/orchestration.py", line 1485, in read_deployment
    response = await self._client.get(f"/deployments/{deployment_id}")
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1754, in get
    return await self.request(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1530, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 280, in send
    response.raise_for_status()
  File "/usr/local/lib/python3.10/site-packages/prefect/client/base.py", line 137, in raise_for_status
    raise PrefectHTTPStatusError.from_httpx_error(exc) from exc.__cause__
prefect.exceptions.PrefectHTTPStatusError: Client error '405 Method Not Allowed' for url 'https://api.prefect.cloud/api/accounts/3da50c77-9f56-49aa-97f9-c1a5757459c1/workspaces/396ade34-4a2c-4b69-96d9-a24cd0e57449/deployments/None'
Response: {'detail': 'Method Not Allowed'}
For more information check: https://httpstatuses.com/405

Chris Pickett

05/04/2023, 5:02 PM
Hmm, that URL looks like it was malformed: deployments/None

Shane Satterfield

05/04/2023, 5:02 PM
I was able to resolve this by going into the Prefect Cloud dashboard, locating that flow run, and deleting it manually there. After it was deleted, the agents continued processing flow runs.
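(For anyone hitting the same 405: the URL ends in deployments/None because the flow run awaiting cancellation had no deployment_id, so the agent's read_deployment call received None. Besides deleting the run in the UI as described above, a command along these lines should also work on recent Prefect 2.x CLI versions, using the flow run ID from the error message:)

prefect flow-run delete 3f7416c8-18b7-47c9-a28c-5e485f4e0dcc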