# prefect-kubernetes
c
Hi team, I've been successfully using Prefect with Kubernetes for a while now. But over the past two weeks, I've encountered intermittent errors during flow deployments. Sometimes everything works fine, and other times all flow runs fail to be submitted. The error message typically looks like the one below. The underlying error seems to be a refused connection when the API server calls the warden-mutating.common-webhooks.networking.gke.io webhook. Could it be caused by an overload from too many tasks being scheduled? I'd love to hear your thoughts if you have any insights into what might be causing these errors and how to solve them.
Failed to submit flow run '6bddeb25-71d8-4add-92de-41d200887583' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 632, in _create_job
    job = batch_client.create_namespaced_job(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 276, in POST
    return self.request("POST", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b3aa49b8-60fd-4da9-b6c5-55d47dbc1f6d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "unknown field \\"spec.template.spec.completions\\"", 299 - "unknown field \\"spec.template.spec.parallelism\\""', 'X-Kubernetes-Pf-Flowschema-Uid': '312d4f4a-edd6-4ea6-aee4-cb1ea8654799', 'X-Kubernetes-Pf-Prioritylevel-Uid': '3f4f1eb8-f614-433c-af7b-3844f105a5a8', 'Date': 'Thu, 11 Jul 2024 15:00:23 GMT', 'Content-Length': '619'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"warden-mutating.common-webhooks.networking.gke.io\": failed to call webhook: Post \"https://localhost:5443/webhook/warden-mutating?timeout=10s\": dial tcp [::1]:5443: connect: connection refused","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"warden-mutating.common-webhooks.networking.gke.io\": failed to call webhook: Post \"https://localhost:5443/webhook/warden-mutating?timeout=10s\": dial tcp [::1]:5443: connect: connection refused"}]},"code":500}


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 510, in run
    job = await run_sync_in_worker_thread(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 641, in _create_job
    message += ": " + exc.body["message"]
                      ~~~~~~~~^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'
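A note on the second traceback: the TypeError shows that `exc.body` was a string (the raw JSON shown under "HTTP response body") rather than a dict, so indexing it with `"message"` failed and masked the original 500. Below is a minimal sketch of a more defensive way to pull the message out of such a body. The helper name `extract_api_error_message` is hypothetical and this is not the code used by prefect-kubernetes.

```python
import json


def extract_api_error_message(body):
    """Return the 'message' field of a Kubernetes API error body, if present.

    Accepts a dict, a JSON string, raw bytes, or None.
    """
    if isinstance(body, (bytes, bytearray)):
        body = body.decode("utf-8", errors="replace")
    if isinstance(body, str):
        try:
            body = json.loads(body)
        except json.JSONDecodeError:
            # Not JSON at all: fall back to the raw text.
            return body
    if isinstance(body, dict):
        return str(body.get("message", ""))
    # None or any other decoded JSON value carries no message.
    return ""


raw = '{"kind":"Status","status":"Failure","message":"Internal error occurred","code":500}'
print(extract_api_error_message(raw))
```

With a helper like this, the worker's log line would carry the webhook error itself instead of the secondary TypeError.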
n
hi @Clément Frison - would you be able to retrieve and share the version of `prefect-kubernetes` you're using when you encounter this error?
c
I don't have the "prefect-kubernetes" package installed, I'm only using prefect 2.16.9. And I'm deploying with myflow.deploy(name=…, parameters=deployment_params, work_pool_name="myk8s-work-pool", image=myimage, build=False, push=False, job_variables=job_variables, schedule=schedule)
n
File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 641, in _create_job
sorry, are you not running a kubernetes worker here?
you may not have `prefect-kubernetes` installed locally, but you must have it installed to run a k8s worker. that's the version I'm curious about
c
mmm I don't remember exactly but I think I followed this https://docs.prefect.io/latest/guides/deployment/kubernetes/#deploy-a-worker-using-helm
How could I check the "prefect-kubernetes" version used in my kubernetes worker? By doing a "kubectl describe" I could extract:
From the Label section:
• prefect-version=2.10.20-python3.11-kubernetes
From the Containers section:
• Image: prefecthq/prefect:2.10.20-python3.11-kubernetes
Is this what you are looking for?
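One way to answer the version question directly is to ask Python inside the worker container which packages are installed. The snippet below is a small sketch using only the standard library; how you get a Python shell in the worker pod (for example via `kubectl exec`) depends on your setup, and the helper name `installed_version` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Run this inside the worker container, not on the machine that calls deploy().
for name in ("prefect", "prefect-kubernetes", "kubernetes"):
    print(name, installed_version(name) or "not installed")
```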
n
are you describing the worker pod there or the flow run pod?
c
the "prefect-worker" pod. the flow-run pod instantly crashed so I think I can't access it
if I look into a flow run pod that did run correctly, I don't have the same information, I can just infer that it was based on my docker image which is based on "FROM prefecthq/prefect:2-python3.10"
FYI, if anyone experiences the same type of issue, it seems the problem was the lack of cleanup for completed jobs. Manually deleting all the completed jobs resolved the issue for now.
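For anyone who wants to script that cleanup, here is a rough sketch. It assumes the jobs are given as dicts shaped like the items from `kubectl get jobs -o json`, and that a job counts as completed once `status.completionTime` is set. The function names, the one-hour cutoff, and the use of a local kubeconfig are illustrative assumptions, not what was actually run in this thread.

```python
from datetime import datetime, timedelta, timezone


def finished_job_names(jobs, older_than=timedelta(hours=1), now=None):
    """Return names of Jobs whose completion time is older than the cutoff."""
    now = now or datetime.now(timezone.utc)
    names = []
    for job in jobs:
        completed = job.get("status", {}).get("completionTime")
        if not completed:
            continue  # still running, or finished without a completion time
        finished_at = datetime.strptime(completed, "%Y-%m-%dT%H:%M:%SZ")
        finished_at = finished_at.replace(tzinfo=timezone.utc)
        if now - finished_at >= older_than:
            names.append(job["metadata"]["name"])
    return names


def delete_jobs(names, namespace):
    """Delete the named Jobs; needs the `kubernetes` package and cluster access."""
    # Imported lazily so the selection logic above runs without the package.
    from kubernetes import client, config

    config.load_kube_config()  # assumption: run from a machine with a kubeconfig
    batch = client.BatchV1Api()
    for name in names:
        # Background propagation also removes the pods owned by each Job.
        batch.delete_namespaced_job(name, namespace, propagation_policy="Background")
```

To avoid manual cleanup altogether, Kubernetes Jobs also support the `ttlSecondsAfterFinished` spec field, which lets the cluster remove finished jobs on its own. The Kubernetes work pool may expose this through a `finished_job_ttl` job variable, depending on your version.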