Clément Frison

about 1 year ago
Hi team, I've been using Prefect with Kubernetes successfully for a while now, but over the past two weeks I've run into intermittent errors during flow deployments. Sometimes everything works fine, and other times all tasks fail to deploy. The error message typically looks like the one below. It appears to be a connection-refused error when calling the warden-mutating.common-webhooks.networking.gke.io webhook. Could it be caused by overload from too many tasks being scheduled at once? I'd love to hear your thoughts if you have any insight into what might be causing these errors and how to solve them.
Failed to submit flow run '6bddeb25-71d8-4add-92de-41d200887583' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 632, in _create_job
    job = batch_client.create_namespaced_job(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 276, in POST
    return self.request("POST", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'b3aa49b8-60fd-4da9-b6c5-55d47dbc1f6d', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Warning': '299 - "unknown field \\"spec.template.spec.completions\\"", 299 - "unknown field \\"spec.template.spec.parallelism\\""', 'X-Kubernetes-Pf-Flowschema-Uid': '312d4f4a-edd6-4ea6-aee4-cb1ea8654799', 'X-Kubernetes-Pf-Prioritylevel-Uid': '3f4f1eb8-f614-433c-af7b-3844f105a5a8', 'Date': 'Thu, 11 Jul 2024 15:00:23 GMT', 'Content-Length': '619'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Internal error occurred: failed calling webhook \"warden-mutating.common-webhooks.networking.gke.io\": failed to call webhook: Post \"https://localhost:5443/webhook/warden-mutating?timeout=10s\": dial tcp [::1]:5443: connect: connection refused","reason":"InternalError","details":{"causes":[{"message":"failed calling webhook \"warden-mutating.common-webhooks.networking.gke.io\": failed to call webhook: Post \"https://localhost:5443/webhook/warden-mutating?timeout=10s\": dial tcp [::1]:5443: connect: connection refused"}]},"code":500}


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 834, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 510, in run
    job = await run_sync_in_worker_thread(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 641, in _create_job
    message += ": " + exc.body["message"]
                      ~~~~~~~~^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'
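
A side note on the second traceback: the TypeError suggests that exc.body arrives as a raw JSON string rather than a parsed dict, so indexing it with "message" fails and hides the original webhook error. Below is a minimal sketch of how such a body could be read defensively. The helper name extract_api_error_message and the shortened sample body are hypothetical, for illustration only; this is not the prefect_kubernetes implementation.

```python
import json


def extract_api_error_message(body):
    """Return the 'message' field of a Kubernetes API error body, if present.

    Hypothetical helper: accepts a dict, a JSON string, bytes, or None.
    """
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    if isinstance(body, str):
        try:
            # The API client may hand back the body as unparsed JSON text.
            body = json.loads(body)
        except json.JSONDecodeError:
            # Not JSON at all: fall back to the raw text.
            return body
    if isinstance(body, dict):
        return body.get("message")
    return None


# Shortened, illustrative version of the response body shown above.
raw_body = (
    '{"kind":"Status","status":"Failure",'
    '"message":"Internal error occurred: failed calling webhook","code":500}'
)
print(extract_api_error_message(raw_body))
```

With a helper like this, the worker log would surface the webhook message instead of the secondary TypeError, which makes the underlying connection-refused failure easier to spot.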