# prefect-kubernetes
a
hi, I have a flow which was working well for weeks. without any changes to the code, deployment, or cluster at all, it suddenly started failing to pull the flow (from AWS S3). I bumped versions to latest and redeployed, but I still get the same error (pasted in thread). does anybody have ideas what could have happened?
02:24:35.639 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.999985247850418) in thread 'APILogWorkerThread'
02:24:35.640 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - <WatcherThreadCancelScope, name='get' RUNNING, runtime=0.00> entered
02:24:37.640 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - <WatcherThreadCancelScope, name='get' COMPLETED, runtime=2.00> exited
02:24:37.640 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - Encountered exception in call get(timeout=1.999985247850418)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 316, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty
02:24:37.641 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - Running call get(timeout=1.9999849442392588) in thread 'APILogWorkerThread'
02:24:37.641 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - <WatcherThreadCancelScope, name='get' RUNNING, runtime=0.00> entered
02:24:37.778 | DEBUG   | prefect.worker.kubernetes.kubernetesworker 4b7b411e-7e51-4283-88c2-e2b23c4a447c - Querying for flow runs scheduled before 2023-06-16T02:24:47.778695+00:00
02:24:37.794 | DEBUG   | prefect.worker.kubernetes.kubernetesworker 4b7b411e-7e51-4283-88c2-e2b23c4a447c - Discovered 0 scheduled_flow_runs
02:24:39.641 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - <WatcherThreadCancelScope, name='get' COMPLETED, runtime=2.00> exited
02:24:39.641 | DEBUG   | APILogWorkerThread | prefect._internal.concurrency - Encountered exception in call get(timeout=1.9999849442392588)
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/_internal/concurrency/calls.py", line 316, in _run_sync
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/queue.py", line 179, in get
    raise Empty
_queue.Empty
here's a different exception that occurred
Worker 'KubernetesWorker c0671193-c224-4512-9296-f05eaa7a9915' submitting flow run '1e389a86-ac45-44d8-8a73-98478a10e722'
10:38:02 PM
prefect.flow_runs.worker

Creating Kubernetes job...
10:38:03 PM
prefect.flow_runs.worker

Failed to submit flow run '1e389a86-ac45-44d8-8a73-98478a10e722' to infrastructure.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 622, in _create_job
    job = batch_client.create_namespaced_job(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
                    ^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 276, in POST
    return self.request("POST", url,
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
kubernetes.client.exceptions.ApiException: (422)
Reason: Unprocessable Entity
HTTP response headers: HTTPHeaderDict({'Audit-Id': '7ced83d0-1a12-441a-bbf5-54de297d08e3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '402382f4-f741-49b9-bbc2-c68f89e7e5aa', 'X-Kubernetes-Pf-Prioritylevel-Uid': '179087e6-521c-4406-9ba1-845b5684e577', 'Date': 'Fri, 16 Jun 2023 02:38:03 GMT', 'Transfer-Encoding': 'chunked'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Job.batch \"-vfc7h\" is invalid: [metadata.generateName: Invalid value: \"-\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), metadata.name: Invalid value: \"-vfc7h\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*'), spec.template.labels: Invalid value: \"-vfc7h\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')]","reason":"Invalid","details":{"name":"-vfc7h","group":"batch","kind":"Job","causes":[{"reason":"FieldValueInvalid","message":"Invalid value: \"-\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.generateName"},{"reason":"FieldValueInvalid","message":"Invalid value: \"-vfc7h\": a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character (e.g. 'example.com', regex used for validation is '[a-z0-9]([-a-z0-9]*[a-z0-9])?(\\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*')","field":"metadata.name"},{"reason":"FieldValueInvalid","message":"Invalid value: \"-vfc7h\": a valid label must be an empty string or consist of alphanumeric characters, '-', '_' or '.', and must start and end with an alphanumeric character (e.g. 'MyValue',  or 'my_value',  or '12345', regex used for validation is '(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?')","field":"spec.template.labels"}]},"code":422}



During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 827, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 500, in run
    job = await run_sync_in_worker_thread(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 91, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 631, in _create_job
    message += ": " + exc.body["message"]
                      ~~~~~~~~^^^^^^^^^^^
TypeError: string indices must be integers, not 'str'
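
The second traceback hides the real 422 error: `exc.body["message"]` fails because, in this trace at least, the exception body is a JSON string and not a dict. A minimal sketch of the difference, using a shortened stand-in for the response body above (`extract_message` is a hypothetical helper, not part of prefect-kubernetes or the kubernetes client):

```python
import json

# Shortened stand-in for the HTTP response body pasted above. The TypeError
# in the traceback indicates the body arrived as a str, not a parsed dict.
body = '{"kind": "Status", "status": "Failure", "message": "Job.batch \\"-vfc7h\\" is invalid", "code": 422}'


def extract_message(raw_body):
    """Hypothetical helper: pull "message" out of a Status body, else return the raw text."""
    try:
        parsed = json.loads(raw_body)
    except (TypeError, ValueError):
        # Not JSON (or not a str/bytes at all): fall back to the raw text.
        return str(raw_body)
    if isinstance(parsed, dict):
        return str(parsed.get("message", raw_body))
    return str(raw_body)


# Indexing the raw string reproduces the TypeError from the traceback.
try:
    body["message"]
except TypeError as exc:
    indexing_error = str(exc)

# Parsing first recovers the server's actual complaint.
message = extract_message(body)
print(indexing_error)
print(message)
```

So the `TypeError` is only a secondary failure while formatting the error message; the 422 from the API server is the one to debug.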
"status":"Failure","message":"Job.batch \"-vfc7h\"
this jumps out, that string -vfc7h is repeated a bunch of times and I have no clue what it is
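
For context on that string: the error reports `metadata.generateName` as `-` and `metadata.name` as `-vfc7h`, which looks like a random suffix appended to a prefix whose base name is empty. That reading is an inference from the error text, not something the logs confirm. A small sketch that checks both names against the subdomain regex quoted in the error message (`is_valid_subdomain` and the `my-flow` name are made up for illustration, and the real validation also enforces a length limit that is omitted here):

```python
import re

# Regex copied from the API server's error message, with the JSON
# escaping (the doubled backslash) removed.
RFC1123_SUBDOMAIN = re.compile(
    r"[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*"
)


def is_valid_subdomain(value):
    """Hypothetical helper: True if the whole string matches the quoted regex."""
    return RFC1123_SUBDOMAIN.fullmatch(value) is not None


# The name from the error starts with '-', so it is rejected.
print(is_valid_subdomain("-vfc7h"))
# With a non-empty base name in front, the same suffix passes.
print(is_valid_subdomain("my-flow-vfc7h"))
```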
f
Hey @Andy Dienes, I have the same problem (https://prefect-community.slack.com/archives/CL09KU1K7/p1686842996729499) and I found another report here (https://prefect-community.slack.com/archives/CM28LL405/p1686605802505009). Looks like a bug with the new Prefect release.
j
Hey all! Thanks for flagging. I think you’re likely encountering this issue (or a version of it), which we’re looking into: https://github.com/PrefectHQ/prefect/issues/9936
f
Thank you @Jenny. I see a high priority on the GitHub issue, do you have any idea of the timeline to fix this bug? I'm working on the migration from Prefect 1 to Prefect 2.
j
Hi Florent - it's actively being worked on. There's already a fix in our prefect-kubernetes repo and an upcoming release which will update the helm chart. I also had success with adding a name in the work pool base job template as in my screenshot.
j
To close the loop on this, we made a release a few hours ago that resolves this issue: https://github.com/PrefectHQ/prefect/releases/tag/2.10.16. Please update your Kubernetes worker using the Helm chart to version 2023.06.20.
f
I just updated and now everything works, thank you for your help! 🙏