<@ULVA73B9P> I am running into this issue on long ...
# ask-marvin
z
@Marvin I am running into this issue on long running jobs, how do I fix it? Is there a setting to have the watch process restart?
Copy code
195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1837988160 (1838538580)
m
It looks like you are encountering a
410 Expired
error when using Kubernetes with Prefect, which is related to a resource version being too old. This error typically occurs when the client-side watch mechanism loses sync with the Kubernetes API server, often due to long-running connections. To resolve this issue, consider the following options:
1. Increase the Resync Period: You might be able to modify the resync period of the Kubernetes informer to prevent watches from timing out.
2. Automatically Handle Restarts: Implement logic to handle the reconnection automatically in your Kubernetes client or Prefect task. This can involve catching this specific exception and re-establishing the connection (a rough sketch follows below).
3. Check Network Policies: Ensure there are no network policies causing timeouts.
4. Prefect Version Settings: Depending upon your Prefect version, there might be configurations available. However, I couldn't find specific guidance in the Prefect documentation without exact queries.
If you could confirm your Prefect version (e.g., 2.x or 3.x), I'll be happy to dive deeper into the respective documentation for any potential settings or recommended practices! Networking issues can sometimes be as pesky as Marvin having a bad day. 😏
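For what it's worth, a minimal sketch of what option 2 could look like using the sync kubernetes client directly. This is illustrative only, not the prefect-kubernetes worker code; `batch_v1` and `namespace` are placeholders:

```python
# Illustrative sketch only: restart a job watch after a 410 "Expired" error.
# Not the prefect-kubernetes implementation; `namespace` is a placeholder.
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside a cluster
batch_v1 = client.BatchV1Api()
namespace = "default"

while True:
    w = watch.Watch()
    try:
        for event in w.stream(batch_v1.list_namespaced_job, namespace=namespace):
            print(event["type"], event["object"].metadata.name)
    except client.exceptions.ApiException as exc:
        if exc.status == 410:
            continue  # resource version too old; start a fresh watch
        raise
    else:
        break  # watch ended normally
```

Whether anything like this retry behavior is exposed as a Prefect setting is exactly the open question in the rest of this thread.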
n
ah man this pesky 410 resource expired thing 🙄 we should handle this nowadays, what version of
prefect-kubernetes
do you have?
z
Myeah I noticed it's been patched, running 2.19.
Any setting to increase the resync period?
n
hrm how do you mean? cc @Kevin Grismore since you were looking at this
z
iunno, going off what Marvin said in his first point, is the setting to increase the timeout of the resync period exposed to the interface somehow? Or to word this differently, can I change the job parameters to increase this resync period?
Looking at the 2.19.0 release, it does have the patch for the 410 codes here.
n
do you have more of the trace?
z
Copy code
File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 1031, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1837988160 (1838538580)
n
hrmmm that seems like it should get caught? and this is definitely a version after that fix was released?
z
Yep, agreed. I don't think a minor version upgrade would help but I might give that a go as I have no leads for now
k
yeah, I'm not confident that attempt at catching the error ever actually worked
I tried making a simple example where a generator will eventually raise an error, then wrapped the call that returns that generator in a try-except, and the error was not caught. Only try-excepting the place where the values are actually iterated over catches the error.
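To make that concrete, here's a stripped-down, hypothetical example (no Kubernetes involved) showing why wrapping the call that returns the generator doesn't catch anything:

```python
# Hypothetical illustration: exceptions from a generator are raised at
# iteration time, not when the generator object is created.
def failing_events():
    yield "first"
    raise RuntimeError("boom")

def get_events():
    try:
        return failing_events()  # creating the generator never raises
    except RuntimeError:
        print("never reached")

events = get_events()
try:
    for event in events:  # the exception surfaces here, during iteration
        print(event)
except RuntimeError:
    print("caught at the iteration site")
```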
z
so - the error raised is from the sync k8s client, but the api exception is imported from the async client.
👀 1
I would've expected them to be interchangeable, but the implementation seems to leave something to be desired
Copy code
>>> from kubernetes_asyncio.client.exceptions import ApiException
>>> from kubernetes_asyncio.client.exceptions import ApiException as exception_asyncio
>>> from kubernetes.client.exceptions import ApiException as exception_sync
>>> exception_sync == exception_asyncio
False
>>> def foobar():
...     raise exception_sync("test")
...
>>> foobar()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in foobar
kubernetes.client.exceptions.ApiException: (test)
Reason: None

>>> try:
...     foobar()
... except exception_asyncio:
...     print('caught')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in foobar
kubernetes.client.exceptions.ApiException: (test)
Reason: None

>>> try:
...     foobar()
... except exception_sync:
...     print('caught')
...
caught
>>>
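Given that, one possible shape of a fix (my assumption, not the actual patch) is to catch both exception classes at the iteration site:

```python
# Hypothetical sketch: catch the 410 regardless of which client raised it.
from kubernetes.client.exceptions import ApiException as SyncApiException
from kubernetes_asyncio.client.exceptions import ApiException as AsyncApiException

def foobar():
    # same as the REPL snippet above: raises the *sync* client's exception
    raise SyncApiException("test")

try:
    foobar()
except (SyncApiException, AsyncApiException):
    # catching both classes works whichever client raised the error
    print("caught")
```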
n
doh, thats a great observation
want to open a PR / issue?
z
I have max one of those every quarter.
😄 1
for the prefect version i'm running, it is
ah wait
Yep Kevin is right - async client is in main, not 2.19
🐐 1
back to the drawing board
Alright well I'll cut bait - is investigation of this issue somewhere on the roadmap?
n
issues are the source of truth in that regard, there doesn't seem to be one so if you could open one that would be useful for tracking this
👍 1
k
I would encourage you to upgrade to the latest k8s worker version. I haven't heard any reports of 410s from people on that one yet
upvote 1
👍 1
s
following up on this thread as well, I'm getting the same error on a long-running flow with a
2.19.3
worker
Copy code
An error occurred while monitoring flow run '5247ca1b-db07-45d6-b86b-3735746e5848'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/workers/base.py", line 908, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/site-packages/prefect_kubernetes/worker.py", line 602, in run
    status_code = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 136, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect_kubernetes/worker.py", line 1031, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1651317987 (1651835806)
is the recommendation here to still update to the latest worker version?
d
@skrawczyk are you still seeing this issue? I started seeing it when I upgraded to the latest version, but only with one flow, not the others.
s
@Dennis L no our issue was resolved after talking to prefect support.
Copy code
the prefect-kubernetes 0.4.6 package should have all of the current bug fixes and should be available on any image running prefect 2.20.11 or later, i.e. 2.20.11-python3.10-kubernetes
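For anyone double-checking their environment, this is one way to confirm which prefect-kubernetes version is actually installed (standard library only, nothing Prefect-specific assumed):

```python
# Check the installed prefect-kubernetes version inside the worker image/venv.
from importlib.metadata import version

print(version("prefect-kubernetes"))  # expect >= 0.4.6 per the note above
```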