<@ULVA73B9P> I am running into this issue on long ...
# ask-marvin
z
@Marvin I am running into this issue on long running jobs, how do I fix it? Is there a setting to have the watch process restart?
Copy code
195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1837988160 (1838538580)
m
It looks like you are encountering a
410 Expired
error when using Kubernetes with Prefect, which is related to a resource version being too old. This error typically occurs when the client-side watch mechanism loses sync with the Kubernetes API server, often due to long-running connections. To resolve this issue, consider the following options:
1. Increase the Resync Period: You might be able to modify the resync period of the Kubernetes informer to prevent watches from timing out.
2. Automatically Handle Restarts: Implement logic to handle the reconnection automatically in your Kubernetes client or Prefect task. This can involve catching this specific exception and re-establishing the connection (a rough sketch follows below).
3. Check Network Policies: Ensure there are no network policies causing timeouts.
4. Prefect Version Settings: Depending upon your Prefect version, there might be configurations available. However, I couldn't find specific guidance in the Prefect documentation without exact queries.
If you could confirm your Prefect version (e.g., 2.x or 3.x), I'll be happy to dive deeper into the respective documentation for any potential settings or recommended practices! Networking issues can sometimes be as pesky as Marvin having a bad day. 😏
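For what it's worth, a minimal sketch of what option 2 could look like using the sync kubernetes client directly. This is illustrative only, not the prefect-kubernetes worker code; `batch_v1` and `namespace` are placeholders:

```python
# Illustrative sketch only: restart a job watch after a 410 "Expired" error.
# Not the prefect-kubernetes implementation; `namespace` is a placeholder.
from kubernetes import client, config, watch

config.load_incluster_config()  # or config.load_kube_config() outside a cluster
batch_v1 = client.BatchV1Api()
namespace = "default"

while True:
    w = watch.Watch()
    try:
        for event in w.stream(batch_v1.list_namespaced_job, namespace=namespace):
            print(event["type"], event["object"].metadata.name)
    except client.exceptions.ApiException as exc:
        if exc.status == 410:
            continue  # resource version too old; start a fresh watch
        raise
    else:
        break  # watch ended normally
```

Whether anything like this retry behavior is exposed as a Prefect setting is exactly the open question in the rest of this thread.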
n
ah man this pesky 410 resource expired thing 🙄 we should handle this nowadays, what version of
prefect-kubernetes
do you have?
z
Myeah I noticed it's been patched, running 2.19.
Any setting to increase the resync period?
n
hrm how do you mean? cc @Kevin Grismore since you were looking at this
z
iunno, going off what Marvin said in his first point, is the setting to increase the timeout of the resync period exposed to the interface somehow? Or to word this differently, can I change the job parameters to increase this resync period?
Looking at the 2.19.0 release, it does have the patch for the 410 codes here.
n
do you have more of the trace?
z
Copy code
File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 1031, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1837988160 (1838538580)
n
hrmmm that seems like it should get caught? and this is definitely a version after that fix was released?
z
Yep, agreed. I don't think a minor version upgrade would help but I might give that a go as I have no leads for now
k
yeah, I'm not confident that attempt at catching the error ever actually worked
I tried making a simple example where a generator will eventually raise an error, then wrapped the call that returns that generator in a try-except, and the error was not caught. Only try-excepting the place where the values are actually iterated over catches the error.
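To make that concrete, here's a stripped-down, hypothetical example (no Kubernetes involved) showing why wrapping the call that returns the generator doesn't catch anything:

```python
# Hypothetical illustration: exceptions from a generator are raised at
# iteration time, not when the generator object is created.
def failing_events():
    yield "first"
    raise RuntimeError("boom")

def get_events():
    try:
        return failing_events()  # creating the generator never raises
    except RuntimeError:
        print("never reached")

events = get_events()
try:
    for event in events:  # the exception surfaces here, during iteration
        print(event)
except RuntimeError:
    print("caught at the iteration site")
```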
z
so - the error raised is from the sync k8s client, but the api exception is imported from the async client.
👀 1
I would've expected them to be interchangeable, but the implementation seems to leave something to be desired
Copy code
>>> from kubernetes_asyncio.client.exceptions import ApiException
>>> from kubernetes_asyncio.client.exceptions import ApiException as exception_asyncio
>>> from kubernetes.client.exceptions import ApiException as exception_sync
>>> exception_sync == exception_asyncio
False
>>> def foobar():
...     raise exception_sync("test")
...
>>> foobar()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in foobar
kubernetes.client.exceptions.ApiException: (test)
Reason: None

>>> try:
...     foobar()
... except exception_asyncio:
...     print('caught')
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "<stdin>", line 2, in foobar
kubernetes.client.exceptions.ApiException: (test)
Reason: None

>>> try:
...     foobar()
... except exception_sync:
...     print('caught')
...
caught
>>>
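Given that, one possible shape of a fix (my assumption, not the actual patch) is to catch both exception classes at the iteration site:

```python
# Hypothetical sketch: catch the 410 regardless of which client raised it.
from kubernetes.client.exceptions import ApiException as SyncApiException
from kubernetes_asyncio.client.exceptions import ApiException as AsyncApiException

def foobar():
    # same as the REPL snippet above: raises the *sync* client's exception
    raise SyncApiException("test")

try:
    foobar()
except (SyncApiException, AsyncApiException):
    # catching both classes works whichever client raised the error
    print("caught")
```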
n
doh, thats a great observation
want to open a PR / issue?
z
I have max one of those every quarter.
😄 1
for the prefect version i'm running, it is
ah wait
Yep Kevin is right - async client is in main, not 2.19
🐐 1
back to the drawing board
Alright well I'll cut bait - is investigation of this issue somewhere on the roadmap?
n
issues are the source of truth in that regard, there doesn't seem to be one so if you could open one that would be useful for tracking this
👍 1
k
I would encourage you to upgrade to the latest k8s worker version. I haven't heard any reports of 410s from people on that one yet
upvote 1
👍 1
s
following up on this thread as well, I'm getting the same error on a long-running flow with a
2.19.3
worker
Copy code
An error occurred while monitoring flow run '5247ca1b-db07-45d6-b86b-3735746e5848'. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/workers/base.py", line 908, in _submit_run_and_capture_errors
    result = await self.run(
  File "/usr/local/lib/python3.10/site-packages/prefect_kubernetes/worker.py", line 602, in run
    status_code = await run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 136, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect_kubernetes/worker.py", line 1031, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/watch/watch.py", line 195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 1651317987 (1651835806)
is the recommendation here to still update to the latest worker version?
d
@skrawczyk are you still seeing this issue? I started seeing it when I upgraded to the latest version, but only with one flow, not the others.
s
@Dennis L no our issue was resolved after talking to prefect support.
Copy code
the prefect-kubernetes 0.4.6 package should have all of the current bug fixes and should be available on any image running prefect 2.20.11 or later, i.e. 2.20.11-python3.10-kubernetes
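For anyone double-checking their environment, this is one way to confirm which prefect-kubernetes version is actually installed (standard library only, nothing Prefect-specific assumed):

```python
# Check the installed prefect-kubernetes version inside the worker image/venv.
from importlib.metadata import version

print(version("prefect-kubernetes"))  # expect >= 0.4.6 per the note above
```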