# ask-marvin
@Marvin what is the main cause of the following Kubernetes error when a flow is running on a Prefect workpool?
```
kubernetes.client.exceptions.ApiException: (410) Reason: Expired: too old resource version: 6563880465 (6565397753)
```
thought for 45.3 seconds
Short answer: it means the Kubernetes watch your worker is using fell behind and tried to resume with an expired `resourceVersion`. The API server had already compacted/rotated past that version (due to etcd compaction or lots of updates), so it returns 410 Gone: "too old resource version."

Typical triggers:
- High churn in the watched resources (e.g., many Pod/Job updates/deletes)
- A long-lived or briefly disconnected watch that resumes too far behind
- API server watch cache/etcd compaction rolling past the requested version

This isn't Prefect-specific and usually isn't fatal: the client should re-list and re-establish the watch (a minimal sketch of that recovery pattern is shown below). If you're seeing it frequently, reduce churn in the namespace, narrow your label selectors, or ensure stable network connectivity for the worker.

References:
- Kubernetes: Efficient detection of changes
- Kubernetes: Watching resources
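For illustration, here's a minimal sketch (not Prefect's code) of that recovery pattern using the official Kubernetes Python client: list to get a fresh `resourceVersion`, resume the watch from it, and re-list whenever a 410 comes back. The namespace and label selector are placeholders, and credentials are assumed to come from a kubeconfig or in-cluster service account.

```python
from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException

config.load_kube_config()  # or config.load_incluster_config() inside a pod
batch_v1 = client.BatchV1Api()

namespace = "prefect"                # placeholder namespace
label_selector = "app=my-flow-runs"  # placeholder label selector

while True:
    # A fresh LIST gives a current resourceVersion to resume the watch from.
    jobs = batch_v1.list_namespaced_job(namespace, label_selector=label_selector)
    resource_version = jobs.metadata.resource_version
    try:
        for event in watch.Watch().stream(
            batch_v1.list_namespaced_job,
            namespace,
            label_selector=label_selector,
            resource_version=resource_version,
            timeout_seconds=300,
        ):
            job = event["object"]
            print(event["type"], job.metadata.name)
    except ApiException as exc:
        if exc.status == 410:
            # resourceVersion expired (etcd compaction / high churn):
            # loop back, re-list, and re-establish the watch.
            continue
        raise
```

Re-listing before watching, rather than retrying with the stale version, is the recovery the Kubernetes "Efficient detection of changes" docs describe for a 410, which is why the error is usually transient.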
@Marvin this is the full error stack
```
An error occurred while monitoring flow run. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 599, in run
    status_code = await run_sync_in_worker_thread(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 95, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 1028, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 6563880465 (6565397753)
```
thought for 27.5 seconds
Thanks for the stack trace; that pinpoints it.

Root cause:
- Your Prefect Kubernetes worker is "watching" the Kubernetes Job for your flow run.
- The watch tries to resume from a stored `resourceVersion`, but the API server has already compacted past that version (etcd compaction / high update churn), so it returns 410 Gone: "Expired: too old resource version".
- This is normal Kubernetes watch behavior and not a failure of your flow/pod.

What it means for your run:
- The flow/pod keeps running; this only affects the worker's ability to monitor via the current watch stream.
- Prefect logs "An error occurred while monitoring flow run… not marked as failed". Monitoring, log streaming, and cancellation visibility may be briefly affected until the client re-lists and re-establishes the watch.

Why it happens more often:
- High churn in the namespace (many Pod/Job updates/deletes)
- Long-lived watches or brief network interruptions causing the client to fall behind
- API server/etcd compaction advancing the watch cache beyond the stored `resourceVersion`

Mitigations you can try:
- Narrow the worker's watch scope with specific label selectors in your work pool's job template so it only watches the Jobs it creates.
- Reduce churn in the namespace (separate busy system components into different namespaces or tune controllers that generate lots of events).
- Ensure stable connectivity from the worker to the API server (avoid network interruptions, long GC pauses, etc.).
- Upgrade to the latest `prefect` and `prefect-kubernetes`; there have been improvements in watch handling over time.

References:
- Kubernetes: Efficient detection of changes
- Kubernetes: Watching resources
- Related Prefect discussion (2.x but same underlying cause): Prefect issue: 410 Expired watch behavior

If you're seeing this frequently, share your Prefect and `prefect-kubernetes` versions and a rough sense of namespace churn (one way to gauge it is sketched below), and we can suggest more targeted tweaks.
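If you want a quick, illustrative (non-Prefect) way to get that sense of churn, you can count the recent core events in the worker's namespace by resource kind with the Kubernetes Python client. The "prefect" namespace below is a placeholder:

```python
from collections import Counter

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core_v1 = client.CoreV1Api()

# Count the events currently retained in the namespace, grouped by the kind
# of object they refer to (Pod, Job, ...). A large Pod/Job count suggests the
# kind of update churn that makes 410s more likely.
events = core_v1.list_namespaced_event("prefect")
by_kind = Counter(e.involved_object.kind for e in events.items)
print(by_kind.most_common())  # e.g. [('Pod', 412), ('Job', 87), ...]
```

Events are only a proxy for churn (the watch falls behind on object updates, not on events as such), but they're an easy first signal to share.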
@Marvin what type of resource would the worker or flow run be needing to access when receiving the error?
thought for 76.9 seconds
It's the Kubernetes Job (batch/v1) that Prefect created for your flow run. In your stack trace the error occurs in `prefect_kubernetes.worker._watch_job` while iterating `self._job_events`, which uses a `Watch().stream(...)` against the Job resource. The watch is resuming with an old `resourceVersion` for that Job list, so the API server returns 410 Gone.

Notes:
- Prefect may also interact with Pods for log streaming, but this particular 410 is from the Job watch.
- Job docs: Kubernetes Job