# ask-marvin
@Marvin what is the main cause of the following Kubernetes error when a flow is running on a Prefect workpool?
```
kubernetes.client.exceptions.ApiException: (410) Reason: Expired: too old resource version: 6563880465 (6565397753)
```
thought for 45.3 seconds
Short answer: it means the Kubernetes watch your worker is using fell behind and tried to resume with an expired `resourceVersion`. The API server had already compacted/rotated past that version (due to etcd compaction or lots of updates), so it returns 410 Gone: "too old resource version."

Typical triggers:
- High churn in the watched resources (e.g., many Pod/Job updates/deletes)
- A long-lived or briefly disconnected watch that resumes too far behind
- API server watch cache/etcd compaction rolling past the requested version

This isn't Prefect-specific and usually isn't fatal: the client should re-list and re-establish the watch (a minimal sketch of that recovery pattern is shown below). If you're seeing it frequently, reduce churn in the namespace, narrow your label selectors, or ensure stable network connectivity for the worker.

References:
- Kubernetes: Efficient detection of changes
- Kubernetes: Watching resources
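For illustration, here's a minimal sketch (not Prefect's code) of that recovery pattern using the official Kubernetes Python client: list to get a fresh `resourceVersion`, resume the watch from it, and re-list whenever a 410 comes back. The namespace and label selector are placeholders, and credentials are assumed to come from a kubeconfig or in-cluster service account.

```python
from kubernetes import client, config, watch
from kubernetes.client.exceptions import ApiException

config.load_kube_config()  # or config.load_incluster_config() inside a pod
batch_v1 = client.BatchV1Api()

namespace = "prefect"                # placeholder namespace
label_selector = "app=my-flow-runs"  # placeholder label selector

while True:
    # A fresh LIST gives a current resourceVersion to resume the watch from.
    jobs = batch_v1.list_namespaced_job(namespace, label_selector=label_selector)
    resource_version = jobs.metadata.resource_version
    try:
        for event in watch.Watch().stream(
            batch_v1.list_namespaced_job,
            namespace,
            label_selector=label_selector,
            resource_version=resource_version,
            timeout_seconds=300,
        ):
            job = event["object"]
            print(event["type"], job.metadata.name)
    except ApiException as exc:
        if exc.status == 410:
            # resourceVersion expired (etcd compaction / high churn):
            # loop back, re-list, and re-establish the watch.
            continue
        raise
```

Re-listing before watching, rather than retrying with the stale version, is the recovery the Kubernetes "Efficient detection of changes" docs describe for a 410, which is why the error is usually transient.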
@Marvin this is the full error stack
```
An error occurred while monitoring flow run. The flow run will not be marked as failed, but an issue may have occurred.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect/workers/base.py", line 896, in _submit_run_and_capture_errors
    result = await self.run(
             ^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 599, in run
    status_code = await run_sync_in_worker_thread(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect/utilities/asyncutils.py", line 95, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
           ^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
             ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 1028, in _watch_job
    for event in self._job_events(
  File "/usr/local/lib/python3.11/site-packages/kubernetes/watch/watch.py", line 195, in stream
    raise client.rest.ApiException(
kubernetes.client.exceptions.ApiException: (410)
Reason: Expired: too old resource version: 6563880465 (6565397753)
```
thought for 27.5 seconds
Thanks for the stack trace; that pinpoints it.

Root cause:
- Your Prefect Kubernetes worker is "watching" the Kubernetes Job for your flow run.
- The watch tries to resume from a stored `resourceVersion`, but the API server has already compacted past that version (etcd compaction / high update churn), so it returns 410 Gone: "Expired: too old resource version".
- This is normal Kubernetes watch behavior and not a failure of your flow/pod.

What it means for your run:
- The flow/pod keeps running; this only affects the worker's ability to monitor via the current watch stream.
- Prefect logs "An error occurred while monitoring flow run… not marked as failed". Monitoring, log streaming, and cancellation visibility may be briefly affected until the client re-lists and re-establishes the watch.

Why it happens more often:
- High churn in the namespace (many Pod/Job updates/deletes)
- Long-lived watches or brief network interruptions causing the client to fall behind
- API server/etcd compaction advancing the watch cache beyond the stored `resourceVersion`

Mitigations you can try:
- Narrow the worker's watch scope with specific label selectors in your work pool's job template so it only watches the Jobs it creates.
- Reduce churn in the namespace (separate busy system components into different namespaces or tune controllers that generate lots of events).
- Ensure stable connectivity from the worker to the API server (avoid network interruptions, long GC pauses, etc.).
- Upgrade to the latest `prefect` and `prefect-kubernetes`; there have been improvements in watch handling over time.

References:
- Kubernetes: Efficient detection of changes
- Kubernetes: Watching resources
- Related Prefect discussion (2.x but same underlying cause): Prefect issue: 410 Expired watch behavior

If you're seeing this frequently, share your Prefect and `prefect-kubernetes` versions and a rough sense of namespace churn (one way to gauge it is sketched below), and we can suggest more targeted tweaks.
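If you want a quick, illustrative (non-Prefect) way to get that sense of churn, you can count the recent core events in the worker's namespace by resource kind with the Kubernetes Python client. The "prefect" namespace below is a placeholder:

```python
from collections import Counter

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
core_v1 = client.CoreV1Api()

# Count the events currently retained in the namespace, grouped by the kind
# of object they refer to (Pod, Job, ...). A large Pod/Job count suggests the
# kind of update churn that makes 410s more likely.
events = core_v1.list_namespaced_event("prefect")
by_kind = Counter(e.involved_object.kind for e in events.items)
print(by_kind.most_common())  # e.g. [('Pod', 412), ('Job', 87), ...]
```

Events are only a proxy for churn (the watch falls behind on object updates, not on events as such), but they're an easy first signal to share.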
@Marvin what type of resource would the worker or flow run be needing to access when receiving the error?
thought for 76.9 seconds
It's the Kubernetes Job (batch/v1) that Prefect created for your flow run. In your stack trace the error occurs in `prefect_kubernetes.worker._watch_job` while iterating `self._job_events`, which uses a `Watch().stream(...)` against the Job resource. The watch is resuming with an old `resourceVersion` for that Job list, so the API server returns 410 Gone.

Notes:
- Prefect may also interact with Pods for log streaming, but this particular 410 is from the Job watch.
- Job docs: Kubernetes Job