Daniel Topor
09/04/2025, 12:59 PM
Job 'subtle-porpoise-knwg7' was removed.
Traceback (most recent call last):
File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 1218, in _get_job
job = await batch_client.read_namespaced_job(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
raise e
File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
response_data = await self.request(
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 212, in GET
return (await self.request("GET", url,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 206, in request
raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: <CIMultiDictProxy('Audit-Id': '90838263-f305-49bc-bd98-53b2e810f0ca', 'Cache-Control': 'no-cache, private', 'Date': 'Thu, 04 Sep 2025 08:07:54 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8')>
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
Marvin
09/04/2025, 1:05 PM
A 504 from read_namespaced_job typically comes from the Kubernetes API server (or something in between, e.g., Azure networking) not responding in time.
Given your symptoms (two jobs crashed, then no pods created and 504s), the likely culprits are:
- Intermittent connectivity or timeouts between your worker and the AKS API server (LB idle timeout, private cluster networking, firewall, or transient AKS control plane load)
- API server under load/slow (etcd pressure, many CRDs/controllers like StackGres adding load)
- RBAC or namespace issues are less likely here (you’d usually see 403/404, not 504), but we should still verify.
Quick actions to unblock
1) Restart the Kubernetes worker
- If the worker lost a stable connection to the API server, a restart often clears it up. If you run the worker outside the cluster, consider running it inside AKS (in-cluster) to remove LB/firewall hops.
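For example, if the worker runs in-cluster as a Deployment, a rolling restart is usually enough; the deployment name prefect-worker and namespace prefect below are placeholders for your setup:
# Restart the worker Deployment and confirm the new pod comes up healthy.
# "prefect-worker" and "prefect" are placeholders; use your actual names.
kubectl rollout restart deployment/prefect-worker -n prefect
kubectl rollout status deployment/prefect-worker -n prefect
# Then check the fresh pod's logs for API-server connection errors.
kubectl logs deployment/prefect-worker -n prefect --tail=100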
2) Upgrade to the latest versions
- Upgrade both Prefect and prefect-kubernetes to pick up network robustness improvements:
- pip install -U prefect prefect-kubernetes
- Docs: Kubernetes worker guide and Workers concepts
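A quick before/after check, run in the environment the worker actually uses:
# Show the currently installed versions
prefect version
pip show prefect prefect-kubernetes
# Upgrade both packages, then restart the worker so it loads the new code
pip install -U prefect prefect-kubernetes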
3) If the worker runs outside the cluster, check Azure idle timeouts
- Azure Load Balancer idle timeout can drop long-lived HTTP connections causing timeouts. Increase the idle timeout (e.g., 15–30 minutes) or run the worker in-cluster to avoid this path.
- Docs: Azure LB idle timeout
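If the cluster uses the AKS-managed Standard Load Balancer, here is a sketch with the Azure CLI; verify the flag and query path against your az version, and treat the resource group and cluster name as placeholders:
# Inspect the current outbound idle timeout (in minutes) on the managed load balancer
az aks show -g <resource-group> -n <cluster-name> \
  --query networkProfile.loadBalancerProfile.idleTimeoutInMinutes -o tsv
# Raise it, e.g. to 30 minutes (only applies to the managed Standard LB profile)
az aks update -g <resource-group> -n <cluster-name> --load-balancer-idle-timeout 30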
4) Verify AKS control plane/API server health
- From a place that can reach the API server (e.g., a debugging pod in the cluster):
- kubectl get --raw /readyz?verbose
- kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 200
(to see recent API or admission failures)
- If you observe periodic slowness/timeouts across the board (kubectl also stalls), this is an AKS control plane/API server load issue.
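To separate a one-off blip from sustained control plane slowness, a simple probe loop (run from a debugging pod or anywhere with cluster credentials) makes the pattern obvious:
# Hit the API server health endpoint every 30s; repeatedly slow or failing
# checks point at the control plane/network path rather than at Prefect.
while true; do
  date
  time kubectl get --raw /readyz > /dev/null
  sleep 30
done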
5) Check for Job creation or cleanup behavior
- Verify whether Jobs are being created at all when new flow runs start:
- kubectl get jobs -n <your-namespace>
- kubectl get events -n <your-namespace> --sort-by=.metadata.creationTimestamp | tail -n 100
- If Jobs are immediately deleted after finishing/failing, look for a very low ttlSecondsAfterFinished on your job spec. TTL docs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs
- Prefect does clean up jobs it creates; if you’re debugging, you can temporarily keep them around by turning off cleanup/TTL in your job template (share your job template if you’d like me to review).
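A quick way to check each Job's TTL in the namespace (substitute your namespace):
# List Jobs with their TTL; a very low value means finished/failed Jobs are
# deleted almost immediately, which makes post-mortem debugging harder.
kubectl get jobs -n <your-namespace> \
  -o custom-columns=NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished,SUCCEEDED:.status.succeeded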
6) Sanity-check RBAC and namespace
- Make sure the worker’s service account can read/create Jobs/Pods in the target namespace:
- kubectl auth can-i --as=system:serviceaccount:<ns>:<sa-name> get jobs -n <ns>
- kubectl auth can-i --as=system:serviceaccount:<ns>:<sa-name> create jobs -n <ns>
- And the same for pods: get/create/list/watch.
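For example, to run the check across both resources and all the relevant verbs (namespace and service account name are placeholders):
# Check the worker's service account permissions for Jobs and Pods
for resource in jobs pods; do
  for verb in get list watch create delete; do
    echo -n "$resource/$verb: "
    kubectl auth can-i --as=system:serviceaccount:<ns>:<sa-name> "$verb" "$resource" -n <ns>
  done
done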
7) Resource pressure/quota
- Ensure the namespace is not constrained by Quotas/LimitRanges that would prevent pod creation:
- kubectl get resourcequota -n <ns>
- kubectl describe quota -n <ns>
- kubectl get pods -n <ns>
(to see if pods are Pending due to insufficient resources)
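If pods do exist but sit in Pending, the scheduler's events usually name the reason (quota exceeded, insufficient CPU/memory, unsatisfiable node selector), e.g.:
# Show only Pending pods, then read the scheduling events for one of them
kubectl get pods -n <ns> --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n <ns> | tail -n 20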
Why the specific messages matter
Marvin
09/04/2025, 1:05 PM
The 504 body ("Timeout: request did not complete within the allotted timeout") is the Kubernetes API server timing out the request, which points at the control plane or the network path to it rather than at Prefect. A strong confirmation is if kubectl from inside the cluster also experiences intermittent timeouts.
References
- Prefect 3 Kubernetes worker: How to run on Kubernetes
- Workers overview: Workers concepts
- Kubernetes TTL for finished Jobs: TTL docs
- Azure LB idle timeout: Idle timeout on Azure Load Balancer
Happy to dig into logs or config snippets if you paste them here.