<@ULVA73B9P> I had 2 prefect jobs that crashed and...
# ask-marvin
@Marvin I had two Prefect jobs that crashed, and afterwards the pods were never created. This is the error I got from the first crash (pasted below). My setup: Prefect for ingesting the data, and StackGres on Kubernetes running on Azure. How could I solve my issue?
```
Job 'subtle-porpoise-knwg7' was removed.
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/site-packages/prefect_kubernetes/worker.py", line 1218, in _get_job
    job = await batch_client.read_namespaced_job(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 192, in __call_api
    raise e
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
    response_data = await self.request(
                    ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 212, in GET
    return (await self.request("GET", url,
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/kubernetes_asyncio/client/rest.py", line 206, in request
    raise ApiException(http_resp=r)
kubernetes_asyncio.client.exceptions.ApiException: (504)
Reason: Gateway Timeout
HTTP response headers: <CIMultiDictProxy('Audit-Id': '90838263-f305-49bc-bd98-53b2e810f0ca', 'Cache-Control': 'no-cache, private', 'Date': 'Thu, 04 Sep 2025 08:07:54 GMT', 'Content-Length': '152', 'Content-Type': 'text/plain; charset=utf-8')>
HTTP response body: {"metadata":{},"status":"Failure","message":"Timeout: request did not complete within the allotted timeout","reason":"Timeout","details":{},"code":504}
```
Marvin
thought for 360.4 seconds
Thanks for the details — I can help you triage this. A few notes up front:
- I'm assuming you're on Prefect 3.x with the Kubernetes worker (the `prefect-kubernetes` integration).
- The message "Job '<name>' was removed." generally means the Kubernetes Job resource no longer exists when the worker/observer tries to read it (deleted by TTLAfterFinished, manual cleanup, or after a failure). On its own it's not fatal, but in your case it's paired with a 504 from the Kubernetes API server, which points to a connectivity/API server issue.
- A 504 during `read_namespaced_job` typically comes from the Kubernetes API server (or something in between, e.g., Azure networking) not responding in time.

Given your symptoms (two jobs crashed, then no pods created and 504s), the likely culprits are:
- Intermittent connectivity or timeouts between your worker and the AKS API server (LB idle timeout, private cluster networking, firewall, or transient AKS control plane load)
- API server under load/slow (etcd pressure, many CRDs/controllers like StackGres adding load)
- RBAC or namespace issues are less likely here (you'd usually see 403/404, not 504), but we should still verify

Quick actions to unblock

1) Restart the Kubernetes worker
- If the worker lost a stable connection to the API server, a restart often clears it up. If you run the worker outside the cluster, consider running it inside AKS (in-cluster) to remove LB/firewall hops.
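For example, if your worker runs in-cluster as a Deployment, a rolling restart is usually enough — a sketch, where `prefect-worker` and the `prefect` namespace are placeholders for whatever names you used:

```bash
# Restart the Prefect worker Deployment so it re-establishes its API connections.
kubectl rollout restart deployment/prefect-worker -n prefect
kubectl rollout status deployment/prefect-worker -n prefect

# Tail the worker logs to confirm it reconnects and picks up new flow runs.
kubectl logs deployment/prefect-worker -n prefect -f --tail=100
```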
2) Upgrade to the latest versions
- Upgrade both Prefect and `prefect-kubernetes` to pick up network robustness improvements: `pip install -U prefect prefect-kubernetes`
- Docs: Kubernetes worker guide and Workers concepts

3) If the worker runs outside the cluster, check Azure idle timeouts
- Azure Load Balancer idle timeout can drop long-lived HTTP connections, causing timeouts. Increase the idle timeout (e.g., 15–30 minutes) or run the worker in-cluster to avoid this path.
- Docs: Azure LB idle timeout
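If you do decide to move the worker in-cluster, here's a hedged sketch using the PrefectHQ Helm chart — chart values differ between versions, so review `helm show values` and set your work pool / API credentials there rather than trusting specific keys from me:

```bash
# Sketch: run the Prefect worker inside AKS via Helm, so worker -> API-server
# traffic never leaves the cluster. Names and values below are placeholders.
helm repo add prefecthq https://prefecthq.github.io/prefect-helm
helm repo update
helm show values prefecthq/prefect-worker > worker-values.yaml   # edit work pool, API URL/key
helm install prefect-worker prefecthq/prefect-worker -n prefect --create-namespace -f worker-values.yaml
```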
4) Verify AKS control plane/API server health
- From a place that can reach the API server (e.g., a debugging pod in the cluster):
  - `kubectl get --raw /readyz?verbose`
  - `kubectl get events -A --sort-by=.metadata.creationTimestamp | tail -n 200` to see recent API or admission failures
- If you observe periodic slowness/timeouts across the board (kubectl also stalls), this is an AKS control plane/API server load issue.
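If the timeouts are intermittent, a simple repeated probe makes it easier to see whether the API server itself is stalling — a sketch; adjust the count and interval as you like:

```bash
# Probe the API server health endpoint a few times and time each call.
# Consistently slow or failing calls point at the control plane / networking,
# not at Prefect.
for i in $(seq 1 10); do
  time kubectl get --raw /livez > /dev/null
  sleep 5
done
```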
5) Check for Job creation or cleanup behavior
- Verify whether Jobs are being created at all when new flow runs start:
  - `kubectl get jobs -n <your-namespace>`
  - `kubectl get events -n <your-namespace> --sort-by=.metadata.creationTimestamp | tail -n 100`
- If Jobs are immediately deleted after finishing/failing, look for a very low `ttlSecondsAfterFinished` on your job spec. TTL docs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs
- Prefect does clean up jobs it creates; if you're debugging, you can temporarily keep them around by turning off cleanup/TTL in your job template (share your job template if you'd like me to review).
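A few quick checks for this step, all sketches with placeholder names: watch whether a Job ever appears when you trigger a fresh run, print each Job's TTL, and inspect the work pool's configuration for cleanup/TTL settings.

```bash
# Watch Jobs and Pods appear (or not) while a new flow run starts.
kubectl get jobs,pods -n <your-namespace> -w

# Show ttlSecondsAfterFinished for every Job in the namespace (blank = no TTL set).
kubectl get jobs -n <your-namespace> \
  -o custom-columns=NAME:.metadata.name,TTL:.spec.ttlSecondsAfterFinished

# Inspect the Kubernetes work pool, including its base job template.
# "<your-work-pool>" is a placeholder for your pool's name.
prefect work-pool inspect "<your-work-pool>"
```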
6) Sanity-check RBAC and namespace
- Make sure the worker's service account can read/create Jobs/Pods in the target namespace:
  - `kubectl auth can-i --as=system:serviceaccount:<ns>:<sa-name> get jobs -n <ns>`
  - `kubectl auth can-i --as=system:serviceaccount:<ns>:<sa-name> create jobs -n <ns>`
- And the same for pods: get/create/list/watch.
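To cover all the verb/resource combinations in one go, a small loop over `kubectl auth can-i` works — a sketch; substitute the `<ns>` and `<sa-name>` placeholders before running:

```bash
# Check every verb the worker needs against both jobs and pods.
for resource in jobs pods; do
  for verb in get list watch create; do
    printf '%s %s: ' "$verb" "$resource"
    kubectl auth can-i --as=system:serviceaccount:<ns>:<sa-name> "$verb" "$resource" -n <ns>
  done
done
```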
7) Resource pressure/quota
- Ensure the namespace is not constrained by Quotas/LimitRanges that would prevent pod creation:
  - `kubectl get resourcequota -n <ns>`
  - `kubectl describe quota -n <ns>`
  - `kubectl get pods -n <ns>` to see if pods are Pending due to insufficient resources
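If you do find Pending pods, their events usually say exactly what is blocking them (insufficient CPU/memory, quota exceeded, unschedulable nodes) — a sketch, with `<ns>` and the pod name as placeholders:

```bash
# List only Pending pods, then read the scheduler's events for one of them.
kubectl get pods -n <ns> --field-selector=status.phase=Pending
kubectl describe pod <pending-pod-name> -n <ns> | tail -n 20
```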
Why the specific messages matter
- "Job '<name>' was removed.": Prefect's Kubernetes observer/worker tried to track the Job but it was gone. That's expected if a Job is cleaned up, but combined with your 504 it's a sign the worker couldn't consistently talk to the API or the Job was deleted quickly.
- 504 Gateway Timeout: This is almost always a networking/API server responsiveness issue. In AKS, that can be due to control-plane load or idle connections being dropped by Azure networking if the worker is not in-cluster.

If you can share a bit more, I can tailor the fix:
- Prefect version and prefect-kubernetes version
- How/where the worker is running (inside AKS vs external), namespace, and service account
- Whether your AKS cluster is private and any network policies/firewalls in play
- A snippet of your Kubernetes worker job configuration (image, namespace, service_account_name, any TTL or cleanup settings)
- Whether `kubectl` from inside the cluster also experiences intermittent timeouts
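Most of that can be collected in one go — a sketch, where the `prefect-worker` deployment and `prefect` namespace are placeholders for your own names:

```bash
# Versions of Prefect and the Kubernetes integration on the worker's image/host.
prefect version
pip show prefect-kubernetes

# Cluster-side context: worker logs, network policies, and the service account in use.
kubectl logs deployment/prefect-worker -n prefect --tail=200
kubectl get networkpolicies -n prefect
kubectl get deployment prefect-worker -n prefect -o jsonpath='{.spec.template.spec.serviceAccountName}'
```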
References
- Prefect 3 Kubernetes worker: How to run on Kubernetes
- Workers overview: Workers concepts
- Kubernetes TTL for finished Jobs: https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs
- Azure LB idle timeout: Idle timeout on Azure Load Balancer

Happy to dig into logs or config snippets if you paste them here.