Scott Zelenka
07/29/2020, 3:31 PM
The agent container is failing liveness probes, and the Pod is being killed.
Warning Unhealthy 43s (x2 over 83s) kubelet, .internal Liveness probe failed: Get http://10.0.28.4:8080/api/health : dial tcp 10.0.28.4:8080: connect: connection refused
I'm chatting with AWS right now, and they're asserting that because kubelet is attempting to connect on port 8080, the Deployment YAML for the agent should be exposing that port. But I don't have this problem when deploying the same YAML on GCP, OpenShift, or bare-metal K8s.
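For context, a minimal sketch of what AWS support is suggesting in the agent Deployment (names are illustrative; the probe values match the probe reported by kubectl describe later in the thread):
containers:
  - name: agent
    ports:
      - containerPort: 8080   # declares the port; by itself this does not change what the process binds to
    livenessProbe:
      httpGet:
        path: /api/health
        port: 8080
      initialDelaySeconds: 40
      periodSeconds: 40
      failureThreshold: 2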
josh
07/29/2020, 3:38 PM
- name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
  value: "http://:8080"
Scott Zelenka
07/29/2020, 3:46 PM
netstat -antlp | grep 8080
tcp 0 0 127.0.0.1:8080 0.0.0.0:* LISTEN 1/python
So perhaps the other K8s deployments are issuing the liveness probe to localhost, while AWS' EKS uses their VPC subnets for Pods and attempts to reach the actual Pod IP on port 8080. Because the service doesn't appear to be listening on all addresses, the connection is refused. Should PREFECT__CLOUD__AGENT__AGENT_ADDRESS be set with the IP from the Pod?
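To double-check that theory against the running agent, something like the following would show the bind address from outside the container (the Deployment name prefect-agent is an assumption; netstat is already available in the image, as shown above):
# run netstat inside the agent container via kubectl (kubectl 1.18+ accepts deploy/ targets)
kubectl exec deploy/prefect-agent -c agent -- netstat -antlp | grep 8080
# 127.0.0.1:8080 -> only reachable via localhost inside the Pod
# 0.0.0.0:8080   -> reachable on the Pod IP, which is what EKS's kubelet probes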
josh
07/29/2020, 3:50 PM
- name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
(from https://kubernetes.io/docs/tasks/inject-data-application/environment-variable-expose-pod-information/)
I also wonder if you can set the liveness probe host directly:
livenessProbe:
  httpGet:
    host: pod_ip_here
    path: /api/health
    port: 8080
  initialDelaySeconds: 40
  periodSeconds: 40
  failureThreshold: 2
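A related option, combining the Downward API reference above with Kubernetes' $(VAR) dependent-variable expansion so the advertised address carries the Pod IP explicitly (a sketch, not something verified in this thread; POD_IP must be declared before the variable that references it):
- name: POD_IP
  valueFrom:
    fieldRef:
      fieldPath: status.podIP
- name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
  value: "http://$(POD_IP):8080"  # $(POD_IP) is expanded by Kubernetes at container start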
Scott Zelenka
07/29/2020, 4:05 PM
I checked PREFECT__CLOUD__AGENT__AGENT_ADDRESS on the other K8s clusters, and it seems it was set to http://:8080, which appears to listen on all addresses of the Pod. For whatever reason, when I ran prefect agent install kubernetes ... for the AWS deployment, it changed that value to http://127.0.0.1:8080. Manually changing this to http://0.0.0.0:8080 allowed it to pass the liveness probes!!
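For completeness, the working configuration described above, as it would appear in the agent container's env block (a minimal sketch; the rest of the manifest is unchanged):
- name: PREFECT__CLOUD__AGENT__AGENT_ADDRESS
  value: "http://0.0.0.0:8080"  # bind on all interfaces so kubelet can reach the Pod IP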
Sagun Garg
12/21/2020, 7:47 AM
Regarding PREFECT__CLOUD__AGENT__AGENT_ADDRESS, can you please share the commands you used on the CLI to make this happen in the YAML file?
Name: prefect-agent-7bcdb4f975-pftnw
Namespace: default
Priority: 2000001000
Priority Class Name: system-node-critical
Node: fargate-ip-10-0-231-193.ap-southeast-1.compute.internal/10.0.231.193
Start Time: Mon, 21 Dec 2020 14:37:38 +0800
Labels: app=prefect-agent
eks.amazonaws.com/fargate-profile=fp-default
pod-template-hash=7bcdb4f975
Annotations: CapacityProvisioned: 0.25vCPU 0.5GB
Logging: LoggingDisabled: LOGGING_CONFIGMAP_NOT_FOUND
kubernetes.io/psp: eks.privileged
Status: Running
IP: 10.0.231.193
IPs:
IP: 10.0.231.193
Controlled By: ReplicaSet/prefect-agent-7bcdb4f975
Containers:
agent:
Container ID: containerd://9becbcb74d4ebc66c0c93a0fd40f5e3a15ee0c276831457c1514c752188b5c21
Image: prefecthq/prefect:0.14.0-python3.6
Image ID: docker.io/prefecthq/prefect@sha256:3ebe46f840d46044b9521c9380aa13bd0755670d06e2fdfe0e23c69de5a78fc0
Port: <none>
Host Port: <none>
Command:
/bin/bash
-c
Args:
prefect agent kubernetes start
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 21 Dec 2020 16:07:26 +0800
Finished: Mon, 21 Dec 2020 16:08:48 +0800
Ready: False
Restart Count: 26
Limits:
cpu: 100m
memory: 128Mi
Requests:
cpu: 100m
memory: 128Mi
Liveness: http-get http://:8080/api/health delay=40s timeout=1s period=40s #success=1 #failure=2
Environment:
PREFECT__CLOUD__AGENT__AUTH_TOKEN: gNZkGzQJohYgunuMh_okKw
PREFECT__CLOUD__API: https://api.prefect.io
NAMESPACE: default
IMAGE_PULL_SECRETS:
PREFECT__CLOUD__AGENT__LABELS: []
JOB_MEM_REQUEST:
JOB_MEM_LIMIT:
JOB_CPU_REQUEST:
JOB_CPU_LIMIT:
IMAGE_PULL_POLICY:
SERVICE_ACCOUNT_NAME:
PREFECT__BACKEND: cloud
PREFECT__CLOUD__AGENT__AGENT_ADDRESS: http://:8080
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-4tqz5 (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
default-token-4tqz5:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-4tqz5
Optional: false
QoS Class: Guaranteed
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning Unhealthy 4m43s (x27 over 94m) kubelet Liveness probe failed: Get http://10.0.231.193:8080/api/health: dial tcp 10.0.231.193:8080: connect: connection refused
Warning BackOff 18s (x270 over 90m) kubelet Back-off restarting failed container
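Re: the CLI question above, a hedged sketch of one way to reproduce the manual edit Scott describes when generating the manifest (flag support for prefect agent install kubernetes varies across Prefect 0.x releases, and the sed pattern assumes the generated file contains the 127.0.0.1 address):
# generate the agent manifest, force the agent to listen on all interfaces, then apply it
prefect agent install kubernetes > prefect-agent.yaml
sed -i 's|http://127.0.0.1:8080|http://0.0.0.0:8080|' prefect-agent.yaml
kubectl apply -f prefect-agent.yaml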