Josh Greenhalgh

03/09/2021, 1:09 PM
Hey, can anyone help me understand this?
NAME                                     READY   STATUS    RESTARTS   AGE
prefect-agent-c58b946f9-9r59j            1/1     Running   280        21h
prefect-server-apollo-78c9b8cbfb-bd69r   1/1     Running   0          3d12h
prefect-server-graphql-875f7ddc-pntjp    1/1     Running   0          3d12h
prefect-server-hasura-7897f76bcf-mtphx   1/1     Running   0          3d12h
prefect-server-towel-6d9c9748f4-q9mrc    1/1     Running   0          3d12h
prefect-server-ui-55f4bcb597-mmz4c       1/1     Running   0          3d12h
The agent consistently restarts every 5 mins or so. Is this expected? If not, any idea how to solve it? I am using the output of
prefect agent kubernetes install
as the spec for the deployment

Mariia Kerimova

03/09/2021, 1:25 PM
Hello Josh! There are a couple of reasons that can trigger pod restarts. Can you provide the following information: What version of Prefect are you using? Can you describe the pod and share its events? (run
kubectl describe po prefect-agent-c58b946f9-9r59j
) Do you set memory limits on the agent?

Josh Greenhalgh

03/09/2021, 2:26 PM
version: prefecthq/prefect:0.14.11-python3.8
describe:
Events:
  Type     Reason     Age                     From     Message
  ----     ------     ----                    ----     -------
  Normal   Pulled     33m (x291 over 22h)     kubelet  Successfully pulled image "prefecthq/prefect:0.14.11-python3.8"
  Warning  Unhealthy  13m (x590 over 22h)     kubelet  Liveness probe failed: Get http://10.32.1.19:8080/api/health: dial tcp 10.32.1.19:8080: connect: connection refused
  Warning  BackOff    8m47s (x3591 over 22h)  kubelet  Back-off restarting failed container
  Normal   Pulling    3m39s (x298 over 22h)   kubelet  Pulling image "prefecthq/prefect:0.14.11-python3.8"
memory limits: nope
This is the full definition using the Terraform k8s provider (if it helps):
resource "kubernetes_deployment" "prefect_agent" {
  metadata {
    name      = "prefect-agent"
    namespace = kubernetes_namespace.prefect.metadata.0.name
    labels = {
      app = "prefect-agent"
    }
  }

  spec {
    replicas = 1

    selector {
      match_labels = {
        app = "prefect-agent"
      }
    }

    template {
      metadata {
        labels = {
          app = "prefect-agent"
        }
      }

      spec {
        node_selector = {
          "cloud.google.com/gke-nodepool" = google_container_node_pool.fixed_compute.name
        }
        container {
          name    = "agent"
          image   = "prefecthq/prefect:0.14.11-python3.8"
          command = ["/bin/bash", "-c"]
          args    = ["prefect agent kubernetes start"]

          env {
            name  = "PREFECT__CLOUD__API"
            value = "http://<HIDDEN>:4200/graphql"
          }

          env {
            name  = "NAMESPACE"
            value = "prefect"
          }

          env {
            name  = "PREFECT__CLOUD__AGENT__LABELS"
            value = "['prefect-agent']"
          }

          env {
            name  = "PREFECT__BACKEND"
            value = "server"
          }

          resources {
            limits = {
              cpu    = "100m"
              memory = "128Mi"
            }
          }

          liveness_probe {
            http_get {
              path = "/api/health"
              port = "8080"
            }

            initial_delay_seconds = 40
            period_seconds        = 40
            failure_threshold     = 2
          }

          image_pull_policy = "Always"
        }
      }
    }
  }
}

Zanie

03/09/2021, 3:01 PM
I believe you need to set the local address for the health check server to run, i.e. https://github.com/PrefectHQ/server/blob/master/helm/prefect-server/templates/agent/deployment.yaml#L68-L69
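For reference, a minimal sketch of what that setting might look like inside the container block of the Terraform definition above. It assumes the variable in question is PREFECT__CLOUD__AGENT__AGENT_ADDRESS (the one emitted by prefect agent kubernetes install) and that the port should match the 8080 used by the liveness probe; both the name and the value should be checked against the linked template.

```hcl
          # Assumption: this tells the agent to start its local health check
          # server, which is what the liveness probe on /api/health expects.
          # The port here must match liveness_probe.http_get.port (8080).
          env {
            name  = "PREFECT__CLOUD__AGENT__AGENT_ADDRESS"
            value = "http://:8080"
          }
```

Without an address configured, nothing listens on port 8080, which would be consistent with the "connection refused" liveness failures in the pod events.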

Josh Greenhalgh

03/09/2021, 3:54 PM
Hmm, OK. I removed that since it appeared to be related to the
cloud
version?
Thanks!

Zanie

03/09/2021, 4:00 PM
There are a few instances where
CLOUD
settings are just "backend" settings that we didn't have a better name for when Server was split out.

Josh Greenhalgh

03/09/2021, 4:04 PM
OK, so the issue is that that env var is required to set up the endpoint that the health check probes? In my case there is nothing there, so it keeps failing?