I am having a lot of trouble with liveness and readiness pro Prefect Community #prefect-server


Josh Greenhalgh

01/27/2021, 5:27 PM

I am having a lot of trouble with the liveness and readiness probes of the graphql and hasura deployments. This is the current state of my pods:


prefect-agent-55775d5684-qml7k     1/1     Running   0          5h21m
prefect-apollo-b5d9c4cd8-82txc     1/1     Running   1          28h
prefect-graphql-5f77cd4674-wmfbj   0/1     Running   21         22h
prefect-hasura-bf4dd5d95-z2j8x     0/1     Running   24         28h
prefect-job-e63a7de9-nwrxg         1/1     Running   0          20m
prefect-postgresql-0               1/1     Running   0          28h
prefect-towel-5dc84cb477-zx2rg     1/1     Running   0          28h
prefect-ui-8bd5ff9f8-2scsz         1/1     Running   0          22h

They keep on restarting and going into CrashLoopBackOff with errors like:


Warning  Unhealthy  9m39s (x45 over 22h)   kubelet  Readiness probe failed: Get http://10.40.14.208:4201/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  9m37s (x41 over 22h)   kubelet  Liveness probe failed: Get http://10.40.14.208:4201/health: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
Warning  Unhealthy  5m18s (x35 over 109m)  kubelet  Liveness probe failed: Get http://10.40.14.208:4201/health: dial tcp 10.40.14.208:4201: connect: connection refused

Has anybody had similar problems and did they manage to solve them?
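The events above show two different failure modes: a timeout while waiting for response headers, and a refused connection. A minimal Python sketch of one HTTP check that tells them apart is below. It is not the kubelet's implementation; the function name `check_health` is hypothetical, and the 1-second default is an assumption based on the Kubernetes default `timeoutSeconds`, since the timeout the chart actually sets is not shown in this thread.

```python
import socket
import urllib.error
import urllib.request

# Ignore any proxy environment variables: a probe connects to the pod directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def check_health(url: str, timeout_seconds: float = 1.0) -> str:
    """Run one HTTP GET and return 'ok', 'timeout', 'refused', or 'error: ...'."""
    try:
        with _opener.open(url, timeout=timeout_seconds) as response:
            # HTTP probes count any status from 200 to 399 as success.
            if 200 <= response.status < 400:
                return "ok"
            return f"error: HTTP {response.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        return f"error: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # Failures while connecting or sending the request are wrapped here.
        if isinstance(exc.reason, (socket.timeout, TimeoutError)):
            return "timeout"
        if isinstance(exc.reason, ConnectionRefusedError):
            return "refused"
        return f"error: {exc.reason}"
    except (socket.timeout, TimeoutError):
        # Connected, but no response headers arrived in time.
        return "timeout"


if __name__ == "__main__":
    # Address taken from the events above; only reachable from inside the cluster.
    print(check_health("http://10.40.14.208:4201/health"))
```

A "timeout" result means the process accepted the connection but answered too slowly, which is consistent with a starved container. A "refused" result means nothing was listening on the port, for example while the container was restarting.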

Zanie

01/27/2021, 5:51 PM

Hi @Josh Greenhalgh — I haven’t seen this before. What overrides do you have? Are the nodes large enough for them? Can you look at the logs for the pods?
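For pods that keep restarting, the logs of the current container are often empty or cut short, so the previous container's logs are usually the useful ones. The sketch below, in Python, reads a `kubectl get pods` listing like the one earlier in the thread and builds a `kubectl logs --previous` command for each pod that is not fully ready. The helper names `parse_pods` and `log_commands` are hypothetical, the namespace is not stated in the thread (add `-n <namespace>` if the pods are not in the default one), and the tail length of 100 lines is an arbitrary choice.

```python
from typing import List, NamedTuple


class PodStatus(NamedTuple):
    name: str
    ready: int      # containers currently ready
    total: int      # containers in the pod
    restarts: int


def parse_pods(listing: str) -> List[PodStatus]:
    """Parse the NAME / READY / STATUS / RESTARTS columns of `kubectl get pods`."""
    pods = []
    for line in listing.strip().splitlines():
        fields = line.split()
        # Skip blank lines and the header row, if one was pasted.
        if len(fields) < 4 or fields[0] == "NAME":
            continue
        ready, total = fields[1].split("/")
        pods.append(PodStatus(fields[0], int(ready), int(total), int(fields[3])))
    return pods


def log_commands(listing: str, tail: int = 100) -> List[List[str]]:
    """Build one `kubectl logs --previous` command per pod that is not ready."""
    return [
        ["kubectl", "logs", pod.name, "--previous", f"--tail={tail}"]
        for pod in parse_pods(listing)
        if pod.ready < pod.total
    ]


# Listing copied from the thread.
LISTING = """
prefect-agent-55775d5684-qml7k     1/1     Running   0          5h21m
prefect-apollo-b5d9c4cd8-82txc     1/1     Running   1          28h
prefect-graphql-5f77cd4674-wmfbj   0/1     Running   21         22h
prefect-hasura-bf4dd5d95-z2j8x     0/1     Running   24         28h
prefect-job-e63a7de9-nwrxg         1/1     Running   0          20m
prefect-postgresql-0               1/1     Running   0          28h
prefect-towel-5dc84cb477-zx2rg     1/1     Running   0          28h
prefect-ui-8bd5ff9f8-2scsz         1/1     Running   0          22h
"""

if __name__ == "__main__":
    for command in log_commands(LISTING):
        print(" ".join(command))
```

Each command is returned as an argument list so it can be passed to `subprocess.run` unchanged, or joined with spaces and pasted into a shell.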

Josh Greenhalgh

01/27/2021, 5:52 PM

Overrides are pretty basic:


agent:

  # enabled determines if the Prefect Kubernetes agent is deployed
  enabled: true
  prefectLabels: ['prefect-agent']

ui:
  apolloApiUrl: http://***:4200/graphql

postgresql:
  postgresqlPassword: ***

serviceAccount:
  name: "***"

Josh Greenhalgh

01/27/2021, 5:54 PM

very small nodes...

Josh Greenhalgh

01/27/2021, 5:54 PM

That may be it, I suppose: 1 CPU, 4 GB

Josh Greenhalgh

01/27/2021, 6:03 PM

So looking in the Helm chart, I can't find where each deployment specifies its resource requirements. What should the minimum node size be?

Mariia Kerimova

01/27/2021, 7:07 PM

Hey Josh. I doubt that it's an issue with CPU and memory constraints, and you should be able to run Prefect Server with that node size. Are you able to post some pod logs?

Josh Greenhalgh

01/28/2021, 11:43 AM

@Mariia Kerimova I think the nodes were being shared with Composer Airflow, so it wasn't getting anywhere close to the full node's resources. Going to try an isolated node pool and see how that goes. If I still have the same issues, I will dig into the logs to share.

