I’m running prefect in kubernetes with preemptible...
# prefect-community
j
I’m running Prefect in Kubernetes with preemptible nodes (maybe not the best idea.. 🤫 ). Regardless, I just noticed some undesired behavior. At some point the agent fails to poll the graphql service, but then it just hangs. Agreed, the agent should be able to access the backend, and admittedly my setup might not be optimal, but I don’t think hanging is the right behavior when it can’t access the backend. What is your take on this? And can we improve it? Possible suggestions:
• Increase the retry limit (to unlimited) and the backoff_factor (see the sketch below the logs for the kind of thing I mean).
• Maybe more specific to the KubernetesAgent, but adding health checks seems like a good idea nevertheless. This allows Kubernetes to monitor the agent and restart it when it does hang.
These were the logs
[2020-12-22 01:32:50,084] INFO - agent | Starting KubernetesAgent with labels []
[2020-12-22 01:32:50,084] INFO - agent | Agent documentation can be found at <https://docs.prefect.io/orchestration/>
[2020-12-22 01:32:50,084] INFO - agent | Agent connecting to the Prefect API at <http://prefect-apollo.prefect:4200/graphql/>
[2020-12-22 01:32:51,106] ERROR - agent | There was an error connecting to <http://prefect-apollo.prefect:4200/graphql/>
[2020-12-22 01:32:51,106] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 1, 'column': 9}], 'path': ['hello'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:51,106] INFO - agent | Waiting for flow runs...
[2020-12-22 01:32:52,126] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:53,590] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:55,655] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:58,719] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:33:03,768] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 02:09:45,236] ERROR - agent | HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f85aeebd3d0>, 'Connection to prefect-apollo.prefect timed out. (connect timeout=30)'))
For now I’ve scaled graphql to 2 replicas on different nodes to mitigate the problem 😉 But I think it would be nice if the behavior itself was improved.
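(For reference, the kind of retry policy suggested above, unlimited retries with a bigger backoff_factor, could be expressed at the requests level roughly like this. It is only a sketch, not the agent’s actual code; the URL is just the apollo address from the logs, and note that an uncapped retry means a single request blocks until the backend comes back.)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch of the suggestion: keep retrying connection errors (like the apollo
# connect timeout in the last log line above) with exponential backoff instead
# of giving up after a handful of attempts.
retry = Retry(
    total=None,          # no overall cap on retries
    connect=None,        # retry connection errors indefinitely
    backoff_factor=1.0,  # exponential sleep between attempts (capped by urllib3)
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))

# URL taken from the logs above; the query body is just a placeholder.
resp = session.post(
    "http://prefect-apollo.prefect:4200/graphql",
    json={"query": "{ hello }"},
    timeout=30,
)
print(resp.status_code)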
j
Hi Joel,
At some point the agent fails to poll the graphql service, but then it just hangs.
It's not clear to me from the above logs that the agent is stuck - from the above error the agent should sleep for a bit before retrying. Trying locally with a faulty address, this is the behavior I see (it retries forever, with sleeps in between, as intended). Are there no more logs after the above?
Maybe more specific for the KubernetesAgent, but adding health checks seems like a good idea nevertheless. This allows kubernetes to monitor the agent and restart the agent when it does hang.
We do have health checks, but they'd only catch a faulty process (one that can't do any work at all), not a hung connection. The health check endpoint has to be enabled, and the k8s deployment configured to use it (prefect agent kubernetes install does this automatically). See https://docs.prefect.io/orchestration/agents/overview.html#health-checks for more info.
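(As a rough illustration of what such a liveness check amounts to, assuming the agent exposes a health endpoint; the port and path below are assumptions for the sketch, not taken from the thread.)

import requests

# Hypothetical values for the sketch: an agent health endpoint on port 8080
# at /api/health, reachable from inside the pod.
AGENT_HEALTH_URL = "http://localhost:8080/api/health"

def agent_is_healthy(url: str = AGENT_HEALTH_URL) -> bool:
    # A probe like this only tells you the process still answers HTTP at all;
    # as noted above, it says nothing about a hung backend connection.
    try:
        return requests.get(url, timeout=5).ok
    except requests.RequestException:
        return False

print("healthy" if agent_is_healthy() else "unhealthy")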
Increase the retry limit (to unlimited) and the backoff_factor
What should be happening is: the agent does a couple of retries at the request level (managed by requests) before raising to the main agent loop. This loop runs forever, logging any errors and waiting in between attempts. So after the last log above there should be a sleep, then another retry, and so on forever. When I try locally, this is the behavior I see. If this isn't happening, I suspect things are getting stuck at a lower level. Can you provide any more information about what's triggering this? (The above logs indicate that the agent never successfully connected to the backend.)
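(Simplified sketch of the loop behavior described here, not Prefect's actual source:)

import logging
import time

logger = logging.getLogger("agent")

def query_for_runs() -> None:
    # Stand-in for the GraphQL poll; requests performs its few retries
    # internally before an error escapes. Simulate the failure from the logs:
    raise ConnectionError("backend unreachable")

def agent_loop(poll_interval: float = 10.0) -> None:
    # The outer loop is meant to run forever: any error that escapes the
    # request-level retries is logged, then the agent sleeps and tries again.
    while True:
        try:
            query_for_runs()
        except Exception as exc:
            logger.error(exc)
        time.sleep(poll_interval)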
j
Hey Jim,
…. Are there no more logs after the above?
Nope, this is it. The UI also reported that the agent hasn’t connected in many hours, thus I suspected that it was stuck after that log line.
We do have health checks, but they’d only catch a faulty process
Ouch, my bad. I overlooked the liveness check in the deployment 😅
What should be happening is the agent does a couple of retries at the request level (managed by requests) before raising to the main agent loop.
Hmm, that would be the more logical behavior.. Unfortunately, after the requests backoff errors it seems like it didn’t retry any longer. I’ll see if I can replicate the behavior I observed by setting graphql to 0 replicas..
j
Great. If you can get some more information, this would make a good issue (I suspect this will take more iteration than Slack can provide). If you wouldn't mind opening an issue at https://github.com/PrefectHQ/prefect/issues when you're ready, that'd help us work towards resolving this. Thanks!
j
Hm, not sure, but I can’t seem to replicate it. I tried the following:
1. replicas = 0 (graphql) --> yields ECONNREFUSED indefinitely (well, for a couple of minutes at least)
2. replicas = 2, delete service --> yields ENOTFOUND indefinitely
3. Mimicked the observed logs by first setting replicas = 0 and, after 10 errors, deleting the service.
Maybe it was just some weird race condition, I’ll keep an eye on it 👀 Thanks at least 🙂
👍 1
Hey @Jim Crist-Harif, I thought I just ran into the same issue again, with the agent playing dead. But something else might be going on.. The UI reports that the agent hasn’t queried in 6 hours, and checking the logs on the agent I see a similar error (connection timeout). However, I just manually triggered a flow, and the agent performs the flow just fine 😮 Could it be that the “ping/health” loop breaks, but that this doesn’t affect the “checking for work” loop? (See the sketch after the logs for the kind of split I mean.)
Logs
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 05:12:26,655] ERROR - agent | [{'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['tenant'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 05:12:35,743] ERROR - agent | [{'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['tenant'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:02:24,953] ERROR - agent | HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=30)
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:42:23,112] INFO - agent | Found 1 flow run(s) to submit for execution.
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:42:23,223] INFO - agent | Deploying flow run 927e9eb6-e5bf-4e50-a67e-83b0f9680634
I’ve set the agent logs to debug; hopefully that gives some more info. Are there also some logs about the agent on the backend that I could collect?
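(To make the question above concrete, here is a purely hypothetical sketch, not Prefect's agent code: a heartbeat loop running in its own daemon thread dies on a single unhandled error, so the backend stops seeing the agent as connected, while a separate work-polling loop keeps deploying flow runs.)

import threading
import time

def update_backend_heartbeat() -> None:
    # Hypothetical stand-in for the "agent last queried" update; pretend it times out.
    raise TimeoutError("read timeout talking to apollo")

def poll_for_flow_runs() -> None:
    # Hypothetical stand-in for the "check for work" call; this one keeps working.
    print("checked for flow runs")

def heartbeat_loop() -> None:
    while True:  # no try/except: the first error silently ends this thread
        update_backend_heartbeat()
        time.sleep(60)

def polling_loop() -> None:
    while True:
        try:
            poll_for_flow_runs()
        except Exception:
            pass
        time.sleep(10)

threading.Thread(target=heartbeat_loop, daemon=True).start()
polling_loop()  # keeps running even though the heartbeat thread is already dead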
d
Hey @Joël Luijmes! By any chance, is your agent running a newer version of Prefect than your Prefect Server instance?
Also, the team works on a rotation to help everyone in our community. It’s totally fine to send new thread replies to the channel, but please avoid pinging individual team members.
👍 1
j
Uhm, not sure what the best way is to check the versions. But it seems that the agent is running 0.14 and the “core version” is 0.13.19+173.GF2EEEB0A4.
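(For reference, one way to check the version installed in a given container or environment, assuming Python with Prefect importable there:)

import prefect

# Prints the version of the prefect package in this environment, e.g. "0.14.0".
# Running this inside the agent image and inside a flow's run environment
# shows whether the two line up.
print(prefect.__version__)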
I’m using the latest helm chart, so it should pull the latest versions. I’ll dive into it to see what’s wrong… 🤓
Hm, the default tag in the chart is “latest”, so maybe something happened there and the images didn’t align properly 🤷
d
The best rule of thumb is to keep your agent + server deployment ahead of your core version
The agent and server can support old versions of core
j
Okay, and speaking in terms of the helm services, which do you call server and which core?
d
Prefect Core is the flows themselves. So, if you’re running with Kubernetes, it’s the version of Prefect that’s running in the k8s jobs that are created by the agent.
Prefect Server includes all of the helm services, minus the agent.