I’m running prefect in kubernetes with preemptible...
# prefect-community
j
I’m running Prefect in Kubernetes with preemptible nodes (maybe not the best idea.. 🤫 ). Regardless, I just noticed some undesired behavior. At some point the agent fails to poll the graphql service, but then it just hangs. Agreed, the agent should be able to access the backend, and admittedly my setup might not be optimal, but I don’t think hanging is the right behavior when it can’t access the backend. What is your take on this? And can we improve it? Possible suggestions:
• Increase the retry limit (to unlimited) and the backoff_factor (see the sketch below the logs for the kind of thing I mean).
• Maybe more specific to the KubernetesAgent, but adding health checks seems like a good idea nevertheless. This allows Kubernetes to monitor the agent and restart it when it does hang.
These were the logs
[2020-12-22 01:32:50,084] INFO - agent | Starting KubernetesAgent with labels []
[2020-12-22 01:32:50,084] INFO - agent | Agent documentation can be found at <https://docs.prefect.io/orchestration/>
[2020-12-22 01:32:50,084] INFO - agent | Agent connecting to the Prefect API at <http://prefect-apollo.prefect:4200/graphql/>
[2020-12-22 01:32:51,106] ERROR - agent | There was an error connecting to <http://prefect-apollo.prefect:4200/graphql/>
[2020-12-22 01:32:51,106] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 1, 'column': 9}], 'path': ['hello'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:51,106] INFO - agent | Waiting for flow runs...
[2020-12-22 01:32:52,126] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:53,590] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:55,655] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:58,719] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:33:03,768] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 02:09:45,236] ERROR - agent | HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f85aeebd3d0>, 'Connection to prefect-apollo.prefect timed out. (connect timeout=30)'))
For now I’ve scaled graphql to 2 replicas on different nodes to mitigate the problem 😉 But I think it would be nice if the behavior itself was improved.
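(For reference, the kind of retry policy suggested above, unlimited retries with a bigger backoff_factor, could be expressed at the requests level roughly like this. It is only a sketch, not the agent’s actual code; the URL is just the apollo address from the logs, and note that an uncapped retry means a single request blocks until the backend comes back.)

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Sketch of the suggestion: keep retrying connection errors (like the apollo
# connect timeout in the last log line above) with exponential backoff instead
# of giving up after a handful of attempts.
retry = Retry(
    total=None,          # no overall cap on retries
    connect=None,        # retry connection errors indefinitely
    backoff_factor=1.0,  # exponential sleep between attempts (capped by urllib3)
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))

# URL taken from the logs above; the query body is just a placeholder.
resp = session.post(
    "http://prefect-apollo.prefect:4200/graphql",
    json={"query": "{ hello }"},
    timeout=30,
)
print(resp.status_code)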
j
Hi Joel,
At some point the agent fails to poll the graphql service, but then it just hangs.
It's not clear to me from the above logs that the agent is stuck - from the above error the agent should sleep for a bit before retrying. Trying locally with a faulty address, this is the behavior I see (it retries forever, with sleeps in between, as intended). Are there no more logs after the above?
Maybe more specific for the KubernetesAgent, but adding health checks seems like a good idea nevertheless. This allows kubernetes to monitor the agent and restart the agent when it does hang.
We do have health checks, but they'd only catch a faulty process (one that can't do any work at all), not a hung connection. The health check endpoint has to be enabled, and the k8s deployment configured to use it (prefect agent kubernetes install does this automatically). See https://docs.prefect.io/orchestration/agents/overview.html#health-checks for more info.
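(As a rough illustration of what such a liveness check amounts to, assuming the agent exposes a health endpoint; the port and path below are assumptions for the sketch, not taken from the thread.)

import requests

# Hypothetical values for the sketch: an agent health endpoint on port 8080
# at /api/health, reachable from inside the pod.
AGENT_HEALTH_URL = "http://localhost:8080/api/health"

def agent_is_healthy(url: str = AGENT_HEALTH_URL) -> bool:
    # A probe like this only tells you the process still answers HTTP at all;
    # as noted above, it says nothing about a hung backend connection.
    try:
        return requests.get(url, timeout=5).ok
    except requests.RequestException:
        return False

print("healthy" if agent_is_healthy() else "unhealthy")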
Increase the retry limit (to unlimited) and the backoff_factor
What should be happening is: the agent does a couple of retries at the request level (managed by requests) before raising to the main agent loop. This loop runs forever, logging any errors and waiting in between attempts. So after the last log above there should be a sleep, then another retry, and so on forever. When I try locally, this is the behavior I see. If this isn't happening, I suspect things are getting stuck at a lower level. Can you provide any more information about what's triggering this? (The above logs indicate that the agent never successfully connected to the backend.)
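(Simplified sketch of the loop behavior described here, not Prefect's actual source:)

import logging
import time

logger = logging.getLogger("agent")

def query_for_runs() -> None:
    # Stand-in for the GraphQL poll; requests performs its few retries
    # internally before an error escapes. Simulate the failure from the logs:
    raise ConnectionError("backend unreachable")

def agent_loop(poll_interval: float = 10.0) -> None:
    # The outer loop is meant to run forever: any error that escapes the
    # request-level retries is logged, then the agent sleeps and tries again.
    while True:
        try:
            query_for_runs()
        except Exception as exc:
            logger.error(exc)
        time.sleep(poll_interval)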
j
Hey Jim,
…. Are there no more logs after the above?
Nope, this is it. The UI also reported that the agent hasn’t connected in many hours, thus I suspected that it was stuck after that log line.
We do have health checks, but they’d only catch a faulty process
Ouch, my bad. I overlooked the liveness check in the deployment 😅
What should be happening is the agent does a couple of retries at the request level (managed by requests) before raising to the main agent loop.
Hmm, that would be the more logical behavior.. Unfortunately, after the requests backoff errors it seems like it didn’t retry any longer. I’ll see if I can replicate the behavior I observed by setting graphql to 0 replicas..
j
Great. If you can get some more information, this would make a good issue (I suspect this will take more iteration than Slack can provide). If you wouldn't mind opening an issue at https://github.com/PrefectHQ/prefect/issues when you're ready, that'd help us work towards resolving this. Thanks!
j
Hm, not sure, but I can’t seem to replicate it. I tried the following:
1. replicas = 0 (graphql) --> yields ECONNREFUSED indefinitely (well, for a couple of minutes at least)
2. replicas = 2, delete service --> yields ENOTFOUND indefinitely
3. Mimicked the observed logs by first setting replicas = 0 and, after 10 errors, deleting the service.
Maybe it was just some weird race condition, I’ll keep an eye on it 👀 Thanks at least 🙂
👍 1
Hey @Jim Crist-Harif, I thought I just ran into the same issue again, with the agent playing dead. But something else might be going on.. The UI reports that the agent hasn’t queried in 6 hours, and checking the logs on the agent I see a similar error (connection timeout). However, I just manually triggered a flow, and the agent performs the flow just fine 😮 Could it be that the “ping/health” loop breaks, but that this doesn’t affect the “checking for work” loop? (See the sketch after the logs for the kind of split I mean.)
Logs
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 05:12:26,655] ERROR - agent | [{'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['tenant'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 05:12:35,743] ERROR - agent | [{'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['tenant'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:02:24,953] ERROR - agent | HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=30)
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:42:23,112] INFO - agent | Found 1 flow run(s) to submit for execution.
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:42:23,223] INFO - agent | Deploying flow run 927e9eb6-e5bf-4e50-a67e-83b0f9680634
I’ve set the agent logs to debug; hopefully that gives some more info. Are there also some logs about the agent on the backend that I could collect?
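(To make the question above concrete, here is a purely hypothetical sketch, not Prefect's agent code: a heartbeat loop running in its own daemon thread dies on a single unhandled error, so the backend stops seeing the agent as connected, while a separate work-polling loop keeps deploying flow runs.)

import threading
import time

def update_backend_heartbeat() -> None:
    # Hypothetical stand-in for the "agent last queried" update; pretend it times out.
    raise TimeoutError("read timeout talking to apollo")

def poll_for_flow_runs() -> None:
    # Hypothetical stand-in for the "check for work" call; this one keeps working.
    print("checked for flow runs")

def heartbeat_loop() -> None:
    while True:  # no try/except: the first error silently ends this thread
        update_backend_heartbeat()
        time.sleep(60)

def polling_loop() -> None:
    while True:
        try:
            poll_for_flow_runs()
        except Exception:
            pass
        time.sleep(10)

threading.Thread(target=heartbeat_loop, daemon=True).start()
polling_loop()  # keeps running even though the heartbeat thread is already dead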
d
Hey @Joël Luijmes! By any chance, is your agent running a newer version of Prefect than your Prefect Server instance?
Also, the team works on a rotation to help everyone in our community. It’s totally fine to send new thread replies to the channel, but please avoid pinging individual team members.
👍 1
j
Uhm, not sure what the best way is to check the versions. But it seems that the agent is running 0.14 and the “core version” is 0.13.19+173.GF2EEEB0A4.
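(For reference, one way to check the version installed in a given container or environment, assuming Python with Prefect importable there:)

import prefect

# Prints the version of the prefect package in this environment, e.g. "0.14.0".
# Running this inside the agent image and inside a flow's run environment
# shows whether the two line up.
print(prefect.__version__)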
I’m using the latest helm chart, so it should pull the latest versions. I’ll dive into it to see what’s wrong… 🤓
Hm, the default tag in the chart is “latest”, so maybe something happened there and the images didn’t align properly 🤷
d
The best rule of thumb is to keep your agent + server deployment ahead of your core version
The agent and server can support old versions of core
j
Okay, and speaking in terms of the helm services, which do you call server and which core?
d
Prefect Core is the flows themselves. So, if you’re running with Kubernetes, it’s the version of Prefect that’s running in the k8s jobs that are created by the agent.
Prefect Server includes all of the helm services, minus the agent.