Joël Luijmes
12/22/2020, 8:25 AM
[2020-12-22 01:32:50,084] INFO - agent | Starting KubernetesAgent with labels []
[2020-12-22 01:32:50,084] INFO - agent | Agent documentation can be found at <https://docs.prefect.io/orchestration/>
[2020-12-22 01:32:50,084] INFO - agent | Agent connecting to the Prefect API at <http://prefect-apollo.prefect:4200/graphql/>
[2020-12-22 01:32:51,106] ERROR - agent | There was an error connecting to <http://prefect-apollo.prefect:4200/graphql/>
[2020-12-22 01:32:51,106] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 1, 'column': 9}], 'path': ['hello'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:51,106] INFO - agent | Waiting for flow runs...
[2020-12-22 01:32:52,126] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:53,590] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:55,655] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:32:58,719] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 01:33:03,768] ERROR - agent | [{'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'locations': [{'line': 2, 'column': 5}], 'path': ['get_runs_in_queue'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-graphql.prefect:4201/graphql/> failed, reason: connect ECONNREFUSED 10.32.6.84:4201', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
[2020-12-22 02:09:45,236] ERROR - agent | HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by ConnectTimeoutError(<urllib3.connection.HTTPConnection object at 0x7f85aeebd3d0>, 'Connection to prefect-apollo.prefect timed out. (connect timeout=30)'))
For now I replicated the graphql service to 2 replicas on different nodes to mitigate the problem 😉 But I think it would be nice if the behavior itself was improved.
Jim Crist-Harif
12/22/2020, 3:37 PM
> At some point the agent fails to poll the graphql service, but then it just hangs.
It's not clear to me from the above logs that the agent is stuck - from the above error the agent should sleep for a bit before retrying. Trying locally with a faulty address, this is the behavior I see (it retries forever, with sleeps in between, as intended). Are there no more logs after the above?
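The retry-forever behavior described here amounts to a poll/log/sleep cycle. The sketch below is illustrative only, not Prefect's actual agent code; `query_for_flow_runs`, the poll interval, and the URL (taken from the logs above) are stand-ins.

```python
import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent")

# URL taken from the logs above; adjust for your deployment.
API_URL = "http://prefect-apollo.prefect:4200/graphql/"


def query_for_flow_runs() -> None:
    # Hypothetical stand-in for the agent's GraphQL poll.
    resp = requests.post(API_URL, json={"query": "{ hello }"}, timeout=30)
    resp.raise_for_status()


def agent_loop(poll_interval: float = 10.0) -> None:
    # Poll forever: log any error, sleep, then try again.
    while True:
        try:
            query_for_flow_runs()
        except Exception as exc:
            logger.error(exc)
        time.sleep(poll_interval)
```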
> Maybe more specific for the KubernetesAgent, but adding health checks seems like a good idea nevertheless. This allows kubernetes to monitor the agent and restart the agent when it does hang.
We do have health checks, but they'd only catch a faulty process (one that can't do any work at all), not a hung connection. The health check endpoint has to be enabled, and the k8s deployment configured to use it (`prefect agent kubernetes install` does this automatically). See https://docs.prefect.io/orchestration/agents/overview.html#health-checks for more info.
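For context, the health check is a small HTTP endpoint served by the agent that a Kubernetes liveness probe can poll so the pod gets restarted if the process stops responding. Below is a minimal probe sketch in Python; the agent address, port, and `/api/health` path are assumptions here, so check the linked docs and your deployment for the actual values.

```python
import sys

import requests

# Assumed agent address; the health endpoint is only served when the agent
# is started with an agent address configured (see the linked docs).
AGENT_ADDRESS = "http://prefect-agent.prefect:8080"


def agent_is_healthy(timeout: float = 3.0) -> bool:
    """Return True if the agent's health route answers with a 2xx status."""
    try:
        return requests.get(f"{AGENT_ADDRESS}/api/health", timeout=timeout).ok
    except requests.RequestException:
        return False


if __name__ == "__main__":
    # Exit non-zero on failure, so this could back an exec-style probe.
    sys.exit(0 if agent_is_healthy() else 1)
```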
> Increase the retry limit (to unlimited) and the backoff_factor
What should be happening is the agent does a couple of retries at the request level (managed by `requests`) before raising to the main agent loop. This loop runs forever, logging any errors and waiting in between attempts. So after the last log above, there should be a sleep then a retry again forever. When I try locally, this is the behavior I see. If this isn't happening, I suspect things are getting stuck at a lower level. Can you provide any more information about what's triggering this (the above logs indicate that the agent could never successfully connect to the backend)?
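The request-level retry limit and backoff_factor mentioned here are the standard knobs exposed by `requests`/`urllib3`. A minimal sketch of how they are typically wired up, illustrative only and not Prefect's actual client configuration; the URL is taken from the logs above and the specific numbers are arbitrary examples:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# URL taken from the logs above; retry values are examples only.
API_URL = "http://prefect-apollo.prefect:4200/graphql/"

retry = Retry(
    total=6,                           # per-request retry budget
    backoff_factor=1,                  # exponential sleep between attempts
    status_forcelist=[502, 503, 504],  # also retry on these HTTP statuses
    allowed_methods=["GET", "POST"],   # urllib3 >= 1.26 (older: method_whitelist)
)

session = requests.Session()
session.mount("http://", HTTPAdapter(max_retries=retry))

try:
    session.post(API_URL, json={"query": "{ hello }"}, timeout=30)
except (requests.exceptions.ConnectionError, requests.exceptions.RetryError):
    # Once the adapter exhausts its retries, the error surfaces to the caller;
    # in the agent's case, the outer loop that should log, sleep, and poll again.
    pass
```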
Joël Luijmes
12/22/2020, 8:48 PM
> … Are there no more logs after the above?
Nope, this is it. The UI also reported that the agent hasn’t connected in many hours, thus I suspected that it was stuck after that log line.
> We do have health checks, but they’d only catch a faulty process
Ouch, my bad. I overlooked the liveness check in the deployment 😅
> What should be happening is the agent does a couple of retries at the request level (managed by `requests`) before raising to the main agent loop.
Hmm, that would be the more logical behavior. Unfortunately, after the requests backoff errors it seems like it didn’t retry any longer. I’ll see if I can replicate my observed behavior if I set the graphql to 0 replicas.
Jim Crist-Harif
12/22/2020, 8:52 PM
Joël Luijmes
12/22/2020, 9:07 PM
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 05:12:26,655] ERROR - agent | [{'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['tenant'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 05:12:35,743] ERROR - agent | [{'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'locations': [{'line': 2, 'column': 5}], 'path': ['tenant'], 'extensions': {'code': 'INTERNAL_SERVER_ERROR', 'exception': {'message': 'request to <http://prefect-hasura.prefect:3000/v1alpha1/graphql> failed, reason: connect ECONNREFUSED 10.32.4.23:3000', 'type': 'system', 'errno': 'ECONNREFUSED', 'code': 'ECONNREFUSED'}}}]
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:02:24,953] ERROR - agent | HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=30)
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:42:23,112] INFO - agent | Found 1 flow run(s) to submit for execution.
prefect-agent-6fc45b78c7-7ztlz agent [2020-12-24 10:42:23,223] INFO - agent | Deploying flow run 927e9eb6-e5bf-4e50-a67e-83b0f9680634
Dylan
Joël Luijmes
12/24/2020, 4:49 PM
Dylan
Joël Luijmes
12/24/2020, 5:19 PM
Dylan