# ask-community
g
Hello there! I'm having some issues with my Istio-injected Kubernetes jobs: I've deployed the Prefect Helm chart into an Istio-injected namespace in one cluster (say cluster-one) with an agent (agent-one) enabled. I've also deployed another Kubernetes agent (agent-two) in a second cluster (say cluster-two; also in an Istio-injected namespace that shares the same Istio service mesh as cluster-one). Both agent-one and agent-two can successfully register with the Apollo server (deployed on cluster-one) and query for runs. Unfortunately, when jobs are launched (by either agent-one or agent-two), I get the following exception:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe788176d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
But the weird thing is that if I open a terminal inside the same job pod and try to
curl prefect-apollo.prefect:4200
, I can successfully get an answer from the Apollo server. Has anyone run into anything similar before? Or tried Istio with Prefect on Kubernetes?
a
Is there any specific reason why you decided to separate it all into different clusters? I think if both Server and the agent were deployed to the same cluster, it would be much easier from a networking and management perspective
Did you add this config on both agents?
Copy code
[server]
endpoint = "<http://YOUR_MACHINES_PUBLIC_IP:4200/graphql>"
g
in our use case, agents must be split among multiple clusters (for cost-splitting reasons)
how can I check this config?
a
I think you would need to exec into the agent pod and check ~/.prefect/config.toml
g
there's no such file in my agent pods.
~/.prefect
does exist, but the config file doesn't
I haven't changed any of the agent's default configuration except for the Prefect labels
a
Gotcha. Could you try adding that file on the agent pod?
g
but shouldn't that be a separate issue? As I've mentioned, both of my agents seem to be fine (registered, querying for runs, and even launching them successfully). Aside from that, there's no public IP for my Apollo deployment; I'm using
prefect-apollo
as the hostname, since I've got a
prefect-apollo
service in my namespaces
a
I don’t know how such a service would work instead of adding the host name, but if you don’t tell your agent which Apollo endpoint it should use to poll for flow runs scheduled by your Server instance, it will default to Prefect Cloud’s endpoint rather than your Server endpoint. I would connect the agent via the host name, because that is a pattern Server definitely supports.
g
my agent deployment didn't have the
config.toml
file, but it does have the
PREFECT__CLOUD__API
set to
http://prefect-apollo.prefect:4200/graphql
, which works perfectly fine. I'm not having issues with my agent.
as you can see from the screenshot, both agent-one and agent-two are registered with my Prefect Server instance
the thing is that the jobs launched by them are raising exceptions, even though they're in the same namespace as the agents themselves
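For reference, a quick way to confirm which endpoint the Prefect client inside a job pod will actually use is to resolve it from the client itself. This is only a minimal sketch, assuming Prefect 0.15.x and that the job pod inherits PREFECT__CLOUD__API from the agent:
Copy code
# Run inside the job pod (e.g. via kubectl exec) to see where GraphQL calls will go.
import prefect
from prefect.client import Client

print(prefect.config.backend)     # should be "server", not "cloud"
print(prefect.config.cloud.api)   # endpoint resolved from PREFECT__CLOUD__API / config.toml
print(Client().api_server)        # the URL the client will POST /graphql requests to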
a
Gotcha. I’ll ask a few more questions to gather more information:
1. Do you use the same Prefect version across all components?
2. Can you share how you define your run_config and storage?
3. How and from where do you register your flows?
4. Did you define labels for your agents?
5. Are you explicitly matching your flows with specific agents?
It does look like your agents are registered, but something is wrong either in the flow registration or in how the flows are picked up by the agents and executed against the Server backend
g
alright, so 1. yes, I've pinned everything to
0.15.9
and
core-0.15.9
2.
Copy code
say_hello_flow.storage = GCS(constants.GCS_FLOWS_BUCKET.value) # datario-public
say_hello_flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value) # ghcr.io/prefeitura-rio/prefect-flows:0a676d98950581e01d0b713cf0acaa4b722fbf6e
3. we use GitHub Actions for building our image (based on Prefect 0.15.9, with our dependencies added on top) and for registering our flows with
prefect register --project $PREFECT__SERVER__PROJECT -p pipelines/
through kubectl port-forward 4. yes I did: agent-one has the label
emd
and agent-two has label
rj-sme
5. I'm not sure I understood the question, but I use labels for matching my flow runs with my agents. Here's the full log for a job, if it helps:
Copy code
Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/prefect-job-2dd42001-kp5vm. Please use `kubectl.kubernetes.io/default-container` instead
Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/opt/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
    raise err
  File "/opt/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 394, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
    super(HTTPConnection, self).request(method, url, body=body, headers=headers)
  File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
    self.send(msg)
  File "/usr/local/lib/python3.9/http/client.py", line 980, in send
    self.connect()
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
    conn = self._new_conn()
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
    resp = conn.urlopen(
  File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
    retries = retries.increment(
  File "/opt/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/bin/prefect", line 8, in <module>
    sys.exit(cli())
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/prefect/cli/execute.py", line 53, in flow_run
    result = client.graphql(query)
  File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 548, in graphql
    result = self.post(
  File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 451, in post
    response = self._request(
  File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 737, in _request
    response = self._send_request(
  File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 602, in _send_request
    response = session.post(
  File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 590, in post
    return self.request('POST', url, data=data, json=json, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "/opt/venv/lib/python3.9/site-packages/requests/adapters.py", line 516, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
Hello there! Are there any updates on this?
a
Hi Gabriel, I don’t see anything obvious that might be wrong. I think you could try explicitly passing agent labels to your KubernetesRun; that’s what I meant with #5.
Copy code
say_hello_flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value, labels=["emd"])
Additionally, perhaps you can try adding a custom Kubernetes job template and passing it to KubernetesRun as well? The error looks like it might be a permissions issue; perhaps it can be solved by attaching a service account in your flow’s job template. Sorry if I’m not too helpful here, but this is a fairly complex DevOps and networking issue with these multi-cluster, Istio-injected Server deployments. I’ll share your issue; maybe others from engineering can help more.
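For reference, a custom job template can be passed to KubernetesRun directly as a dict. This is only a minimal sketch assuming Prefect 0.15.x; the prefect-flow-runner service account name is a placeholder, and the agent merges this template with its default job spec:
Copy code
from prefect.run_configs import KubernetesRun

# Sketch of a job template that only overrides the pod's service account;
# the agent still fills in the image, command, env vars, etc.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                # hypothetical service account -- create it and its RBAC bindings separately
                "serviceAccountName": "prefect-flow-runner",
                "containers": [{"name": "flow"}],
            }
        }
    },
}

say_hello_flow.run_config = KubernetesRun(
    image=constants.DOCKER_IMAGE.value,
    labels=["emd"],
    job_template=job_template,
)
If only the service account needs changing, KubernetesRun also accepts a service_account_name argument directly.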
g
As I'm launching it from the UI, I'm adding the
emd
label to it. I'll try that with the job template. I'd be glad to hear from others too, but I appreciate your help so far. Thank you!
👍 1
just an update from our side: we've managed to solve it by switching to Linkerd instead of Istio. It seems the istio-proxy sidecar took too long to start, so the jobs tried to reach the Apollo server before the proxy was ready. Since Linkerd has https://github.com/linkerd/linkerd-await, it was easy to set up with our jobs. It finally works! Thanks again!
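For anyone landing on this thread later: the linkerd-await pattern essentially wraps the flow-run process so it only starts once the mesh proxy is ready. A rough sketch of one way to wire it up through a custom job template, assuming linkerd-await is installed in the flow image; the exact command/args depend on how the agent builds its jobs, and baking the wrapper into the image's ENTRYPOINT is another option:
Copy code
from prefect.run_configs import KubernetesRun

# Sketch only: run the flow-run command behind linkerd-await so it starts after
# the proxy is ready, and shut the proxy down when the flow run exits.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "flow",
                        "command": ["linkerd-await", "--shutdown", "--"],
                        "args": ["prefect", "execute", "flow-run"],
                    }
                ]
            }
        }
    },
}

say_hello_flow.run_config = KubernetesRun(
    image=constants.DOCKER_IMAGE.value,
    labels=["emd"],
    job_template=job_template,
)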
a
thanks for the update! and great you figured that out
👍 1