Gabriel Milan

    8 months ago
    Hello there! I'm having some issues with my Istio-injected Kubernetes jobs. I've deployed the Prefect Helm chart into an Istio-injected namespace in one cluster (say cluster-one) with the agent (agent-one) enabled. I've also deployed another Kubernetes agent (agent-two) in another cluster (say cluster-two), also in an Istio-injected namespace that shares the same Istio service mesh as cluster-one. Both agent-one and agent-two can successfully register with the apollo server (deployed on cluster-one) and query for runs. Unfortunately, when jobs are launched (by either agent-one or agent-two), I get the following exception:
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe788176d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    But the weird thing is that if I open a terminal inside the same job pod and run
    curl prefect-apollo.prefect:4200
    I can successfully get an answer from the apollo server. Has anyone run into anything similar before? Or tried Istio with Prefect on Kubernetes?
    Anna Geller

    8 months ago
    Is there any specific reason why you decided to split everything across different clusters? I think if both Server and the agent were deployed to the same cluster, it would be much easier from a networking and management perspective.
    Did you add this config on both agents?
    [server]
    endpoint = "http://YOUR_MACHINES_PUBLIC_IP:4200/graphql"
    Gabriel Milan

    8 months ago
    in our use case, agents must be split among multiple clusters (for cost splitting reasons)
    how can I check this config?
    Anna Geller

    8 months ago
    I think you would need to exec into the agent pod and check ~/.prefect/config.toml
    Gabriel Milan

    8 months ago
    there's no such file in my agent pods.
    ~/.prefect
    does exist, but the config file doesn't.
    I haven't changed any of the agent's default configuration other than the Prefect labels.
    Anna Geller

    8 months ago
    Gotcha. Could you try adding that file to the agent pod?
    Gabriel Milan

    8 months ago
    But shouldn't this be a separate issue? As I've mentioned, both of my agents seem to be fine (registered, querying for runs, and even launching them successfully). Aside from that, there's no public IP for my apollo deployment; I'm using
    prefect-apollo
    as the hostname, since I've got a
    prefect-apollo
    service in my namespaces.
    Anna Geller

    8 months ago
    I don’t know how such a service would work instead of adding the host name, but if you don’t tell your agent which apollo endpoint it should use to poll for flow runs scheduled by your Server instance, it will default to Prefect Cloud’s endpoint rather than your Server endpoint. I would connect the agent via host name, because that is a pattern Server supports for sure.
    Gabriel Milan

    8 months ago
    my agent deployment didn't have the
    config.toml
    file, but it does have
    PREFECT__CLOUD__API
    set to
    http://prefect-apollo.prefect:4200/graphql
    which works perfectly fine. I'm not having issues with my agent.
    as you can see from the screenshot, both agent-one and agent-two are registered to my prefect server instance
    the thing is that the jobs launched by them are raising exceptions, even though they run in the same namespace as the agents themselves
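    For reference, this kind of endpoint override is typically set as an environment variable on the agent Deployment. A minimal sketch of the relevant excerpt; the env var name PREFECT__CLOUD__API is the real Prefect 0.15.x setting, but the container name, image tag, and args here are illustrative, not taken from the actual manifests:

    ```yaml
    # Hypothetical excerpt from the agent Deployment pod spec.
    containers:
      - name: agent
        image: prefecthq/prefect:0.15.9
        args: ["prefect", "agent", "kubernetes", "start"]
        env:
          # Points the agent (and the jobs it spawns) at the Server's
          # apollo endpoint instead of the Prefect Cloud default.
          - name: PREFECT__CLOUD__API
            value: http://prefect-apollo.prefect:4200/graphql
    ```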
    Anna Geller

    8 months ago
    Gotcha. I’ll ask more questions to gather more information:
    1. Do you use the same Prefect version across all components?
    2. Can you share how you define your run_config and storage?
    3. How and from where do you register your flows?
    4. Did you define labels for your agents?
    5. Are you explicitly matching your flows with specific agents?
    It does indeed look like your agents are registered, but something is wrong either in flow registration or in the flows being picked up by agents and executed with the Server backend.
    Gabriel Milan

    8 months ago
    alright, so:
    1. Yes, I've pinned everything to
    0.15.9
    and
    core-0.15.9
    2.
    say_hello_flow.storage = GCS(constants.GCS_FLOWS_BUCKET.value)  # datario-public
    say_hello_flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value)  # ghcr.io/prefeitura-rio/prefect-flows:0a676d98950581e01d0b713cf0acaa4b722fbf6e
    3. We use GitHub Actions for building our image (based on prefect 0.15.9, with our dependencies added on top) and registering our flows with
    prefect register --project $PREFECT__SERVER__PROJECT -p pipelines/
    through kubectl port-forward.
    4. Yes I did: agent-one has label
    emd
    and agent-two has label
    rj-sme
    5. I'm not sure I understood the question, but I use labels to match my flow runs with my agents.
    Full log for a job, if it helps:
    Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/prefect-job-2dd42001-kp5vm. Please use `kubectl.kubernetes.io/default-container` instead
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
        conn = connection.create_connection(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
        raise err
      File "/opt/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection
        sock.connect(sa)
    ConnectionRefusedError: [Errno 111] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
        httplib_response = self._make_request(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 394, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
        super(HTTPConnection, self).request(method, url, body=body, headers=headers)
      File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
        self.send(msg)
      File "/usr/local/lib/python3.9/http/client.py", line 980, in send
        self.connect()
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
        conn = self._new_conn()
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
        resp = conn.urlopen(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
        retries = retries.increment(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/venv/bin/prefect", line 8, in <module>
        sys.exit(cli())
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/prefect/cli/execute.py", line 53, in flow_run
        result = client.graphql(query)
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 548, in graphql
        result = self.post(
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 451, in post
        response = self._request(
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 737, in _request
        response = self._send_request(
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 602, in _send_request
        response = session.post(
      File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 590, in post
        return self.request('POST', url, data=data, json=json, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/requests/adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    Hello there! Are there any updates about this?
    Anna Geller

    8 months ago
    Hi Gabriel, I don’t see anything obvious that may be wrong. I think you may try to explicitly pass agent labels to your KubernetesRun; that’s what I meant with #5.
    say_hello_flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value, labels=["emd"])
    Additionally, perhaps you can try adding a custom Kubernetes job template and passing it to KubernetesRun as well? The error looks like a permission issue. Perhaps it can be solved by attaching a service account in your flow’s job template? Sorry if I’m not too helpful here, but this is a rather complex DevOps & networking issue with these multi-cluster Istio-injected Server deployments. I’ll share your issue; maybe others from engineering can help more.
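    A custom job template along the lines suggested here could be built as a plain dict and handed to the run config. This is only a minimal sketch: the service account name prefect-jobs and the container name are illustrative assumptions, not taken from the thread.

    ```python
    # Minimal Kubernetes Job template as a plain dict (field names follow
    # the Kubernetes batch/v1 Job spec). Prefect merges this template with
    # its own defaults when it creates the flow-run job.
    job_template = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {
            "template": {
                "spec": {
                    # Hypothetical service account carrying whatever RBAC
                    # permissions the flow-run pod needs.
                    "serviceAccountName": "prefect-jobs",
                    "containers": [{"name": "flow"}],
                }
            }
        },
    }

    # In prefect 0.15.x it would then be passed like this (shown as a
    # comment since it needs the prefect package installed):
    # from prefect.run_configs import KubernetesRun
    # say_hello_flow.run_config = KubernetesRun(
    #     image=constants.DOCKER_IMAGE.value,
    #     labels=["emd"],
    #     job_template=job_template,
    # )
    ```
    
    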
    Gabriel Milan

    8 months ago
    As I'm launching it from the UI, I'm adding the
    emd
    label to it. I'll try that with the job template. I'd be glad to hear from others too, but I appreciate your help so far. Thank you!
    just an update from our side: we've managed to solve it by switching to Linkerd instead of Istio. It seems the istio-proxy sidecar took too long to start up, and jobs tried to connect to the apollo server before it was ready. Since Linkerd has https://github.com/linkerd/linkerd-await, it was easy to set it up with our jobs. It finally works! Thanks again!
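    For context, linkerd-await works by wrapping the container's command so it blocks until the linkerd-proxy sidecar is ready before exec'ing the real process. A hedged sketch of what that can look like in a Prefect job pod spec; the binary path and image tag are illustrative assumptions, and linkerd-await must be present in the image:

    ```yaml
    # Illustrative flow-run pod spec: linkerd-await waits for the proxy,
    # then runs the Prefect entrypoint; --shutdown asks the proxy to stop
    # after the wrapped process exits so the Job can actually complete.
    spec:
      containers:
        - name: flow
          image: ghcr.io/prefeitura-rio/prefect-flows:<tag>
          command: ["/linkerd-await", "--shutdown", "--"]
          args: ["prefect", "execute", "flow-run"]
    ```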
    Anna Geller

    8 months ago
    thanks for the update! and great that you figured it out