Gabriel Milan

    8 months ago
    Hello there! I'm having some issues with my Istio-injected Kubernetes jobs. I've deployed the Prefect Helm chart into an Istio-injected namespace in one cluster (say cluster-one) with the agent (agent-one) enabled. I've also deployed another Kubernetes agent (agent-two) in another cluster (say cluster-two), also in an Istio-injected namespace that shares the same Istio service mesh as cluster-one. Both agent-one and agent-two can successfully register with the apollo server (deployed on cluster-one) and query for runs. Unfortunately, when jobs are launched (by either agent-one or agent-two), I get the following exception:
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7efe788176d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    But the weird thing is that if I open a terminal inside the same job pod and run
    curl prefect-apollo.prefect:4200
    I can successfully get an answer from the apollo server. Has anyone run into anything similar before? Or tried Istio with Prefect on Kubernetes?
    Anna Geller

    8 months ago
    Is there any specific reason why you decided to split everything across different clusters? I think if both Server and the agent were deployed to the same cluster, it would be much easier from a networking and management perspective.
    Did you add this config on both agents?
    [server]
    endpoint = "http://YOUR_MACHINES_PUBLIC_IP:4200/graphql"
    Gabriel Milan

    8 months ago
    in our use case, agents must be split among multiple clusters (for cost splitting reasons)
    how can I check this config?
    Anna Geller

    8 months ago
    I think you would need to exec into the agent pod and check ~/.prefect/config.toml
    Gabriel Milan

    8 months ago
    there's no such file in my agent pods.
    ~/.prefect
    does exist, but the config file doesn't.
    I haven't changed any of the agent's default configuration other than the Prefect labels.
    Anna Geller

    8 months ago
    Gotcha. Could you try adding that file to the agent pod?
    Gabriel Milan

    8 months ago
    But shouldn't this be a separate issue? As I've mentioned, both of my agents seem to be fine (registered, querying for runs, and even launching them successfully). Aside from that, there's no public IP for my apollo deployment; I'm using
    prefect-apollo
    as the hostname, since I've got a
    prefect-apollo
    service in my namespaces.
    Anna Geller

    8 months ago
    I don’t know how such a service would work instead of adding the host name, but if you don’t tell your agent which apollo endpoint it should use to poll for flow runs scheduled by your Server instance, it will default to Prefect Cloud’s endpoint rather than your Server endpoint. I would connect the agent via host name, because that is a pattern Server supports for sure.
    Gabriel Milan

    8 months ago
    my agent deployment didn't have the
    config.toml
    file, but it does have
    PREFECT__CLOUD__API
    set to
    http://prefect-apollo.prefect:4200/graphql
    which works perfectly fine. I'm not having issues with my agent.
    as you can see from the screenshot, both agent-one and agent-two are registered to my prefect server instance
    the thing is that the jobs launched by them are raising exceptions, even though they run in the same namespace as the agents themselves
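    For reference, this kind of endpoint override is typically set as an environment variable on the agent Deployment. A minimal sketch of the relevant excerpt; the env var name PREFECT__CLOUD__API is the real Prefect 0.15.x setting, but the container name, image tag, and args here are illustrative, not taken from the actual manifests:

    ```yaml
    # Hypothetical excerpt from the agent Deployment pod spec.
    containers:
      - name: agent
        image: prefecthq/prefect:0.15.9
        args: ["prefect", "agent", "kubernetes", "start"]
        env:
          # Points the agent (and the jobs it spawns) at the Server's
          # apollo endpoint instead of the Prefect Cloud default.
          - name: PREFECT__CLOUD__API
            value: http://prefect-apollo.prefect:4200/graphql
    ```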
    Anna Geller

    8 months ago
    Gotcha. I’ll ask more questions to gather more information:
    1. Do you use the same Prefect version across all components?
    2. Can you share how you define your run_config and storage?
    3. How and from where do you register your flows?
    4. Did you define labels for your agents?
    5. Are you explicitly matching your flows with specific agents?
    It does indeed look like your agents are registered, but something is wrong either in flow registration or in the flows being picked up by agents and executed with the Server backend.
    Gabriel Milan

    8 months ago
    alright, so:
    1. Yes, I've pinned everything to
    0.15.9
    and
    core-0.15.9
    2.
    say_hello_flow.storage = GCS(constants.GCS_FLOWS_BUCKET.value)  # datario-public
    say_hello_flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value)  # ghcr.io/prefeitura-rio/prefect-flows:0a676d98950581e01d0b713cf0acaa4b722fbf6e
    3. We use GitHub Actions for building our image (based on prefect 0.15.9, with our dependencies added on top) and registering our flows with
    prefect register --project $PREFECT__SERVER__PROJECT -p pipelines/
    through kubectl port-forward.
    4. Yes I did: agent-one has label
    emd
    and agent-two has label
    rj-sme
    5. I'm not sure I understood the question, but I use labels to match my flow runs with my agents.
    Full log for a job, if it helps:
    Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/prefect-job-2dd42001-kp5vm. Please use `kubectl.kubernetes.io/default-container` instead
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 174, in _new_conn
        conn = connection.create_connection(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 96, in create_connection
        raise err
      File "/opt/venv/lib/python3.9/site-packages/urllib3/util/connection.py", line 86, in create_connection
        sock.connect(sa)
    ConnectionRefusedError: [Errno 111] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 699, in urlopen
        httplib_response = self._make_request(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 394, in _make_request
        conn.request(method, url, **httplib_request_kw)
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 239, in request
        super(HTTPConnection, self).request(method, url, body=body, headers=headers)
      File "/usr/local/lib/python3.9/http/client.py", line 1285, in request
        self._send_request(method, url, body, headers, encode_chunked)
      File "/usr/local/lib/python3.9/http/client.py", line 1331, in _send_request
        self.endheaders(body, encode_chunked=encode_chunked)
      File "/usr/local/lib/python3.9/http/client.py", line 1280, in endheaders
        self._send_output(message_body, encode_chunked=encode_chunked)
      File "/usr/local/lib/python3.9/http/client.py", line 1040, in _send_output
        self.send(msg)
      File "/usr/local/lib/python3.9/http/client.py", line 980, in send
        self.connect()
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 205, in connect
        conn = self._new_conn()
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connection.py", line 186, in _new_conn
        raise NewConnectionError(
    urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/venv/lib/python3.9/site-packages/requests/adapters.py", line 439, in send
        resp = conn.urlopen(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/connectionpool.py", line 755, in urlopen
        retries = retries.increment(
      File "/opt/venv/lib/python3.9/site-packages/urllib3/util/retry.py", line 574, in increment
        raise MaxRetryError(_pool, url, error or ResponseError(cause))
    urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    
    During handling of the above exception, another exception occurred:
    
    Traceback (most recent call last):
      File "/opt/venv/bin/prefect", line 8, in <module>
        sys.exit(cli())
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1128, in __call__
        return self.main(*args, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1053, in main
        rv = self.invoke(ctx)
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1659, in invoke
        return _process_result(sub_ctx.command.invoke(sub_ctx))
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 1395, in invoke
        return ctx.invoke(self.callback, **ctx.params)
      File "/opt/venv/lib/python3.9/site-packages/click/core.py", line 754, in invoke
        return __callback(*args, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/prefect/cli/execute.py", line 53, in flow_run
        result = client.graphql(query)
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 548, in graphql
        result = self.post(
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 451, in post
        response = self._request(
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 737, in _request
        response = self._send_request(
      File "/opt/venv/lib/python3.9/site-packages/prefect/client/client.py", line 602, in _send_request
        response = session.post(
      File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 590, in post
        return self.request('POST', url, data=data, json=json, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 542, in request
        resp = self.send(prep, **send_kwargs)
      File "/opt/venv/lib/python3.9/site-packages/requests/sessions.py", line 655, in send
        r = adapter.send(request, **kwargs)
      File "/opt/venv/lib/python3.9/site-packages/requests/adapters.py", line 516, in send
        raise ConnectionError(e, request=request)
    requests.exceptions.ConnectionError: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Max retries exceeded with url: /graphql (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f7e839886d0>: Failed to establish a new connection: [Errno 111] Connection refused'))
    Hello there! Are there any updates about this?
    Anna Geller

    8 months ago
    Hi Gabriel, I don’t see anything obvious that may be wrong. I think you may try to explicitly pass agent labels to your KubernetesRun; that’s what I meant with #5.
    say_hello_flow.run_config = KubernetesRun(image=constants.DOCKER_IMAGE.value, labels=["emd"])
    Additionally, perhaps you can try adding a custom Kubernetes job template and passing it to KubernetesRun as well? The error looks like a permission issue. Perhaps it can be solved by attaching a service account in your flow’s job template? Sorry if I’m not too helpful here, but this is a rather complex DevOps & networking issue with these multi-cluster Istio-injected Server deployments. I’ll share your issue; maybe others from engineering can help more.
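    A custom job template along the lines suggested here could be built as a plain dict and handed to the run config. This is only a minimal sketch: the service account name prefect-jobs and the container name are illustrative assumptions, not taken from the thread.

    ```python
    # Minimal Kubernetes Job template as a plain dict (field names follow
    # the Kubernetes batch/v1 Job spec). Prefect merges this template with
    # its own defaults when it creates the flow-run job.
    job_template = {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "spec": {
            "template": {
                "spec": {
                    # Hypothetical service account carrying whatever RBAC
                    # permissions the flow-run pod needs.
                    "serviceAccountName": "prefect-jobs",
                    "containers": [{"name": "flow"}],
                }
            }
        },
    }

    # In prefect 0.15.x it would then be passed like this (shown as a
    # comment since it needs the prefect package installed):
    # from prefect.run_configs import KubernetesRun
    # say_hello_flow.run_config = KubernetesRun(
    #     image=constants.DOCKER_IMAGE.value,
    #     labels=["emd"],
    #     job_template=job_template,
    # )
    ```
    
    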
    Gabriel Milan

    8 months ago
    As I'm launching it from the UI, I'm adding the
    emd
    label to it. I'll try that with the job template. I'd be glad to hear from others too, but I appreciate your help so far. Thank you!
    just an update from our side: we've managed to solve it by switching to Linkerd instead of Istio. It seems the istio-proxy sidecar took too long to start up, and jobs tried to connect to the apollo server before it was ready. Since Linkerd has https://github.com/linkerd/linkerd-await, it was easy to set it up with our jobs. It finally works! Thanks again!
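    For context, linkerd-await works by wrapping the container's command so it blocks until the linkerd-proxy sidecar is ready before exec'ing the real process. A hedged sketch of what that can look like in a Prefect job pod spec; the binary path and image tag are illustrative assumptions, and linkerd-await must be present in the image:

    ```yaml
    # Illustrative flow-run pod spec: linkerd-await waits for the proxy,
    # then runs the Prefect entrypoint; --shutdown asks the proxy to stop
    # after the wrapped process exits so the Job can actually complete.
    spec:
      containers:
        - name: flow
          image: ghcr.io/prefeitura-rio/prefect-flows:<tag>
          command: ["/linkerd-await", "--shutdown", "--"]
          args: ["prefect", "execute", "flow-run"]
    ```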
    Anna Geller

    8 months ago
    thanks for the update! and great that you figured it out