# ask-community
m
Hey! I’d like to share an issue I’m facing with Prefect that someone who uses Kubernetes might be able to help with. I’m registering workflows using Docker storage, and the registration + build of the image happens through a CI/CD pipeline that runs on Kubernetes (AWS EKS).
The environment: my CI/CD environment is containerised (a pod on Kubernetes). This means that when registering a flow, the Prefect register API is building a container inside a container.
The issue: the Prefect register command (which builds the image) fails to pull libraries from PyPI, throwing connectivity errors (DNS resolution). The Kubernetes pod itself has internet access and working DNS, as I can pull the prefect library from PyPI from it, for example. But for some reason the registration command, where the image is built, throws this:
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7ffb221c2160>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /simple/pip/
See the `Temporary failure in name resolution` part.
In summary: I’m not sure whether this is a Kubernetes issue or a Prefect issue. I’m raising it here in case there is a tweak in Prefect that could make it work. Any ideas?
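One way to narrow this down is to run the same DNS lookup in both places: once in the CI pod itself and once from inside the image build. Below is a minimal sketch of such a check using only the Python standard library. The function names `can_resolve` and `check_hosts` and the host list are illustrative assumptions, not part of Prefect or pip.

```python
import socket


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if `hostname` resolves, False if the lookup fails.

    `resolver` is injectable only so the check can be exercised offline;
    by default it is the real socket.getaddrinfo.
    """
    try:
        resolver(hostname, 443)
        return True
    except socket.gaierror:
        # "Temporary failure in name resolution" is raised as socket.gaierror.
        return False


def check_hosts(hostnames, resolver=socket.getaddrinfo):
    """Map each hostname to whether it currently resolves."""
    return {host: can_resolve(host, resolver) for host in hostnames}


# Hosts chosen for illustration: pip talks to these when installing from PyPI.
PYPI_HOSTS = ("pypi.org", "files.pythonhosted.org")
```

Calling `check_hosts(PYPI_HOSTS)` in the pod and again from a step inside the image build would show whether only the nested build loses name resolution, which is what the log above suggests.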
j
I am also having a similar problem, but I’m not sure what exactly causes it. I am running a simple Docker image with my Python code, and while saving the data received from an API into an Excel file I get the following error message.
[2021-09-13 15:35:45,386] INFO - agent | Process PID 69 returned non-zero exit code
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f838499aa00>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /graphql
WARNING:urllib3.connectionpool:Retrying (Retry(total=5, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f83849a5fa0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution')': /graphql
Does anyone have any ideas or know anything about this strange error? P.S. I am running multiple flows and only one of them is throwing this error. All other flows are running without an issue.
m
Hey Junaid! In my case I have a Docker image that runs the CI/CD pipeline. I can run the same image locally on my machine, which means I can reproduce changes locally before pushing them to the pipeline. The interesting thing is that it works fine locally for me; I only get the DNS resolution issue when running inside EKS. I also don’t understand why for you it is only happening for a specific flow 🤔
j
Yes, it's strange that only one of the flows is throwing this error message. At first I also thought it might be an issue with the machine Docker is running on, so I tried on my local machine, but I get similar error messages there as well. Strange that it works on your local machine but not on mine.
k
Hey @Maikel Penz, I’ve been looking into this, but I think it’s beyond the immediate knowledge of the team. Unfortunately, it seems to be a deep Kubernetes issue.
m
Thanks @Kevin Kho, it's a very tricky one. Were you able to reproduce a workflow registration (with docker storage build) inside a Kubernetes pod ? Is there a Prefect/Kubernetes expert you can put me in contact with ?
k
I did not reproduce it, as the error could be anything from Docker not working to DNS issues. This is the best writeup I’ve seen, but it’s probably even more complicated in k8s. No one on the team immediately knows. I could forward you to the Slate Data team, who specialize in infrastructure and handle Prefect enterprise deployments? Unfortunately, this is less of a Prefect issue.
m
Thanks @Kevin Kho! Our legend @Leandro Mana saved the day by finding the solution: grant privileged mode to the pods so they can run their own Docker daemon (Docker-in-Docker). More details on the issue can be found here: https://support.cloudbees.com/hc/en-us/articles/360019236771 cc: @Junaid Usman
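For anyone landing here later, a rough sketch of what "privileged mode" could look like in a pod spec, written as a Python dict that is dumped to JSON (which `kubectl apply -f` accepts alongside YAML). The pod name, container name, and image name are hypothetical placeholders, and the actual manifest used in the thread was not shared, so treat this as an assumption about the shape rather than the exact fix.

```python
import json


def privileged_ci_pod(name="ci-register", image="my-ci-image:latest"):
    """Build a minimal Pod manifest whose container runs in privileged mode.

    `name` and `image` are hypothetical placeholders. The relevant part is
    securityContext.privileged, which lets the container start its own
    Docker daemon (Docker-in-Docker).
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [
                {
                    "name": "ci",
                    "image": image,
                    "securityContext": {"privileged": True},
                }
            ]
        },
    }


def to_json(manifest):
    """Serialize the manifest so it can be written to a file for kubectl."""
    return json.dumps(manifest, indent=2)
```

Privileged containers have broad access to the node, so it is probably worth limiting this to the CI pods that actually need to build images.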
k
What a legend lol
l
😆