Jars

03/09/2023, 6:12 PM
Our Prefect Cloud 1.0 runs are piling up, and not executing. Is there a Prefect outage?
Our GKE agent is erroring with:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9ab3f9f4e0>: Failed to establish a new connection: [Errno -3] Temporary failure in name resolution',))

jawnsy

03/09/2023, 6:30 PM
Thanks for the report! We’re not aware of any issues, and https://cloud.prefect.io is loading fine for me right now. The website you’re using also talks to the same API endpoint, so it seems like a DNS resolution error inside your cluster. I’d look at kube-dns and then your provider’s DNS system (e.g. with GKE, I’d check that the metadata service and DNS response policy are all working properly).
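If it helps, here’s a minimal sketch of a DNS check you could run from inside a pod in the affected cluster (plain Python standard library; the hostname is just the one from the error above):
```
import socket

host = "api.prefect.io"  # hostname from the error above

try:
    # Resolve the name the same way requests/urllib3 would.
    infos = socket.getaddrinfo(host, 443, proto=socket.IPPROTO_TCP)
    print("resolved:", sorted({info[4][0] for info in infos}))
except socket.gaierror as exc:
    # "Temporary failure in name resolution" surfaces here when cluster DNS is unhealthy.
    print("DNS resolution failed:", exc)
```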

Jars

03/09/2023, 6:32 PM
Thanks Jawnsy. Our runs just kicked off, so I think we're in the clear now. However, the Chat button in the top right of Cloud was also blinking red, and clicking it did not open the chat window. I thought that could be further evidence that (parts of) the API were offline.

jawnsy

03/09/2023, 6:33 PM
Ah, thanks for letting us know! We’ll investigate that issue

nicholas

03/09/2023, 8:03 PM
Hi @Jars - thanks for the report, this issue has been resolved

Jars

03/09/2023, 8:04 PM
thanks guys

Chris White

03/09/2023, 10:22 PM
I've seen that name resolution issue before when kube-dns has issues within your k8s control plane; I'm not the best at debugging k8s, but maybe that helps you know where to look for some deeper error logs!

Jars

03/09/2023, 11:22 PM
thanks @Chris White, we'll look into that next time it happens. 👍
👍 1

Brett

03/10/2023, 1:37 PM
@jawnsy @nicholas Could you share the issue / resolution please?

jawnsy

03/10/2023, 1:40 PM
There were two issues here, as I understand it:
1. There was a cluster DNS issue of some kind wherever the agent was running, unrelated to the Prefect Cloud service.
2. Prefect Cloud had an error in our Content-Security-Policy, which prevented the chat bot from working correctly. We updated our policy to resolve that issue.
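For anyone curious what the browser side of this looks like, a rough illustration of inspecting the policy a site currently serves (the exact directives Prefect Cloud uses aren’t shown in this thread):
```
import requests

# Rough illustration only: look at the Content-Security-Policy a site currently serves.
resp = requests.get("https://cloud.prefect.io", timeout=15)
print(resp.headers.get("Content-Security-Policy", "<no CSP header in response>"))
```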

Brett

03/10/2023, 1:41 PM
Okay, thanks! I had a similar problem; I guess I will have to take a look at my cluster DNS.
Actually, this is the error I am getting:
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7ff92db43370>: Failed to establish a new connection: [Errno 101] Network is unreachable'))
Could this have been because of the CSP as well?

jawnsy

03/10/2023, 2:08 PM
No, CSP will only affect browsers and seems unrelated to what you’re seeing here. Do you have other applications running in the same cluster that make outbound connections to the Internet? Some things this could be:
1. NetworkPolicy in Kubernetes that limits egress (possible if you’re running in a hardened environment, but not likely otherwise)
2. Firewall rule blocking egress (not the default in clouds; possible if you’re running in a hardened environment)
3. Missing network tag, resulting in the wrong firewall rules being applied
4. Routing misconfiguration or missing NAT gateway (I think this is uncommon)
What cloud are you running in? All of them should offer a Reachability Analyzer tool that will give you more diagnostics.
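To tell an egress/routing problem apart from a DNS one, a minimal probe from inside the affected pod might look like this (standard-library Python; host and port taken from the error messages above):
```
import socket

host, port = "api.prefect.io", 443

try:
    # DNS failures raise gaierror; blocked egress / missing NAT typically shows up
    # as "Network is unreachable" or a timeout instead.
    with socket.create_connection((host, port), timeout=10):
        print(f"TCP connect to {host}:{port} succeeded")
except socket.gaierror as exc:
    print("DNS resolution failed:", exc)
except OSError as exc:
    print("connection failed (routing/firewall/NAT?):", exc)
```
A name-resolution failure and a "Network is unreachable" error land in different branches here, which narrows down which of the causes above to chase.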

Brett

03/10/2023, 2:11 PM
Thanks, this is in Azure. No, it's happening sporadically. We have flows that have been running fine for months, and suddenly they are getting this error, usually for long-running tasks (1-2 hours).
But they are sometimes completing as well.
I was able to catch an error from the pod as well just before it stopped.
[2023-03-09 16:30:39+0000] INFO - prefect.CloudTaskRunner | Task 'update_stop_on_failure': Finished task run for task with final state: 'Pending'
/usr/local/lib/python3.8/site-packages/prefect/executors/dask.py:314: RuntimeWarning: coroutine 'rpc.close_rpc' was never awaited
scheduler_comm.close_rpc()
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
[2023-03-09 16:30:40+0000] INFO - prefect.CloudFlowRunner | Flow run RUNNING: terminal tasks are incomplete.

jawnsy

03/10/2023, 2:15 PM
Hmm, that’s interesting. Intermittent problems like this are often very tricky to diagnose, unfortunately.

Brett

03/10/2023, 2:16 PM
😆 Hence why I'm here. Thanks for giving me some feedback on what to take a look at.

jawnsy

03/10/2023, 2:20 PM
Are you running AKS with default container networking? It could be an issue with the CNI plugin, though that’s also unlikely unless you’re running a very large cluster (GKE limits cluster size so that it stays below their networking scalability limits, and presumably AKS does the same if you’re using their managed offerings). It could also be something weird like a Path MTU setting, or something blocking Path MTU Discovery. That also seems rare, because I think most of us use whatever VPC defaults the provider gives us.

Brett

03/10/2023, 2:28 PM
As far as I can tell, we're using the default container networking. I don't think my cluster is large. What is a Path MTU setting? I'm sorry, I'm pretty new to Kubernetes.

Jars

03/10/2023, 2:34 PM
Our experience is the same as Brett's. We have flows running for very long periods of time, and suddenly we start getting `HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url:` errors. I should also mention that, lately, our team receives these same errors sporadically when building/registering a flow from their local machines, not from any cluster:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='api.prefect.io', port=443): Max retries exceeded with url: /graphql (Caused by ReadTimeoutError("HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out. (read timeout=15)"))
It is usually resolved with a retry after five minutes. It's happened to multiple team members running independent workstations, so I'm not sure it's DNS.
:thank-you: 1
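Since a retry after a few minutes usually clears it, one stopgap is to wrap registration in a simple retry loop. This is only a sketch, assuming Prefect 1.x's `flow.register(project_name=...)` API; `my_flow` and the project name are placeholders:
```
import time

import requests

def register_with_retries(flow, project_name, attempts=3, wait_seconds=300):
    """Retry flow.register() when Prefect Cloud is briefly unreachable."""
    for attempt in range(1, attempts + 1):
        try:
            return flow.register(project_name=project_name)
        except requests.exceptions.ConnectionError as exc:
            if attempt == attempts:
                raise
            print(f"register attempt {attempt} failed ({exc}); retrying in {wait_seconds}s")
            time.sleep(wait_seconds)

# Usage (placeholder names): register_with_retries(my_flow, project_name="my-project")
```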

jawnsy

03/10/2023, 2:35 PM
It’s not a Kubernetes-specific thing but a general networking problem, though I think that’s unlikely too, unless you’ve customized your environment a lot. Networks have a maximum packet size setting called the MTU (https://en.wikipedia.org/wiki/Maximum_transmission_unit), and larger packets can be dropped if Path MTU Discovery isn’t working correctly (e.g. the network has a setting of 1000 bytes and you send a packet that is 1500 bytes). Nowadays I think everyone uses a default of 1490 or 1500 bytes, and it looks like Azure does as well.
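If you want to rule out an odd local setting, the configured interface MTUs are easy to read from inside a pod or node. A minimal sketch for Linux (it only shows local values and can’t detect a smaller MTU elsewhere on the path):
```
from pathlib import Path

# Print the configured MTU of each local network interface (Linux only).
for mtu_file in sorted(Path("/sys/class/net").glob("*/mtu")):
    print(mtu_file.parent.name, mtu_file.read_text().strip())
```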
thanks for that added info @Jars, I’ll share with the team and we’ll investigate on our end
🙌 2

Brett

03/10/2023, 2:35 PM
This was above but is it just a red herring?
/usr/local/lib/python3.8/site-packages/prefect/executors/dask.py:314: RuntimeWarning: coroutine 'rpc.close_rpc' was never awaited
scheduler_comm.close_rpc()
From the actual pod logs just before it got removed.
@jawnsy Thanks! Is there anything we could do on our end to log more information or help with your debugging?
Like logging the packet size maybe? If that's even possible.

jawnsy

03/10/2023, 5:58 PM
If it’s happening to multiple people, it seems more likely that there’s a problem on our side somewhere; we’ll need some time to look into it.
👍 1

Brett

03/10/2023, 7:26 PM
If it helps, it started happening to me on the 8th of March.
:thank-you: 1
@jawnsy Not to be a bother, but do you have any more information on this or a ticket we could follow?

jawnsy

03/13/2023, 6:19 PM
Hey! I appreciate the follow up. We don’t have a public tracker for cloud issues but if you have a support plan you can open a ticket with our team and track that way
👍 1

Brett

03/14/2023, 1:39 PM
Update/Feedback: looks like it's been at least 24 hours, maybe more, since I've gotten this issue.

alex

03/16/2023, 4:38 PM
I've been having this issue for some flows starting March 10 as well
Triggering a quick run from the UI works, but scheduled flows are stuck in Pending

Brett

03/22/2023, 6:25 PM
Update #2: I'm seeing these errors again. Now they are followed by a network error when attempting to connect to our SQL Server.

jawnsy

03/22/2023, 8:40 PM
Thanks for the update Brett and I’m sorry you’re still experiencing issues. We looked at our logs and monitoring, but everything has been inconclusive so far

Chris White

03/22/2023, 9:12 PM
If you’re seeing SQL Server network errors, that suggests an internal networking issue.

Brett

03/23/2023, 5:28 PM
@Chris White I would agree, but it's not that we can't connect at all; it's that we can't connect after we get this first HTTP network error. It's stabilized since yesterday. From the past two experiences, it seems that the connections are more unstable after we do a deploy to our AKS cluster, but then mellow out over the course of a day to a day and a half.

Jars

05/19/2023, 4:20 PM
We are receiving this error, again, from multiple local development workstations when attempting to call `flow.register`:
HTTPSConnectionPool(host='api.prefect.io', port=443): Read timed out.
Is there any request or response ID that we can correlate? What kind of information would you require to look up this request on the server side? Project/flow/image names?
We are running 0.14.22 (I know it's old 🙂, we are in the process of upgrading to 2.0). Question: would it be related to this timeout setting in config.toml?

jawnsy

05/19/2023, 6:29 PM
Timeouts are often a network issue somewhere, most likely in your infrastructure. For our Cloud service, you should never get a timeout, because there is an internal 30-second timeout, after which a 408 response will be returned if the client timed out, or a 503 response if the server timed out. We also monitor timeouts on our side. You can try increasing that request timeout, but 15 seconds is already quite a long time. I’m also not sure what kind of timeout that is (either an HTTP request timeout or a TCP timeout, and either of those could be the issue here).
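For what it’s worth, here is a sketch of bumping the client-side timeout in Prefect 1.x. I believe the relevant setting is `cloud.request_timeout` (its 15-second default matches the `read timeout=15` in the error above) and that it can be overridden via an environment variable, but treat the exact key as an assumption and verify it against your config.toml:
```
import os

# Assumption: Prefect 1.x reads cloud.request_timeout and honors
# PREFECT__CLOUD__REQUEST_TIMEOUT as an override; check config.toml before relying on this.
os.environ["PREFECT__CLOUD__REQUEST_TIMEOUT"] = "30"  # seconds; set before importing prefect

import prefect
print(prefect.config.cloud.request_timeout)
```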