# prefect-server
r
Hi, we have a Prefect Server (version 2022.04.14) deployed with Helm on a K8s cluster in GCP. I'm trying to run a flow (a flow of flows) for my tests, multiple times with different inputs. So the situation is that I'm triggering 11 parent flows, each of which runs up to 3 child flows (these child flows have 3-4 tasks each). Sometimes I get this error:
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
I thought this was a scaling issue, so I replicated the services as follows:
• agent: 3
• UI: 2
• apollo: 3
• graphql: 2
• hasura: 2
• towel: 2
We are using an external Postgres (GCP managed). I've seen in earlier messages that I should configure:
1. PREFECT__CLOUD__REQUEST_TIMEOUT = 60 (configured as an env variable on the apollo pod)
2. PREFECT_SERVER__TELEMETRY__ENABLED = false (configured as an env variable on the agent pod)
3. PREFECT__CLOUD__HEARTBEAT_MODE = thread (configured as an env variable on the agent pod)
I've attached the values.yaml. Can anyone help me with what else I can do / what the problem could be?
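For reference, this is roughly the flow-of-flows pattern we're running (a simplified sketch; the flow and project names below are placeholders, not our actual ones):
Copy code
from prefect import Flow
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

# Simplified parent flow: trigger a child flow run and wait for it to finish.
# Flow and project names are placeholders.
with Flow("parent-flow") as parent_flow:
    child_run_id = create_flow_run(flow_name="child-flow", project_name="testing")
    wait_for_flow_run(child_run_id)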
a
Thanks for this detailed description! I think you've done everything right on the infrastructure side. So the problem is that you get timeout errors when running 33 flow runs in parallel on your Kubernetes agent? This should be easily doable with Prefect Server without having to scale out Apollo. Also, you need only one Kubernetes agent since the actual execution happens within separate pods, as explained in this Discourse topic.
r
Yes (something like ~40 Prefect jobs in parallel)
a
What happens when you remove the replicas definition and just go with the default setup from here?
r
I will test and update you.
👍 1
I'm getting the same error 😞
BTW the error is:
HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
Shouldn't it be 60?
k
I don’t think that env variable should be on the Apollo pod; the Prefect Python Client uses it when querying, so it should be set on the agent or wherever the flow run is running.
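One quick way to check where the setting actually lands (just a debugging idea, assuming I remember the config path correctly) is to print the effective client config from inside the process that makes the requests, e.g. the flow-run pod:
Copy code
import prefect

# If PREFECT__CLOUD__REQUEST_TIMEOUT reached this process, this prints 60;
# the default is 15, which matches the "read timeout=15" in the error above.
print(prefect.config.cloud.request_timeout)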
r
As you can see, I added it to the agent first; it didn't have any effect, so I tried putting it on the apollo pod. Are you saying it should be on the job template for the flow?
k
Ah yeah, the agent won't pass the env variable to the flow automatically unless you define it with the --env flag, so you can try passing it to the flow through the RunConfig.
Copy code
flow.run_config = KubernetesRun(..., env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"})
r
adding it to the job_template won't work?
k
It will, if you add that env variable to the job template for the flow as well. Does the flow work when the request goes through?
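Roughly like this, for example; the template below is only a sketch (your real job template will have more fields), and note that Kubernetes expects env values as strings:
Copy code
from prefect.run_configs import KubernetesRun

# Illustrative custom job template; "flow" is the container name used by
# Prefect's default Kubernetes job template.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "flow",
                        "env": [
                            {"name": "PREFECT__CLOUD__REQUEST_TIMEOUT", "value": "60"}
                        ],
                    }
                ]
            }
        }
    },
}
flow.run_config = KubernetesRun(job_template=job_template)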
r
That's exactly my problem: I'm not sure what the problem is, because sometimes it works and sometimes it doesn't. I'm adding it to the flow's job_template now and will update you 🙏
a
sometimes it works and sometimes it doesn't.
my favorite types of problems 😂
we should make it a meme
It would be good to figure out the root cause of this timeout. Are you passing a large payload, e.g. to your parameter tasks? Can you share some flow code?
r
Unfortunately, I can't share the code. But can you define "large"?
r
I also realize I forgot to share that the flow of flows is running with the DaskExecutor, if that makes any difference.
a
You can redact your flow for security, or build a shareable/reproducible example that results in the same error.
This definitely makes a difference, since Dask has an entirely separate scheduler and execution plane. Perhaps switching to the LocalDaskExecutor is worth trying? Using Dask on a parent flow doesn't make much sense; Dask could be more helpful in the child flows.
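For example, switching the parent flow to local threads would look roughly like this (the scheduler and worker count are just an illustration):
Copy code
from prefect.executors import LocalDaskExecutor

# Run the parent flow's tasks with local threads instead of a separate Dask cluster.
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)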
r
Sorry. I'll look into the LocalDaskExecutor
a
no need to apologize, those are all valid use cases!
r
@Khen Price
k
Regarding the payload question: I will double check, but I’m fairly confident we don’t submit any parameters even close to that size; it’s mostly booleans and UUIDs being passed around.
👍 1
k
I'm personally wondering whether the other Apollo pods are actually being used. Are you confident they are? Maybe you could also try a single pod with more resources? I think the requests are getting bottlenecked.
r
I've already tried removing the scaling, but it's still happening. For now, we will investigate the LocalDaskExecutor tomorrow and update you. Thanks a lot 🙂
d
Can I ask a dumb question? I can see you scaled the number of pod replicas as follows:
Copy code
agent: 3
UI: 2
apollo: 3
graphql: 2
hasura: 2
towel: 2
• Are all these pods behind a K8s Service to ensure the routing is done properly?
• Is your agent connecting via a public URL or the local K8s URL?
upvote 1
r
Yes, they are behind a service. But I don't really understand what you meant.
This is the deployment through the Helm chart; I didn't change any definitions besides the values.
a
What davzucky means (I think) is that having more replicas doesn't necessarily ensure they get used; there must be some load-balancing service to ensure the requests are distributed across those replicas. By default, both Hasura and GraphQL have ClusterIP as the service type; to load balance, you would likely need the LoadBalancer type instead (I haven't tried that myself).
r
But the service today (ClusterIP) already gives me "load balancing" (of some sort) without exposing the services outside of the cluster, and if I change it to an external load balancer, the traffic will go out of and back into the cluster, which would add more latency. So my question is: why would I want to change it, and what would the benefit be (besides more reliable load-balancing options)?
@Noam polak
a
Unfortunately, I don't know enough about managing Kubernetes infrastructure to help you out in detail. We have a paid service with infrastructure experts for that, which you could book via cs@prefect.io. But coming back to the original issue: 40 flow runs in parallel shouldn't be a scaling problem large enough to cause this error:
Copy code
HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
It could be some network latency issue. Can you say more about where you run this and what your networking setup looks like?
r
Anna, I just wanted to thank you very much for the quick responses and the attention. We are currently testing the flows without Dask at all, and it looks like that solved the original issue, but we got a new problem:
Copy code
request to http://prefect-hasura.prefect:3000/v1alpha1/graphql failed, reason: connect ECONNREFUSED 10.10.11.76:3000
So I'm trying to replicate the GraphQL pod, because it seems like it can't handle the workload (after that I'm planning to replicate the Hasura pod as well). Can you share your thoughts about how and when to scale up the GraphQL and Hasura pods? Do they expose any metrics I can look at in order to auto-scale them?
Sorry, I missed it. It is still happening.
About the networking: it's a private GKE cluster, nothing special, deployed with Helm.
By the way, in our production environment we have the same setup with a different Prefect version:
Production - 2021.07.06
Testing - 2022.04.14
In the production env we are not getting these errors, and it's running perfectly (for now; we had the same workload a few days ago with 12 executions in parallel).
k
We don’t really advise on more detailed, high-availability setups like this; at that point we just suggest going to Prefect Cloud instead. On it working in prod but not in testing: that's pretty hard to guess, but that connection-refused error really points to a networking issue. You’d have to get time with Customer Success, because we don’t dive deep into these setups; there are too many requests and they take a lot of time. And I personally really can't help with that.
👍 1
a
Can you share your thoughts about how and when to scale up the GraphQL and Hasura pods? Do they expose any metrics I can look at in order to auto-scale them?
Hard to say; perhaps you could measure the CPU and network utilization of the underlying instances? That's how e.g. AWS seems to trigger autoscaling policies.
Production - 2021.07.06
Testing - 2022.04.14
In production env, we are not getting these errors. And it's running perfectly (for now - we had the same workload a few days ago with 12 executions in parallel).
This error actually implies that something may be wrong in your Hasura setup, since the testing version is using Hasura 2.0. I'd definitely cross-check that. I've added some ideas for debugging similar issues here - perhaps those can help you too. Keep us posted on how it goes!
cc @Ron Meshulam
r
OK, thanks a lot for all the help and guidance. I'll update you as soon as possible.
👍 1