# prefect-server
r
Hi, we have a Prefect Server (version 2022.04.14) deployed with Helm on a K8s cluster in GCP. I'm trying to run a flow (a flow of flows) for my tests, multiple times with different inputs. So the situation is that I'm triggering 11 parent flows, each of which runs up to 3 child flows (these child flows have 3-4 tasks each). Sometimes I get this error:
requests.exceptions.ReadTimeout: HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
I thought this was a scaling issue, so I replicated the services as follows:
• agent: 3
• UI: 2
• apollo: 3
• graphql: 2
• hasura: 2
• towel: 2
We are using an external Postgres (GCP managed). I've seen in earlier messages that I should configure:
1. PREFECT__CLOUD__REQUEST_TIMEOUT = 60 (configured as an env variable on the apollo pod)
2. PREFECT_SERVER__TELEMETRY__ENABLED = false (configured as an env variable on the agent pod)
3. PREFECT__CLOUD__HEARTBEAT_MODE = thread (configured as an env variable on the agent pod)
I've attached the values.yaml. Can anyone help me with what else I can do / what the problem could be?
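For reference, this is roughly the flow-of-flows pattern we're running (a simplified sketch; the flow and project names below are placeholders, not our actual ones):
Copy code
from prefect import Flow
from prefect.tasks.prefect import create_flow_run, wait_for_flow_run

# Simplified parent flow: trigger a child flow run and wait for it to finish.
# Flow and project names are placeholders.
with Flow("parent-flow") as parent_flow:
    child_run_id = create_flow_run(flow_name="child-flow", project_name="testing")
    wait_for_flow_run(child_run_id)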
a
Thanks for this detailed description! I think you've done everything right on the infrastructure side. So the problem is that you get timeout errors when running 33 flow runs in parallel on your Kubernetes agent? This should be easily doable with Prefect Server without having to scale out Apollo. Also, you need only one Kubernetes agent since the actual execution happens within separate pods, as explained in this Discourse topic.
r
Yes (something like ~40 Prefect jobs in parallel)
a
What happens when you remove the replicas definition and just go with the default setup from here?
r
I will test and update you.
👍 1
I'm getting the same error 😞
BTW the error is:
HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
Shouldn't it be 60?
k
I don’t think that env variable should be on the Apollo pod; the Prefect Python Client uses it when querying, so it should be set on the agent or wherever the flow run is running.
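One quick way to check where the setting actually lands (just a debugging idea, assuming I remember the config path correctly) is to print the effective client config from inside the process that makes the requests, e.g. the flow-run pod:
Copy code
import prefect

# If PREFECT__CLOUD__REQUEST_TIMEOUT reached this process, this prints 60;
# the default is 15, which matches the "read timeout=15" in the error above.
print(prefect.config.cloud.request_timeout)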
r
As you can see, I added it to the agent first; it didn't have any effect, so I tried putting it on the apollo pod. Are you saying it should be on the job template for the flow?
k
Ah yeah, the agent won't pass the env variable to the flow automatically unless you define it with the --env flag, so you can try passing it to the flow through the RunConfig.
Copy code
flow.run_config = KubernetesRun(..., env={"PREFECT__CLOUD__REQUEST_TIMEOUT": "60"})
r
adding it to the job_template won't work?
k
It will, if you add that env variable to the job template for the flow as well. Does the flow work when the request goes through?
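Roughly like this, for example; the template below is only a sketch (your real job template will have more fields), and note that Kubernetes expects env values as strings:
Copy code
from prefect.run_configs import KubernetesRun

# Illustrative custom job template; "flow" is the container name used by
# Prefect's default Kubernetes job template.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "flow",
                        "env": [
                            {"name": "PREFECT__CLOUD__REQUEST_TIMEOUT", "value": "60"}
                        ],
                    }
                ]
            }
        }
    },
}
flow.run_config = KubernetesRun(job_template=job_template)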
r
That's exactly my problem: I'm not sure what the problem is, because sometimes it works and sometimes it doesn't. I'm adding it to the flow's job_template now and will update you 🙏
a
sometimes it works and sometimes it doesn't.
my favorite types of problems 😂
we should make it a meme
It would be good to figure out the root cause of this timeout. Are you passing a large payload, e.g. to your parameter tasks? Can you share some flow code?
r
Unfortunately, I can't share the code. But can you define "large"?
r
I also realize I forgot to share that the flow of flows is running with the DaskExecutor, if that makes any difference.
a
You can redact your flow for security, or build a shareable/reproducible example that results in the same error.
This definitely makes a difference, since Dask has an entirely separate scheduler and execution plane. Perhaps switching to the LocalDaskExecutor is worth trying? Using Dask on a parent flow doesn't make much sense; Dask could be more helpful in the child flows.
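For example, switching the parent flow to local threads would look roughly like this (the scheduler and worker count are just an illustration):
Copy code
from prefect.executors import LocalDaskExecutor

# Run the parent flow's tasks with local threads instead of a separate Dask cluster.
flow.executor = LocalDaskExecutor(scheduler="threads", num_workers=4)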
r
Sorry. I'll look into the LocalDaskExecutor
a
no need to apologize, those are all valid use cases!
r
@Khen Price
k
Regarding the payload question: I will double check, but I’m fairly confident we don’t submit any parameters even close to that size; it’s mostly booleans and UUIDs being passed around.
👍 1
k
I'm personally wondering whether the other Apollo pods are actually being used. Are you confident they are? Maybe you could also try a single pod with more resources? I think the requests are getting bottlenecked.
r
I've already tried removing the scaling, but it's still happening. For now, we will investigate the LocalDaskExecutor tomorrow and update you. Thanks a lot 🙂
d
Can I ask a dumb question? I can see you scaled the number of pod replicas as follows:
Copy code
agent: 3
UI: 2
apollo: 3
graphql: 2
hasura: 2
towel: 2
• Are all these pods behind a K8s Service to ensure the routing is done properly?
• Is your agent connecting via a public URL or the local K8s URL?
upvote 1
r
Yes, they are behind a service. But I don't really understand what you meant.
This is the deployment through the Helm chart; I didn't change any definitions besides the values.
a
What davzucky means (I think) is that having more replicas doesn't necessarily ensure they get used; there must be some load-balancing service to ensure the requests are distributed across those replicas. By default, both Hasura and GraphQL have ClusterIP as the service type; to load balance, you would likely need the LoadBalancer type instead (I haven't tried that myself).
r
But the service today (ClusterIP) already gives me "load balancing" (of some sort) without exposing the services outside of the cluster, and if I change it to an external load balancer, the traffic will go out of and back into the cluster, which would add more latency. So my question is: why would I want to change it, and what would the benefit be (besides more reliable load-balancing options)?
@Noam polak
a
Unfortunately, I don't know enough about managing Kubernetes infrastructure to help you out in detail. We have a paid service with infrastructure experts for that, which you could book via cs@prefect.io. But coming back to the original issue: 40 flow runs in parallel shouldn't be a scaling problem large enough to cause this error:
Copy code
HTTPConnectionPool(host='prefect-apollo.prefect', port=4200): Read timed out. (read timeout=15)
It could be some network latency issue. Can you say more about where you run this and what your networking setup looks like?
r
Anna, I just wanted to thank you very much for the quick responses and the attention. We are currently testing the flows without Dask at all, and it looks like that solved the original issue, but we got a new problem:
Copy code
request to http://prefect-hasura.prefect:3000/v1alpha1/graphql failed, reason: connect ECONNREFUSED 10.10.11.76:3000
So I'm trying to replicate the GraphQL pod, because it seems like it can't handle the workload (after that I'm planning to replicate the Hasura pod as well). Can you share your thoughts about how and when to scale up the GraphQL and Hasura pods? Do they expose any metrics I can look at in order to auto-scale them?
Sorry, I missed it. It is still happening.
About the networking: it's a private GKE cluster, nothing special, deployed with Helm.
By the way, in our production environment we have the same setup with a different Prefect version:
Production - 2021.07.06
Testing - 2022.04.14
In the production env we are not getting these errors, and it's running perfectly (for now; we had the same workload a few days ago with 12 executions in parallel).
k
We don’t really advise on more detailed, high-availability setups like this; at that point we just suggest going to Prefect Cloud instead. On it working in prod but not in testing: that's pretty hard to guess, but that connection-refused error really points to a networking issue. You’d have to get time with Customer Success, because we don’t dive deep into these setups; there are too many requests and they take a lot of time. And I personally really can't help with that.
👍 1
a
Can you share your thoughts about how and when to scale up the GraphQL and Hasura pods? Do they expose any metrics I can look at in order to auto-scale them?
Hard to say; perhaps you could measure the CPU and network utilization of the underlying instances? That's how e.g. AWS seems to trigger autoscaling policies.
Production - 2021.07.06
Testing - 2022.04.14
In production env, we are not getting these errors. And it's running perfectly (for now - we had the same workload a few days ago with 12 executions in parallel).
This error actually implies that something may be wrong in your Hasura setup, since the testing version is using Hasura 2.0. I'd definitely cross-check that. I've added some ideas for debugging similar issues here - perhaps those can help you too. Keep us posted on how it goes!
cc @Ron Meshulam
r
OK, thanks a lot for all the help and guidance. I'll update you as soon as possible.
👍 1