# ask-community
l
@alex starting a discussion thread for https://github.com/PrefectHQ/prefect/issues/18933
We are currently horizontally scaling the Prefect server, but we don't use Redis and we aren't super keen on configuring it. What do you think about some version of this? We could extend /get_scheduled_flow_runs (or add a new endpoint, e.g. /get_assigned_scheduled_flow_runs) so that it gives ownership of the returned runs to the polling worker and reassigns any runs held by a dead worker (identified by its heartbeat):
UPDATE flow_run
SET worker_id = 'xxx'
WHERE id IN (
    SELECT flow_run.id
    FROM flow_run
    LEFT JOIN worker_heartbeat
        ON flow_run.worker_id = worker_heartbeat.worker_id
    WHERE flow_run.status = 'scheduled'
       OR (flow_run.status = 'pending' AND worker_heartbeat.worker_is_alive = false)
    LIMIT 100
);
This has the added benefit of letting workers horizontally scale without stepping on each other's toes - something we are having trouble with in K8S.
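As a minimal sketch of that idea, assuming Postgres and the same hypothetical worker_id column as in the query above, the claim could be made atomic with FOR UPDATE SKIP LOCKED, so that concurrent pollers cannot select the same rows:
UPDATE flow_run
SET worker_id = 'xxx'
WHERE id IN (
    -- SKIP LOCKED makes concurrent pollers skip rows that another
    -- worker is claiming in a not-yet-committed transaction (Postgres).
    SELECT id
    FROM flow_run
    WHERE status = 'scheduled' AND worker_id IS NULL
    LIMIT 100
    FOR UPDATE SKIP LOCKED
)
RETURNING id;
Reassigning runs held by a dead worker (the worker_heartbeat check above) could then run as a separate periodic sweep instead of sitting in the hot polling path.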
a
I think you're going to run into issues eventually if you scale horizontally without using Redis. We use Redis for caching and event ordering when recording task run state, so you can run into DB deadlocks if you run multiple servers without Redis. I also think keeping this info in the DB will put a lot of strain on the DB at high volumes. The polling query is already expensive, so I'm wary of adding more to it. If I remember correctly, you're regularly running thousands of flows at once, and I don't think this will scale to that level. I really think that if you want to operate at that scale, you'll need to use Redis. Can you elaborate more on workers stepping on each other's toes?
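To illustrate the polling-cost concern: if an assignment column were added, the poll would probably also need an index along these lines. This is only a sketch against the hypothetical schema used earlier in the thread, not the actual Prefect tables:
-- Illustrative only: column names follow the hypothetical query above;
-- next_scheduled_start_time stands in for whatever column the poll sorts by.
CREATE INDEX ix_flow_run_unassigned_scheduled
    ON flow_run (next_scheduled_start_time)
    WHERE status = 'scheduled' AND worker_id IS NULL;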
l
I wasn't aware that Redis could help with DB load, so that makes for a compelling argument. We will work on setting that up this week and then I will proceed with implementing your proposal. As for the workers, we see logs like these. Essentially, multiple workers poll the server and get a similar set of flow runs to schedule, then compete to mark the flows as pending and submit them to K8s. The problem gets worse the more workers you have.
Worker 'KubernetesWorker 0effa744-0226-4eb0-b718-8f016beac43f' submitting flow run 'b21c392d-3f0a-4ede-b6d3-1756ffc54716'
Worker 'KubernetesWorker 6de4ecc3-90d7-45ef-9ca5-8df8bd924c1e' submitting flow run 'b21c392d-3f0a-4ede-b6d3-1756ffc54716'
Aborted submission of flow run 'b21c392d-3f0a-4ede-b6d3-1756ffc54716'. Server sent an abort signal: This run is in a PENDING state and cannot transition to a PENDING state.
Creating Kubernetes job...
@alex we tested a Redis setup for ourselves and we have some concerns. We did observe that CPU load on the DB was reduced by about a factor of two; however, the Redis instance was at 100% CPU usage even for a subset of our full workload (1.5K deployed flows). We tried using a Redis cluster, but we got CrossSlot exceptions since Prefect does not support sharded Redis. We are concerned that Redis is unable to scale to meet our demands with the current setup. Is our understanding correct, or are we missing something? If it is correct, using Redis to solve this ticket will probably not help us.
a
What's the size of the Redis instance that you're using?
l
We used the largest instance available on GCP (XLARGE_HIGHMEM), which is 8 cores and 58 GB of memory. Memory usage was extremely minimal by comparison.
@alex bump. any path forward? I would love to start work on the issue next week, but we would like some confidence that Redis is the right choice for us
a
That level of CPU load is surprising. What's the read/write volume look like when the Redis CPU is at 100%?
l
Screenshot 2025-10-07 at 2.36.58 PM.png
100:1 write to read
the first bump is 1/3 of our load, the second bump is our entire load
a
How many server replicas are you running? I'm wondering if we can improve our Redis connection handling to reduce some of the load.
l
We have autoscaling on, with each server allocated 1 CPU, and we can spin up to 200 servers/CPUs in one run.
a
Ah, the connection load doesn't seem that wild then. We could look into clustering support. We use Redis streams for events, and I'm not sure whether that's supported in clustering. All that being said, the execution time doesn't look too worrying despite being at max CPU.
l
I agree it performed quite well, but we are interested in scaling further past our current point 🙂 We are not Redis experts, so we are unsure what will happen as we keep pushing it. Based on some rudimentary research, it doesn't seem like clustering prohibits streams; one just needs to be careful about multi-stream reads. If clustering support were added, that would make us a lot more comfortable moving forward.
We have thought about this some more with the team. We are not comfortable running Redis at 100% CPU without any scaling guarantees, and it does not sound like clustering support is a roadmap priority any time soon. We have decided to pursue a solution outside of the framework, so I will unfortunately have to withdraw from implementing this fix at this time.
a
I definitely think it would be worthwhile to support Redis clustering for high-scale use cases like yours. Any chance you'd be willing to help enable cluster support for prefect-redis?
l
I would be interested in taking a look. As I understand it, there are a lot of "CrossSlot" errors when using a Redis cluster, which implies that many of the Redis queries would have to be rewritten so they don't span multiple nodes in the cluster, and that may be substantial work.
a
Yeah, the Lua scripts will probably require the most work. I might have some time to dig into that this week.