# ask-community
l
@alex starting a discussion thread for https://github.com/PrefectHQ/prefect/issues/18933
We currently are horizontally scaling the Prefect server, but we don't use Redis and we aren't super keen on configuring it. What do you think about some version of this? /get_scheduled_flow_runs (or a new endpoint, i.e. /get_assigned_scheduled_flow_runs) would give ownership of the returned runs to the polling worker and reassign any flow runs held by a dead worker (identified by the heartbeat):
-- Claim up to 100 runs for this worker: scheduled runs, plus pending runs whose worker has stopped heartbeating
UPDATE flow_run
SET worker_id = 'xxx'
WHERE id IN (
    SELECT flow_run.id FROM flow_run
    LEFT JOIN worker_heartbeat ON flow_run.worker_id = worker_heartbeat.worker_id
    WHERE flow_run.status = 'scheduled'
       OR (flow_run.status = 'pending' AND worker_heartbeat.worker_is_alive = false)
    LIMIT 100
);
This has the added benefit of letting workers horizontally scale without stepping on each other's toes - something we are having trouble with in K8S.
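Roughly, I'm imagining something like the sketch below. It assumes a FastAPI-style handler; the endpoint name, the worker_heartbeat table, and the worker_id column are hypothetical and not part of Prefect's current API. The FOR UPDATE OF flow_run SKIP LOCKED is there so two server replicas answering polls at the same time can't hand out the same rows.

from fastapi import FastAPI
from sqlalchemy import text
from sqlalchemy.ext.asyncio import create_async_engine

app = FastAPI()
# Placeholder connection string; the real server would reuse its existing engine.
engine = create_async_engine("postgresql+asyncpg://prefect@localhost/prefect")

# Claim a batch of runs for this worker: scheduled runs, plus pending runs whose
# previous worker has stopped heartbeating. FOR UPDATE OF flow_run SKIP LOCKED
# keeps two concurrent claims from grabbing the same rows.
CLAIM_RUNS = text(
    """
    UPDATE flow_run
    SET worker_id = :worker_id
    WHERE id IN (
        SELECT flow_run.id
        FROM flow_run
        LEFT JOIN worker_heartbeat
               ON flow_run.worker_id = worker_heartbeat.worker_id
        WHERE flow_run.status = 'scheduled'
           OR (flow_run.status = 'pending'
               AND worker_heartbeat.worker_is_alive = false)
        LIMIT :batch_size
        FOR UPDATE OF flow_run SKIP LOCKED
    )
    RETURNING id
    """
)

@app.post("/get_assigned_scheduled_flow_runs")
async def get_assigned_scheduled_flow_runs(worker_id: str, batch_size: int = 100):
    # Return only the runs this worker actually claimed.
    async with engine.begin() as conn:
        result = await conn.execute(
            CLAIM_RUNS, {"worker_id": worker_id, "batch_size": batch_size}
        )
        return {"flow_run_ids": [str(row.id) for row in result]}

A worker would then call POST /get_assigned_scheduled_flow_runs?worker_id=... on each poll and only submit the runs it got back.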
a
I think you're going to run into issues eventually if you scale horizontally without using Redis. We use Redis for caching and event ordering when recording task run state, so you can run into DB deadlocks if you run multiple servers without Redis. I think keeping this info in the DB will put a lot of strain on the DB at high volumes. The polling query is already expensive, so I'm wary of adding more to it. If I remember correctly, you're regularly running thousands of flows at once, so I don't think this will scale to that level. I really think that if you want to operate at that scale, you'll need to use Redis. Can you elaborate more about workers stepping on each other's toes?
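For reference, pointing the API servers at Redis is mostly a matter of setting the prefect-redis messaging settings on each replica. This is a rough sketch from memory, so treat the exact setting names and values as assumptions and double-check the prefect-redis docs:

# Environment variables (names recalled from memory; verify against prefect-redis docs)
# to make a self-hosted Prefect server use Redis for its messaging broker and cache.
REDIS_MESSAGING_ENV = {
    "PREFECT_MESSAGING_BROKER": "prefect_redis.messaging",
    "PREFECT_MESSAGING_CACHE": "prefect_redis.messaging",
    "PREFECT_REDIS_MESSAGING_HOST": "redis",  # hostname of your Redis service
    "PREFECT_REDIS_MESSAGING_PORT": "6379",
    "PREFECT_REDIS_MESSAGING_DB": "0",
}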
l
I wasn't aware that Redis could help with DB load, so that makes for a compelling argument. We will work on setting that up this week, and then I will proceed with implementing your proposal. For the workers, we see the logs below. Essentially, multiple workers poll the server and get a similar set of flow runs to schedule. They then compete to mark the flows as pending and submit them to K8S. This problem gets worse the more workers you have.
Worker 'KubernetesWorker 0effa744-0226-4eb0-b718-8f016beac43f' submitting flow run 'b21c392d-3f0a-4ede-b6d3-1756ffc54716'
Worker 'KubernetesWorker 6de4ecc3-90d7-45ef-9ca5-8df8bd924c1e' submitting flow run 'b21c392d-3f0a-4ede-b6d3-1756ffc54716'
Aborted submission of flow run 'b21c392d-3f0a-4ede-b6d3-1756ffc54716'. Server sent an abort signal: This run is in a PENDING state and cannot transition to a PENDING state.
Creating Kubernetes job...