# ask-community
Hi @Prefect devs: Are there any docs on scaling up the self-hosted solution? When I say scaling up I mean "I have 16 CPUs" type scaling up. I am also running a Postgres database on the same machine. I am using Prefect in an HPC setting, and I am noticing strange issues, among which are: a super slow UI, tasks finishing (i.e. reporting that they are now in the Complete() state) and then restarting, and extreme load on a single CPU on my server. I estimate that I have around 35 flows * 36 dask workers that could be running at any one point. Depending on how the data flows are triggered, these 35 flows might trickle in or start running all at once; it depends on what the SLURM scheduler does when allocating the main flow compute. I have found the uvicorn WEB_CONCURRENCY setting, which I have set along with the SQLAlchemy pool variables:
```shell
export WEB_CONCURRENCY=16
export PREFECT_SQLALCHEMY_POOL_SIZE=5
export PREFECT_SQLALCHEMY_MAX_OVERFLOW=10
```
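For what it's worth, here is my back-of-envelope math, assuming each uvicorn worker gets its own SQLAlchemy engine and pool (the per-process reading, which is exactly what I'm unsure about):

```shell
# Back-of-envelope only; assumes the pool settings apply per uvicorn
# worker process, which I have not confirmed.
WEB_CONCURRENCY=16
POOL_SIZE=5
MAX_OVERFLOW=10
PER_WORKER=$((POOL_SIZE + MAX_OVERFLOW))    # up to 15 connections per worker
TOTAL=$((WEB_CONCURRENCY * PER_WORKER))     # 240 across the whole server
echo "worst-case Postgres connections: $TOTAL"
# Postgres ships with max_connections = 100 by default, so a worst case
# of 240 would explain running out of clients under load.
```

If that reading is right, the defaults can exceed a stock Postgres install on their own, before counting anything else connecting to the database.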
My problem now though is that it is pretty easy to make the Prefect server crash with an error about an excessive number of clients (this one is actually raised by Postgres):
```
sorry, too many clients already
```
I can mitigate this a little by slowing down the number of flows submitted to the SLURM scheduler - this is OK so long as my jobs are instantly allocated resources. But once my requests hit the SLURM scheduler I have no control over when and how many are started simultaneously. So I am left trying to understand the interplay and best practices around these database clients and the variables above. I am not sure whether POOL_SIZE / MAX_OVERFLOW are per-process settings (since uvicorn starts a number of workers to handle the incoming REST API traffic) or whether they are global across all processes. The latter doesn't make much sense to me, but at this point I am questioning everything. Are there any devs or community members who have insights on these variables, their interplay, and this 'too many clients already' issue?
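If the per-process reading is right, one workaround on my end would be raising the database-side ceiling (or putting pgbouncer in front of Postgres). A postgresql.conf fragment for illustration only - these values are assumptions, not tested recommendations:

```
# postgresql.conf (illustrative values; tune for available RAM)
max_connections = 300    # must exceed worst-case clients:
                         # workers * (pool_size + max_overflow) + everything else
```

You can check the current ceiling with `psql -c "SHOW max_connections;"` (the Postgres default is 100).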