# ask-community
Are there any supported ways to 1 - limit the rate tasks are created? 2 - apply some rate limiting to the Prefect (self-hosted) server?

Why I ask: for (1) I have a flow that creates on the order of 22k tasks through a series of `.map` submissions in a loop. I found that without imposing my own `sleep` in the for loop, `anyio` starts to get unstable and issues errors around unmanageable TaskGroups. The `sleep` has outright eliminated this issue, but I have no idea what other side effects it has, or whether there are simply better ways.

For (2) I am scaling the flow up to 1.5k workers. I am at an HPC facility and really trying to make a point with some other groups. From a file system I/O perspective it looks like the load would support 5k dask workers (on the order of 200 nodes), but it seems like my current Prefect server cannot handle the load (8 cores / 64 GB, though it is the cores that are the problem). I have no idea about the model Prefect uses to communicate with the server over its RESTful API, so this question may not make sense. But is there a way to rate limit communication to the server? Just to help smooth out the peak load a little.
For example, this is the sort of error that appears when (1) happens:
```
01:09:08.281 | ERROR   | Flow run 'furry-mammoth' - Crash detected! Execution was interrupted by an unexpected exception: ExceptionGroup: unhandled errors in a TaskGroup (42 sub-exceptions)
```
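For reference, here is a minimal sketch of the pattern being described: `.map` submissions in a loop with a manual `sleep` stagger. The flow and task names, the batching, and the two-second pause are illustrative placeholders, not the actual 22k-task workflow.

```python
import time

from prefect import flow, task


@task
def process_item(item):
    # placeholder for the real per-item work
    return item


@flow
def big_fanout_flow(batches):
    futures = []
    for batch in batches:
        # one .map call per batch; across the whole loop this creates
        # tens of thousands of task runs
        futures.extend(process_item.map(batch))
        # manual stagger: without a pause like this, anyio TaskGroups were
        # reported to become unstable at this scale
        time.sleep(2)
    return [future.result() for future in futures]
```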
Hey @Tim Galvin, did you look into the concurrency limits? https://docs.prefect.io/v3/develop/global-concurrency-limits
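A rough sketch of how a global concurrency limit could be applied to the submission loop itself, assuming a limit named "task-submission" has already been created (for example via the UI, or with something like `prefect gcl create task-submission --limit 10 --slot-decay-per-second 1.0`; check the linked docs for the exact options). The names and numbers are illustrative.

```python
from prefect import flow, task
from prefect.concurrency.sync import rate_limit


@task
def process_item(item):
    return item


@flow
def rate_limited_fanout(items):
    futures = []
    for item in items:
        # blocks until the pre-created "task-submission" global concurrency
        # limit has a free slot, throttling how fast task runs are created
        rate_limit("task-submission")
        futures.append(process_item.submit(item))
    return [future.result() for future in futures]
```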
Hi Tom - happy new year and thanks for checking in. I'd love to chat and try to rubber ducky this a little with folks more familiar with the internals. I toyed around with this problem before Christmas so it is a little foggy, and I have been meaning to raise a GitHub issue around it. In short, I have not used the concurrency limiters. Since the initial question I found the issue was really around the initial task registration. Once all tasks have been registered and submitted for execution the system is pretty stable, and I can scale up to ~2k dask workers pretty reliably. As I understand it, these concurrency limits are about managing the running state of tasks, not so much task creation.

I went digging through the Prefect code and found that if I increase the httpx maximum number of connections, my rate of task submission increases significantly, which outright avoids the problem I was seeing. There was also some effect from the `PREFECT_API_REQUEST_TIMEOUT` setting. I ran some tests and saw some fun behaviour when I examined the timestamp of each "Created task my_task-123" logged output line for my workflow. My workflow had a tweakable `sleep` stagger step to try to control the rate that tasks were created with the server. The dash-dash line represents the 'best we can do' given this stagger delay, and the solid line represents the actual time taken; if there were no overhead in submitting and registering each task, the solid line would match the dash-dash one. On the whole, more httpx connections made the system more stable for me (i.e. no TaskGroup exceptions) and submitted things 5x faster. When the max connections was too small and/or the timeout was too low, I think some attempts to register a task time out and are reissued after a cooldown period. This is what I think is happening in the first plot.
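To make those two knobs concrete: `PREFECT_API_REQUEST_TIMEOUT` is an existing Prefect setting that can simply be raised, while the httpx connection pool is only directly configurable on clients you construct yourself via `get_client(httpx_settings=...)`; the client the flow/task engine builds internally is not configured this way, which is presumably why digging into the Prefect code was needed. The values below are arbitrary examples, not recommendations.

```python
import asyncio
import os

import httpx
from prefect import get_client

# give slow task-run creation calls more headroom before they time out
# (120 seconds is just an example value)
os.environ["PREFECT_API_REQUEST_TIMEOUT"] = "120"


async def main():
    # for a client constructed explicitly, the httpx connection pool can be
    # enlarged like this; the engine's own internal client is not affected,
    # so raising its pool size may require patching Prefect itself
    async with get_client(
        httpx_settings={
            "limits": httpx.Limits(max_connections=64, max_keepalive_connections=32)
        }
    ) as client:
        print(await client.hello())


asyncio.run(main())
```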
That's interesting. Our use case was to avoid hitting rate limits on APIs that our tasks were using, so we use the Prefect concurrency limits to accomplish this. Unfortunately, when things go wrong the semaphore sometimes doesn't get released, which I believe we worked around by setting a timeout.
So you kind of have the opposite problem, in that you need to limit the rate at which your tasks are created.
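A sketch of that workaround as I understand it, assuming the `concurrency` context manager from `prefect.concurrency.sync` and a pre-created "external-api" limit; the `timeout_seconds` argument and the 300-second value are assumptions about how the timeout was applied, not a description of the actual setup.

```python
from prefect import flow, task
from prefect.concurrency.sync import concurrency


@task
def call_external_api(payload):
    # hold one slot on the "external-api" global concurrency limit while
    # talking to the rate-limited service; timeout_seconds bounds how long
    # we wait for a slot, so a slot that was never released after a crash
    # cannot block the task forever
    with concurrency("external-api", occupy=1, timeout_seconds=300):
        ...  # perform the actual request here
        return payload


@flow
def api_flow(payloads):
    futures = [call_external_api.submit(p) for p in payloads]
    return [future.result() for future in futures]
```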
That's right. Having a stable system here could be achieved by controlling the rate tasks are created, though that may come at the cost of longer workflow runs. In my circumstance I am running code on an HPC on a problem that is highly parallelisable, and tasks have a normal runtime of 15 to 5 minutes in most cases. So, for me, I kind of want to understand the root cause of this instability when creating tasks, for the simple reason that the adaptive dask mode could kill workers when the rate of task creation is lower than the rate of task completion! Of course I could also consider reducing the number of individual tasks by packing more work together into one.
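For completeness, a small sketch of that last option: chunking the inputs so each task run covers a batch of items, which cuts down how many task runs need to be registered with the server. The batch size and names are illustrative.

```python
from prefect import flow, task


def chunked(items, size):
    # split the full work list into fixed-size batches
    return [items[i : i + size] for i in range(0, len(items), size)]


@task
def process_batch(batch):
    # one task run now covers many work items, so far fewer task runs need
    # to be registered with the server
    return [item for item in batch]  # placeholder for the real per-item work


@flow
def packed_fanout(items, batch_size=50):
    # e.g. 22k items at batch_size=50 becomes ~440 task runs instead of 22k
    futures = process_batch.map(chunked(items, batch_size))
    return [future.result() for future in futures]
```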