Tim Galvin
12/10/2024, 1:46 AM
.map submissions in a loop. I found that without imposing my own sleep in the for loop, anyio starts to get unstable and issues errors around unmanageable TaskGroups. The sleep has outright eliminated this issue, but I have no idea what other side effects this has, or whether there are simply better ways.
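For illustration, the staggered submission described above might look like the sketch below. It is a sketch under assumptions, not the code from this thread: `submit_staggered`, `batch_size` and `delay_s` are hypothetical names, and `submit` stands in for whatever call actually hands a batch to the task runner.

```python
import time


def submit_staggered(submit, items, batch_size=50, delay_s=0.5, sleep=time.sleep):
    """Submit items in batches, pausing between batches to smooth the load.

    submit: callable taking a list of items and returning an iterable of
            results or futures (hypothetical stand-in for the real call).
    sleep:  injectable so the pause can be replaced in tests.
    """
    items = list(items)
    results = []
    for start in range(0, len(items), batch_size):
        batch = items[start:start + batch_size]
        results.extend(submit(batch))
        # Pause only when another batch is still to come.
        if start + batch_size < len(items):
            sleep(delay_s)
    return results
```

In a Prefect flow, `submit` could plausibly be something like `lambda batch: my_task.map(batch)`, where `my_task` is a hypothetical task. Batching plus a fixed pause only caps the average submission rate; it does not react to how busy the server actually is.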
For (2) I am scaling the flow up to 1.5k workers. I am in an HPC facility and really trying to make a point with some other groups. From a file system I/O perspective it looks like the load would support 5k dask workers (on the order of 200 nodes), but it seems like my current prefect server cannot handle the load (8 cores/64GB, though it is the cores that are the problem). I have NO IDEA what model prefect uses to communicate with the server over its RESTful API, so this question may not make sense. But is there a way to rate limit communication to the server? Just to help smooth out the peak load a little.
Tim Galvin
12/10/2024, 2:10 PM
01:09:08.281 | ERROR | Flow run 'furry-mammoth' - Crash detected! Execution was interrupted by an unexpected exception: ExceptionGroup: unhandled errors in a TaskGroup (42 sub-exceptions)
Tom Jordahl
01/06/2025, 4:12 PM
Tim Galvin
01/06/2025, 11:10 PM
PREFECT_API_REQUEST_TIMEOUT.
I ran some tests and saw some fun behaviour when I examined the timestamp of each "Created task my_task-123" log line for my workflow. My workflow had a tweakable 'sleep' stagger step to try to control the rate at which tasks were created with the server. The dashed line represents the 'best we can do' considering this stagger delay. The solid line represents the actual time taken. If there were no overhead in submitting and registering a task, the solid line would match the dashed line.
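The comparison between the 'best we can do' line and the actual line can be sketched roughly as follows. This is an illustration under assumptions: `submission_overhead` is a hypothetical helper, the timestamps are assumed to use the HH:MM:SS.mmm format seen in the log lines, one stagger delay is assumed per task, and a run that crosses midnight is not handled.

```python
from datetime import datetime


def submission_overhead(timestamps, stagger_s):
    """Return (ideal, actual, overhead) elapsed seconds for each task.

    timestamps: "Created task" log times as HH:MM:SS.mmm strings, in order.
    stagger_s:  the sleep applied between task submissions, in seconds.
    """
    times = [datetime.strptime(t, "%H:%M:%S.%f") for t in timestamps]
    t0 = times[0]
    rows = []
    for i, t in enumerate(times):
        ideal = i * stagger_s                # best case: stagger delay only
        actual = (t - t0).total_seconds()    # what the logs actually show
        rows.append((ideal, actual, actual - ideal))
    return rows
```

Plotting the `ideal` and `actual` columns against the task index gives the two lines being compared; the third column is the per-task overhead that the stagger alone does not explain.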
On the whole, more httpx connections made the system more stable for me (i.e. no TaskGroup exceptions) and submitted things 5x faster.
When the max connections value was too small and/or the timeout was too low, I think some attempts to register the task timed out and were reissued after a cooldown period. This is what I think is happening in the first plot.
Tom Jordahl
01/06/2025, 11:40 PM
Tom Jordahl
01/06/2025, 11:41 PM
Tim Galvin
01/07/2025, 12:05 AM