I'm not exactly sure what fixed it, but I've been handling this in several ways because different issues kept popping up.
1) As a safety net, I set up an automation that retries all failed or crashed flows.
2) I was running on ECS, and the image I was using would cause issues while waiting on multiple sockets. The error was "too many open files". This was with tasks running concurrently. I changed the task runner to Dask (nothing fancy, just a DaskTaskRunner with some adapt settings), and that fixed it.
3) I kept hitting race conditions, and runs would lock up without any logs while running tasks concurrently. This also went away when I switched to Dask.
So 2) and 3) are basically what fixed this (meaning the Dask task runner fixed it), and 1) helped me get there.
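For anyone who wants to try the same switch, here is a minimal sketch of what the Dask setup in 2) might look like. It assumes Prefect 2+ with the prefect-dask package installed. The task body, the function names (`adaptive_bounds`, `build_flow`, `fetch`, `concurrent_flow`) and the worker bounds of 1 and 4 are all made up for illustration and are not the settings from the original setup.

```python
def adaptive_bounds(minimum, maximum):
    """Validate and package worker bounds for Dask adaptive scaling."""
    if minimum < 0 or maximum < minimum:
        raise ValueError("need 0 <= minimum <= maximum")
    return {"minimum": minimum, "maximum": maximum}


def build_flow(minimum=1, maximum=4):
    """Return a Prefect flow whose tasks run on a temporary Dask cluster.

    The default bounds are placeholder values, not a recommendation.
    """
    # Third-party imports are deferred so this file loads without them.
    from prefect import flow, task
    from prefect_dask import DaskTaskRunner

    @task
    def fetch(item):
        # Hypothetical stand-in for the real socket-bound work.
        return item * 2

    # adapt_kwargs is passed through to the Dask cluster's adapt() call,
    # so the number of workers scales between the two bounds.
    runner = DaskTaskRunner(adapt_kwargs=adaptive_bounds(minimum, maximum))

    @flow(task_runner=runner)
    def concurrent_flow(items):
        # submit() hands each task to the Dask runner instead of
        # running it inline.
        futures = [fetch.submit(item) for item in items]
        return [future.result() for future in futures]

    return concurrent_flow
```

Calling `build_flow()([1, 2, 3])` would run the three tasks on a local Dask cluster that the runner creates and tears down for that flow run.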
I know I don't really say what caused the issue 😅
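If you want to narrow down the "too many open files" part yourself, one thing that might help is checking the open-file limit from inside the container. This is only a diagnostic sketch using Python's standard `resource` module (Unix only). The helper names are made up, and it is not something from the original setup.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft limit toward target, never above the hard limit.

    Returns the soft limit that is in effect afterwards.
    """
    soft, hard = nofile_limits()
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    if target <= soft:
        return soft  # never lower the current limit
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    soft, hard = nofile_limits()
    print(f"open-file limit: soft={soft}, hard={hard}")
```

A low soft limit combined with many concurrent sockets would be consistent with the error in 2), though that alone does not prove it was the cause.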