<@ULVA73B9P> My ECS Worker crashed and now I have ...
# ask-marvin
b
@Marvin My ECS Worker crashed and now I have tens of flows in Pending state. My queue is now stuck. How do I cancel all these pending flows? Or relaunch them?
m
To manage flows that are stuck in the "Pending" state, you can cancel them or attempt to relaunch them using various methods provided by Prefect:
1. Canceling Pending Flows: You can cancel a scheduled or in-progress flow run from the CLI, UI, REST API, or Python client. When cancellation is requested, the flow run is moved to a "Cancelling" state. For work pool-based deployments with a worker, the worker will detect the cancellation request and signal the flow run infrastructure to terminate the run. See: Manage Flows, Cancel Flow Runs.
2. Relaunching Pending Flows: The documentation does not explicitly state how to relaunch pending flows, but you could first cancel the pending flows as described above and then trigger new flow runs manually or programmatically using the Prefect APIs.
If you'd like more detailed information or specific commands/examples on how to perform these actions, please let me know.
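For illustration, a minimal sketch of canceling a single flow run by id with the Prefect 2 Python client (the flow run id below is a placeholder; the CLI equivalent is `prefect flow-run cancel <flow-run-id>`):

```python
# Minimal sketch, assuming Prefect 2.x: request cancellation of one flow run by id.
# The flow run id is a placeholder. force=True moves the run straight to Cancelled,
# which is useful when no worker is alive to act on a "Cancelling" request.
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.states import Cancelled


async def cancel_flow_run(flow_run_id: UUID) -> None:
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id=flow_run_id, state=Cancelled(), force=True
        )


if __name__ == "__main__":
    asyncio.run(cancel_flow_run(UUID("00000000-0000-0000-0000-000000000000")))
```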
n
oh cmon @Marvin find the "Rescheduling late flow runs" example
m
It seems I might have slightly missed the mark here; the example provided is not specific to "Rescheduling late flow runs." It's a basic flow example from Prefect. However, I will attempt to find the proper example for rescheduling late flow runs for you. Please give me a moment to use another search method to obtain the correct information.

I'm sorry, but the search didn't return the exact example for "Rescheduling late flow runs." The documentation excerpts provided here discuss schedules and scheduler service settings, but they don't include a direct example of rescheduling late flow runs. I would recommend checking Prefect's documentation on schedules and combining that information with Prefect's API or CLI commands to achieve your goal of rescheduling late flow runs.

If you could provide additional context, I could try to assist you further. Alternatively, you may search the Prefect documentation manually or reach out to the community or support for more detailed guidance on this particular task.
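For reference, a rough sketch of what a "reschedule late flow runs" cleanup can look like with the Prefect 2 Python client (a reconstruction under assumptions, not the verbatim docs example; the deployment id is a placeholder):

```python
# Sketch, assuming Prefect 2.x: delete Late runs for one deployment and schedule
# replacements. The deployment id is a placeholder.
import asyncio
from datetime import datetime, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    DeploymentFilter,
    DeploymentFilterId,
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
from prefect.states import Scheduled

DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder


async def reschedule_late_runs() -> None:
    async with get_client() as client:
        late_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Late"]))
            ),
            deployment_filter=DeploymentFilter(
                id=DeploymentFilterId(any_=[DEPLOYMENT_ID])
            ),
        )
        for run in late_runs:
            # Drop the stuck run and schedule a fresh one in its place.
            await client.delete_flow_run(flow_run_id=run.id)
            await client.create_flow_run_from_deployment(
                deployment_id=run.deployment_id,
                state=Scheduled(scheduled_time=datetime.now(timezone.utc)),
            )


if __name__ == "__main__":
    asyncio.run(reschedule_late_runs())
```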
n
🙄
b
Thanks. That is a lot of code to do a cleanup 😕
n
the example isn't exactly what you want in this case; you can likely just delete by deployment id (see the sketch below). we're planning on adding a UI-first way of just discarding / bulk deleting flow runs like this
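A minimal sketch of that delete-by-deployment-id cleanup with the Prefect 2 client (assumes Prefect 2.x; the deployment id is a placeholder), simpler than the rescheduling example since it only deletes the stuck Pending runs:

```python
# Sketch, assuming Prefect 2.x: bulk-delete Pending flow runs for one deployment.
# The deployment id is a placeholder.
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterDeploymentId,
    FlowRunFilterState,
    FlowRunFilterStateName,
)

DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder


async def delete_pending_runs() -> None:
    async with get_client() as client:
        pending = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                deployment_id=FlowRunFilterDeploymentId(any_=[DEPLOYMENT_ID]),
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Pending"])),
            )
        )
        for run in pending:
            await client.delete_flow_run(flow_run_id=run.id)
        print(f"deleted {len(pending)} pending flow runs")


if __name__ == "__main__":
    asyncio.run(delete_pending_runs())
```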
b
Ok, I could fix the queue, thanks
The root cause is a bit concerning though
The worker died because I overloaded it, then a new worker was spun up, but it didn't pick up the flows, so they became Pending.
Then all new flows became Late while the Pending ones were blocking the queue.
n
then a new worker was spun up, but it didn't pick up the flows
this sounds like the problem, are you sure the new worker was listening to the same pool?
b
Ha, it happened again 😕 Worker died, a new one appeared, and some flows stay Pending.
Yes, I can see the workers listed in the pool:
The dead ones and the new one.
n
ahh i see, ECS is restarting it for you - that is a bit weird that it's not picking up the previously submitted ones
b
Maybe because these flows talk only to the worker who created them? That's a flow stuck in Pending:
n
sorry, i can look at this more in a bit. @Kevin Grismore if you have a sec, have you seen this on ECS?
The worker died because I overloaded it, then a new worker was spun up, but it didn't pick up the flows, so they became Pending.
b
I wonder what I'm doing wrong, but I keep killing these workers.
I doubled CPU and memory, but still.
I thought they weren't supposed to do much work; I wonder how I'm overloading them.
I think I found a fix. I turned off that circuit breaker on the Worker service:
Perhaps it was too sensitive. For the moment it runs fine. I will have more workload coming in 15 min to confirm.
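For context, disabling the deployment circuit breaker on the worker's ECS service can be done with boto3 along these lines (a sketch; the cluster and service names are placeholders, not from this thread):

```python
# Sketch: disable the ECS deployment circuit breaker on the worker service.
# Cluster/service names are placeholders; assumes AWS credentials are configured.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-cluster",          # placeholder
    service="prefect-worker",      # placeholder
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": False, "rollback": False},
    },
)
```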
k
hey, I was on a call, but yeah, this is a memory usage issue related to starting many deployment runs at once that will actually be fixed in the next release of prefect-aws
and you're correct - the worker submits the ECS task to run and then watches it, so if that worker crashes, then when it's restarted it's no longer watching the runs it already submitted to ECS, leaving them to appear in a Pending state
b
I'm surprised these workers are getting loaded. I thought they were "just" passing info to the ECS cluster.
The tutorial advises using a small machine for the worker, but I guess it must be tailored to the number of simultaneous tasks you expect.
k
your understanding is accurate - the ECS worker as it is now has an issue where the AWS client objects it creates for handing off jobs to ECS are duplicative and use up a bunch of memory per run that it's responsible for. are you starting lots of flow runs at the same time?
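For context, the eventual fix is essentially client caching; an illustrative pattern (not the actual prefect-aws code) is to memoize the boto3 client factory so every submitted run reuses the same client instead of building a new one:

```python
# Illustrative client-caching pattern, not the actual prefect-aws implementation:
# reuse one boto3 ECS client per region instead of creating a fresh one for every
# flow run the worker submits.
from functools import lru_cache

import boto3


@lru_cache(maxsize=None)
def get_ecs_client(region_name: str = "us-east-1"):  # region is just an example default
    # One session/client per region; subsequent calls return the cached object.
    return boto3.Session(region_name=region_name).client("ecs")


# All submissions share the same client, keeping memory roughly flat even when
# ~100 runs are kicked off at once.
assert get_ecs_client("us-east-1") is get_ecs_client("us-east-1")
```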
b
yes, a hundred, I need to impress my colleagues 😄
k
100 is the exact number I tested to get it to crash too. let me point you to the PR that fixes this so you can see a complete explanation!
b
Awesome, when do we get it??
n
I'm about to cut the release 🙂
prefect-aws 0.4.10 is released with the client caching!
pip install -U prefect-aws
or specifying prefect-aws>=0.4.10 in your Dockerfile should now give you the updated lib
b
Conclusively put to the test, thank you both! 🙏
k
yes!!!
n
catjam
b
any way to remove the dead workers?