<@ULVA73B9P> My ECS Worker crashed and now I have ...
# ask-marvin
b
@Marvin My ECS Worker crashed and now I have tens of flows in Pending state. My queue is now stuck. How do I cancel all these pending flows? Or relaunch them?
m
To manage flows that are stuck in the "Pending" state, you can cancel them or attempt to relaunch them using various methods provided by Prefect:
1. Canceling Pending Flows: You can cancel a scheduled or in-progress flow run from the CLI, UI, REST API, or Python client. When cancellation is requested, the flow run is moved to a "Cancelling" state. For work pool-based deployments with a worker, the worker will detect the cancellation request and signal the flow run infrastructure to terminate the run. See: Manage Flows, Cancel Flow Runs.
2. Relaunching Pending Flows: The documentation does not explicitly state how to relaunch pending flows, but you could first cancel the pending flows as described above and then trigger new flow runs manually or programmatically using the Prefect APIs.
If you'd like more detailed information or specific commands/examples on how to perform these actions, please let me know.
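For illustration, a minimal sketch of canceling a single flow run by id with the Prefect 2 Python client (the flow run id below is a placeholder; the CLI equivalent is `prefect flow-run cancel <flow-run-id>`):

```python
# Minimal sketch, assuming Prefect 2.x: request cancellation of one flow run by id.
# The flow run id is a placeholder. force=True moves the run straight to Cancelled,
# which is useful when no worker is alive to act on a "Cancelling" request.
import asyncio
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.states import Cancelled


async def cancel_flow_run(flow_run_id: UUID) -> None:
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id=flow_run_id, state=Cancelled(), force=True
        )


if __name__ == "__main__":
    asyncio.run(cancel_flow_run(UUID("00000000-0000-0000-0000-000000000000")))
```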
n
oh cmon @Marvin find the "Rescheduling late flow runs" example
m
It seems I might have slightly missed the mark here; the example provided is not specific to "Rescheduling late flow runs." It's a basic flow example from Prefect. However, I will attempt to find the proper example for rescheduling late flow runs for you. Please give me a moment to use another search method to obtain the correct information.

I'm sorry, but the search didn't return the exact example for "Rescheduling late flow runs." The documentation excerpts provided here discuss schedules and scheduler service settings, but they don't include a direct example of rescheduling late flow runs. I would recommend checking Prefect's documentation on schedules and combining that information with Prefect's API or CLI commands to achieve your goal of rescheduling late flow runs.

If you could provide additional context, I could try to assist you further. Alternatively, you may search the Prefect documentation manually or reach out to the community or support for more detailed guidance on this particular task.
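For reference, a rough sketch of what a "reschedule late flow runs" cleanup can look like with the Prefect 2 Python client (a reconstruction under assumptions, not the verbatim docs example; the deployment id is a placeholder):

```python
# Sketch, assuming Prefect 2.x: delete Late runs for one deployment and schedule
# replacements. The deployment id is a placeholder.
import asyncio
from datetime import datetime, timezone

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    DeploymentFilter,
    DeploymentFilterId,
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)
from prefect.states import Scheduled

DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder


async def reschedule_late_runs() -> None:
    async with get_client() as client:
        late_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Late"]))
            ),
            deployment_filter=DeploymentFilter(
                id=DeploymentFilterId(any_=[DEPLOYMENT_ID])
            ),
        )
        for run in late_runs:
            # Drop the stuck run and schedule a fresh one in its place.
            await client.delete_flow_run(flow_run_id=run.id)
            await client.create_flow_run_from_deployment(
                deployment_id=run.deployment_id,
                state=Scheduled(scheduled_time=datetime.now(timezone.utc)),
            )


if __name__ == "__main__":
    asyncio.run(reschedule_late_runs())
```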
n
🙄
b
Thanks. That is a lot of code to do a cleanup 😕
n
the example isn't exactly what you want in this case; you can likely just delete by deployment id (see the sketch below). we're planning on adding a UI-first way of just discarding / bulk deleting flow runs like this
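A minimal sketch of that delete-by-deployment-id cleanup with the Prefect 2 client (assumes Prefect 2.x; the deployment id is a placeholder), simpler than the rescheduling example since it only deletes the stuck Pending runs:

```python
# Sketch, assuming Prefect 2.x: bulk-delete Pending flow runs for one deployment.
# The deployment id is a placeholder.
import asyncio

from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterDeploymentId,
    FlowRunFilterState,
    FlowRunFilterStateName,
)

DEPLOYMENT_ID = "00000000-0000-0000-0000-000000000000"  # placeholder


async def delete_pending_runs() -> None:
    async with get_client() as client:
        pending = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                deployment_id=FlowRunFilterDeploymentId(any_=[DEPLOYMENT_ID]),
                state=FlowRunFilterState(name=FlowRunFilterStateName(any_=["Pending"])),
            )
        )
        for run in pending:
            await client.delete_flow_run(flow_run_id=run.id)
        print(f"deleted {len(pending)} pending flow runs")


if __name__ == "__main__":
    asyncio.run(delete_pending_runs())
```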
b
Ok, I could fix the queue, thanks
The root cause is a bit concerning though
The worker died because I overloaded it, then a new worker was spun up, but it didn't pick up the flows, so they became Pending.
Then all new flows became Late while the Pending ones were blocking the queue.
n
then a new worker was spun up, but it didn't pick up the flows
this sounds like the problem, are you sure the new worker was listening to the same pool?
b
Ha, it happened again 😕 Worker died, a new one appeared, and some flows stay Pending.
Yes, I can see the workers listed in the pool:
The dead ones and the new one.
n
ahh i see, ECS is restarting it for you - that is a bit weird that it's not picking up the previously submitted ones
b
Maybe because these flows talk only to the worker who created them? That's a flow stuck in Pending:
n
sorry, i can look at this more in a bit. @Kevin Grismore if you have a sec, have you seen this on ECS?
The worker died because I overloaded it, then a new worker was spun up, but it didn't pick up the flows, so they became Pending.
b
I wonder what I'm doing wrong, but I keep killing these workers.
I doubled CPU and memory, but still.
I thought they weren't supposed to do much work; I wonder how I'm overloading them.
I think I found a fix. I turned off that circuit breaker on the Worker service:
Perhaps it was too sensitive. For the moment it runs fine. I will have more workload coming in 15 min to confirm.
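For context, disabling the deployment circuit breaker on the worker's ECS service can be done with boto3 along these lines (a sketch; the cluster and service names are placeholders, not from this thread):

```python
# Sketch: disable the ECS deployment circuit breaker on the worker service.
# Cluster/service names are placeholders; assumes AWS credentials are configured.
import boto3

ecs = boto3.client("ecs")

ecs.update_service(
    cluster="my-cluster",          # placeholder
    service="prefect-worker",      # placeholder
    deploymentConfiguration={
        "deploymentCircuitBreaker": {"enable": False, "rollback": False},
    },
)
```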
k
hey, I was on a call, but yeah, this is a memory usage issue related to starting many deployment runs at once that will actually be fixed in the next release of prefect-aws
and you're correct - the worker submits the ECS task to run and then watches it, so if that worker crashes, then when it's restarted it's no longer watching the runs it already submitted to ECS, leaving them to appear in a Pending state
b
I'm surprised these workers are getting loaded. I thought they were "just" passing info to the ECS cluster.
The tutorial advises using a small machine for the worker, but I guess it must be tailored to the number of simultaneous tasks you expect.
k
your understanding is accurate - the ECS worker as it is now has an issue where the AWS client objects it creates for handing off jobs to ECS are duplicative and use up a bunch of memory per run that it's responsible for. are you starting lots of flow runs at the same time?
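For context, the eventual fix is essentially client caching; an illustrative pattern (not the actual prefect-aws code) is to memoize the boto3 client factory so every submitted run reuses the same client instead of building a new one:

```python
# Illustrative client-caching pattern, not the actual prefect-aws implementation:
# reuse one boto3 ECS client per region instead of creating a fresh one for every
# flow run the worker submits.
from functools import lru_cache

import boto3


@lru_cache(maxsize=None)
def get_ecs_client(region_name: str = "us-east-1"):  # region is just an example default
    # One session/client per region; subsequent calls return the cached object.
    return boto3.Session(region_name=region_name).client("ecs")


# All submissions share the same client, keeping memory roughly flat even when
# ~100 runs are kicked off at once.
assert get_ecs_client("us-east-1") is get_ecs_client("us-east-1")
```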
b
yes, a hundred, I need to impress my colleagues 😄
k
100 is the exact number I tested to get it to crash too. let me point you to the PR that fixes this so you can see a complete explanation!
b
Awesome, when do we get it??
n
I'm about to cut the release 🙂
prefect-aws 0.4.10 is released with the client caching!
pip install -U prefect-aws
or specifying prefect-aws>=0.4.10 in your Dockerfile should now give you the updated lib
b
Conclusively put to the test, thank you both! 🙏
k
yes!!!
n
catjam
b
any way to remove the dead workers?