# ask-community
l
Hi all, I'm testing Prefect 3.1.12 using docker-compose and when I shut it down I end up with some flows that are stuck in the "Running" state forever. Specifically, I have some flows with long running tasks, and a separate process that runs them periodically as a deployment. I also have a process worker pulling from the work queue and running the jobs. When I start it all up everything works as expected, but if I shut it down while a flow is running and then start it back up again (which happens frequently during development), the running flow gets stuck in the Running state forever and eats up a spot in the concurrency limit, effectively reducing the concurrency of my system. Is there a way to ensure that when I shut down a worker, any tasks it was running go to the Crashed state or something like that?
a
Hey Lee! Great question — @Chris White and I were talking about this just a few days ago, so I'll let him weigh in here
c
Our current recommendation for these situations is to set up a zombie-killer automation as described here: https://docs.prefect.io/v3/automate/events/automations-triggers#detect-and-respond-to-zombie-flows It was actually an intentional design decision not to couple submitted work to worker health, as a form of fault tolerance, but honestly that decision doesn't make much sense for the Process Worker specifically because the work really is coupled; I think we can look at adding an attempt to cancel subprocesses when that particular worker shuts down gracefully, which would result in the outcome you're looking for. The best way to help us get there would be to file an enhancement request on GitHub so that you can also track the status of the work (feel free to copy / paste my response there), but I'll also add it as a TODO for myself
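For reference, a zombie-killer automation along the lines of that docs page can be defined in code roughly like this (a sketch assuming Prefect 3.x; the exact import paths and the 90-second window are assumptions you should check against the linked docs, and flow runs only emit heartbeats if `PREFECT_RUNNER_HEARTBEAT_FREQUENCY` is set, e.g. to 30 seconds):

```python
from datetime import timedelta

from prefect.automations import Automation
from prefect.client.schemas.objects import StateType
from prefect.events.actions import ChangeFlowRunState
from prefect.events.schemas.automations import EventTrigger, Posture
from prefect.events.schemas.events import ResourceSpecification

# Proactive trigger: for each flow run that has emitted at least one
# heartbeat, expect another heartbeat (or a terminal state event) within
# 90 seconds; if nothing arrives, fire the action.
zombie_killer = Automation(
    name="Crash zombie flow runs",
    trigger=EventTrigger(
        after={"prefect.flow-run.heartbeat"},
        expect={
            "prefect.flow-run.heartbeat",
            "prefect.flow-run.Completed",
            "prefect.flow-run.Failed",
            "prefect.flow-run.Cancelled",
            "prefect.flow-run.Crashed",
        },
        match=ResourceSpecification({"prefect.resource.id": ["prefect.flow-run.*"]}),
        for_each={"prefect.resource.id"},
        posture=Posture.Proactive,
        threshold=1,
        within=timedelta(seconds=90),  # assumed window; tune to ~3x heartbeat frequency
    ),
    # Move the silent flow run to Crashed so it frees its concurrency slot.
    actions=[
        ChangeFlowRunState(
            state=StateType.CRASHED,
            message="Flow run marked as crashed due to missing heartbeats.",
        )
    ],
)

if __name__ == "__main__":
    # Registers the automation with the Prefect API the client is pointed at.
    zombie_killer.create()
```

Running this once against your Prefect server registers the automation; after that, any flow run that stops heartbeating (e.g. because you killed the worker mid-run) gets moved to Crashed instead of sitting in Running and holding a concurrency slot.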
l
Thanks, that helps. So is this problem non-existent (or at least far less likely) with other kinds of workers, like kubernetes?
c
essentially yea, in the sense that the job the worker submits will continue to run to completion independently of whether the worker is still running or not. Other platforms have other "crash" modes (such as unexpected node evictions in k8s) but they should be much less likely than what you're running into
l
Got it. This gives me a workaround for now and I'll see about creating a ticket.
c
Thank you so much!