
    Lukas N.

    9 months ago
    Does `CloudFlowRunner` support graceful shutdown? We're running flows as Kubernetes jobs on EC2 spot instances, which get terminated from time to time. Let's say the job starts a pod `a` on a node that terminates. Kubernetes will quickly reschedule it and spawn a pod `b`, but `b` does nothing because the state of the tasks is `Running`, even though they are not (`a` is dead). We have to wait for the heartbeat to time out before Prefect reschedules them, which takes a long time. Instead, setting tasks that are `Running` to something like `Pending` or even `Failed` when a SIGTERM is received would be nice.
    Kevin Kho

    9 months ago
    Hey @Lukas N., just confirming you are on server?
    To answer the question though: Flows update their state by hitting the Prefect API. If the compute quite literally dies, there is no update being sent, so if a pod crashes there is nothing left that can hit the API and update the Flow.
    That said, we are aware of a related situation that is on the roadmap to be fixed: if you cancel your Flow from the UI, you can be left with pods or a cluster still turned on because the exit is not graceful. In that case, we may need something on the agent that is responsible for handling the shutdown, and that same mechanism might also be usable to change the state (not 100% sure). I think what we should do here is create a feature request and reference that other issue, as they may be related. Will take care of it.
    In the meantime, though, I am not sure there is anything to help you. If you are on Cloud, you might be able to turn off Version Locking for the Flow. In Server, we actually have complaints that work is duplicated (which I think is more expected, especially with Dask) because the work is restarted; Version Locking in Cloud stops it from re-running.

    Lukas N.

    9 months ago
    Yes, I'm on server. I'm slightly familiar with the state update through the API, and my issue is slightly different from the UI-cancel case. In my case the compute dies, but not suddenly. The situation I'd like is more like this dialogue:
    resource manager: Hey prefect flow run, I know you're here, but I need your CPU and memory resources, so I'm gonna terminate you in a bit. Don't worry, I'm gonna move you to a different machine. (sends SIGTERM)
    prefect flow run: OK, understandable. I'm gonna stop any computation and tell the API I'm no longer running these tasks. (terminates itself)
    resource manager: (starts another prefect flow run on a different machine)
    prefect flow run: (happily resumes computation)
    What happens now is more like this:
    resource manager: Hey prefect flow run, I need your CPU and memory resources, so I'm gonna terminate you in a bit and move you to a different machine. (sends SIGTERM)
    prefect flow run: (just dies, doesn't notify the API at all)
    resource manager: (starts another prefect flow run on a different machine)
    prefect flow run: Hey, this is confusing. The API says the tasks are running, but I'm not running anything. Better kill myself. (dies)
    Kevin Kho

    9 months ago
    That last bit is quite surprising; it should duplicate the work rather than do nothing. I understand what you are describing. I can open a ticket for this, but yeah, I don't think there's a workaround at the moment.
    Feel free to add details (you could copy-paste the description you outlined here).
    Anna Geller

    9 months ago
    @Lukas N. This process is a bit involved, but AWS sends a termination notice before a spot instance gets terminated. You could use it to run a Lambda function in response to this CloudWatch termination-notice event. The Lambda could, e.g., query for flow runs on this "spot agent" that have been in a Running state for suspiciously long, set their states to Cancelled, and for each of them create a new flow run on an agent that has resources (e.g. an agent on the same machine as your Server), passing a different agent label to the run configuration of create_flow_run so that each of those zombie flow runs gets restarted on the "more reliable" agent. It's a bit complex, but may be worth it if you use spot instances long term in your architecture. https://aws.amazon.com/blogs/compute/taking-advantage-of-amazon-ec2-spot-instance-interruption-notices/
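    [Editor's note: the Lambda idea above could look roughly like this. The event shape (`detail-type` of "EC2 Spot Instance Interruption Warning" from source `aws.ec2`) is the real EventBridge/CloudWatch event AWS emits about two minutes before reclaiming a spot instance; the Prefect-side follow-up is left as comments, since those calls depend on your Server setup, and `STALE_AFTER` is an assumed threshold.]

```python
from datetime import timedelta

# Hypothetical threshold: a flow run still "Running" this long on a
# reclaimed spot node is treated as a zombie.
STALE_AFTER = timedelta(minutes=10)

def is_spot_interruption(event):
    """True for the EC2 Spot Instance Interruption Warning event that AWS
    emits roughly two minutes before reclaiming a spot instance."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Spot Instance Interruption Warning"
    )

def handler(event, context=None):
    """Lambda entry point wired to the CloudWatch/EventBridge rule."""
    if not is_spot_interruption(event):
        return {"action": "ignored"}
    instance_id = event["detail"]["instance-id"]
    # Sketched follow-up (depends on your Prefect Server setup):
    #   1. query the GraphQL API for flow runs labelled with this spot
    #      agent that have been Running longer than STALE_AFTER,
    #   2. set their states to Cancelled,
    #   3. create a new flow run for each, passing a label that targets
    #      the "more reliable" agent.
    return {"action": "reschedule", "instance": instance_id}
```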

    Lukas N.

    9 months ago
    Thank you @Anna Geller for the info. I'll have a look at what calls the cancel button makes; I think I can use that as a base for this functionality. 👍