
    Lukas N.

    9 months ago
    Does `CloudFlowRunner` support graceful shutdown? We're running flows as Kubernetes jobs on EC2 spot instances, which get terminated from time to time. Let's say the job starts a pod `a` on a node that terminates. Kubernetes will quickly reschedule it and spawn a pod `b`, but `b` does nothing because the state of the tasks is `Running`, even though they are not (`a` is dead). We have to wait for the heartbeat to time out before Prefect reschedules them, which takes a long time. Instead, setting tasks that are `Running` to something like `Pending` or even `Failed` when a SIGTERM is received would be nice.
    Kevin Kho

    9 months ago
    Hey @Lukas N., just confirming you are on server?
    To answer the question though: Flows update their state by hitting the Prefect API. If the compute quite literally dies, there is no update being sent, so if a pod crashes there is nothing left that can hit the API and update the Flow.
    That said, we are aware of a related situation that is on the roadmap to be fixed: if you cancel your Flow from the UI, you can be left with pods or a cluster still turned on because the exit is not graceful. In that case, we may need something on the agent that is responsible for handling the shutdown, and that same mechanism might also be usable to change the state (not 100% sure). I think what we should do here is create a feature request and reference that other issue, as they may be related. Will take care of it.
    In the meantime, though, I am not sure there is anything to help you. If you are on Cloud, you might be able to turn off Version Locking for the Flow. In Server, we actually have complaints that work is duplicated (which I think is more expected, especially with Dask) because the work is restarted; Version Locking in Cloud stops it from re-running.

    Lukas N.

    9 months ago
    Yes, I'm on server. I'm slightly familiar with the state update through the API, and my issue is slightly different from the UI-cancel case. In my case the compute dies, but not suddenly. The situation I'd like is more like this dialogue:
    resource manager: Hey prefect flow run, I know you're here, but I need your CPU and memory resources, so I'm gonna terminate you in a bit. Don't worry, I'm gonna move you to a different machine. (sends SIGTERM)
    prefect flow run: OK, understandable. I'm gonna stop any computation and tell the API I'm no longer running these tasks. (terminates itself)
    resource manager: (starts another prefect flow run on a different machine)
    prefect flow run: (happily resumes computation)
    What happens now is more like this:
    resource manager: Hey prefect flow run, I need your CPU and memory resources, so I'm gonna terminate you in a bit and move you to a different machine. (sends SIGTERM)
    prefect flow run: (just dies, doesn't notify the API at all)
    resource manager: (starts another prefect flow run on a different machine)
    prefect flow run: Hey, this is confusing. The API says the tasks are running, but I'm not running anything. Better kill myself. (dies)
    Kevin Kho

    9 months ago
    That last bit is quite surprising; it should duplicate the work rather than do nothing. I understand what you are describing. I can open a ticket for this, but yeah, I don't think there's a workaround at the moment.
    Feel free to add details (you could copy-paste the description you outlined here).
    Anna Geller

    9 months ago
    @Lukas N. This process is a bit involved, but AWS sends a termination notice before a spot instance gets terminated. You could use it to run a Lambda function in response to this CloudWatch termination-notice event. The Lambda could, e.g., query for flow runs on this "spot agent" that have been in a Running state for suspiciously long, set their states to Cancelled, and for each of them create a new flow run on an agent that has resources (e.g. an agent on the same machine as your Server), passing a different agent label to the run configuration of create_flow_run so that each of those zombie flow runs gets restarted on the "more reliable" agent. It's a bit complex, but may be worth it if you use spot instances long term in your architecture. https://aws.amazon.com/blogs/compute/taking-advantage-of-amazon-ec2-spot-instance-interruption-notices/
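    [Editor's note: the Lambda idea above could look roughly like this. The event shape (`detail-type` of "EC2 Spot Instance Interruption Warning" from source `aws.ec2`) is the real EventBridge/CloudWatch event AWS emits about two minutes before reclaiming a spot instance; the Prefect-side follow-up is left as comments, since those calls depend on your Server setup, and `STALE_AFTER` is an assumed threshold.]

```python
from datetime import timedelta

# Hypothetical threshold: a flow run still "Running" this long on a
# reclaimed spot node is treated as a zombie.
STALE_AFTER = timedelta(minutes=10)

def is_spot_interruption(event):
    """True for the EC2 Spot Instance Interruption Warning event that AWS
    emits roughly two minutes before reclaiming a spot instance."""
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EC2 Spot Instance Interruption Warning"
    )

def handler(event, context=None):
    """Lambda entry point wired to the CloudWatch/EventBridge rule."""
    if not is_spot_interruption(event):
        return {"action": "ignored"}
    instance_id = event["detail"]["instance-id"]
    # Sketched follow-up (depends on your Prefect Server setup):
    #   1. query the GraphQL API for flow runs labelled with this spot
    #      agent that have been Running longer than STALE_AFTER,
    #   2. set their states to Cancelled,
    #   3. create a new flow run for each, passing a label that targets
    #      the "more reliable" agent.
    return {"action": "reschedule", "instance": instance_id}
```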

    Lukas N.

    9 months ago
    Thank you @Anna Geller for the info. I'll have a look at what calls the cancel button makes; I think I can use that as a base for this functionality. 👍