Hello, we are using self-hosted Prefect (2.x) on AWS EKS and want to use spot instances to reduce costs. We are trying to figure out how to handle spot instance interruptions. Ideally, we would like to resubmit/rerun the flow.
How are others handling interrupts?
I have tried the following without success:
1. on_crashed, on_failure hooks.
2. listening for SIGTERM
Neither of the above two actually triggers the target function.
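For reference, here is a minimal sketch of what "listening for SIGTERM" (item 2) can look like in plain Python. It is not the exact code that was tried and it is not a Prefect API: `SpotInterruption`, `run_with_sigterm_guard`, and the `on_interrupt` callback are hypothetical names used only for illustration. It assumes a Unix-like container where the signal actually reaches the Python process.

```python
import os
import signal


class SpotInterruption(Exception):
    """Hypothetical exception raised when SIGTERM arrives."""


def _raise_on_sigterm(signum, frame):
    # Turn the signal into an exception so normal try/except cleanup can run.
    raise SpotInterruption(f"received signal {signum}")


def run_with_sigterm_guard(work, on_interrupt):
    """Run work(); call on_interrupt() if SIGTERM arrives while it runs.

    Must be called from the main thread, because Python only allows
    signal handlers to be installed there.
    """
    previous = signal.signal(signal.SIGTERM, _raise_on_sigterm)
    try:
        return work()
    except SpotInterruption:
        # Assumption: this is where a resubmit/rerun request would go.
        on_interrupt()
        return None
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)
```

A handler like this only fires if SIGTERM is delivered to this process and the process gets time to react before it is killed, so it may help to check whether the signal reaches the flow-run process at all when the node is reclaimed.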
Alessio Civitillo
11/10/2024, 1:05 PM
Have you considered AKS with autoscaling?
Peter Peter
11/11/2024, 4:06 PM
Thanks for the response. We are an AWS shop, and I am sure that spot instances get interrupted in Azure as well. We are using autoscaling with EKS and have no issues with that. The issue is that when a spot instance gets interrupted, the flow runs get terminated. Originally, we thought Prefect would rerun the flows when spot instances are interrupted, but this is not the case. This led me to look into SIGTERM/SIGINT to try to handle the interrupts. I am not sure how others are handling this case.
Alessio Civitillo
11/11/2024, 4:57 PM
My mistake, somehow I translated spot instances to AWS EC2 instances. We also use AWS, and I agree that the problem would also come up in Azure. We normally had interruptions on big events like a VM being down or a DNS issue. In those cases we did the reruns manually, but your case is different.