# prefect-community
a
Hello, I was hoping to get some clarification on which Kubernetes infra errors Prefect is able to handle gracefully. For example, this thread and this issue indicate that flows/tasks running on a pod that crashes are left stuck in a running state. In that case, which types of errors does the Lazarus service help with? And is there an equivalent, or any improvement in handling such failures, in Prefect 2? How can I ensure that there will be no issues with my workflow if I am using spot instances?
c
Generally, I wouldn’t really advise using spot instances for your data flows
spot instances are great for scaling web and stateless applications, but for stateful processes, you can just lose your work and results
a
But if we have checkpointing enabled and the flows/tasks are restarted upon pod failure, I assume we should be able to recover, right? Would you be able to shed some more light on the cases in which our workflows may fail entirely? For more context, these jobs are numerous but not super critical, so the occasional total failure is acceptable.
c
If you have checkpointing and caching set up, and you don’t need to persist the results locally to the pod at all, then I suppose that would be a valid solution
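To make the "results not persisted locally to the pod" idea concrete, here is a minimal, library-agnostic sketch. It is not Prefect's checkpointing API: `run_with_checkpoints` and `store_dir` are hypothetical names, and `store_dir` only stands in for durable storage outside the pod (an object store or shared volume in practice). Each step's result is written out as soon as it finishes, so a restarted run can skip the steps that already completed.

```python
import json
from pathlib import Path


def run_with_checkpoints(steps, store_dir):
    """Run (name, func) steps in order, skipping steps whose result is stored.

    Each func receives the previous step's result (None for the first step)
    and must return something JSON-serializable.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    results = {}
    previous = None
    for name, func in steps:
        path = store / f"{name}.json"
        if path.exists():
            # An earlier attempt finished this step: reuse its stored result.
            previous = json.loads(path.read_text())
        else:
            previous = func(previous)
            # Write to a temp file, then rename, so a crash mid-write
            # never leaves a half-written checkpoint behind.
            tmp = path.with_suffix(".tmp")
            tmp.write_text(json.dumps(previous))
            tmp.replace(path)
        results[name] = previous
    return results
```

If the pod dies during the second step, rerunning with the same `store_dir` picks up from the stored result of the first step instead of recomputing it. This only helps when the store outlives the pod, which is the condition mentioned above.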
As for jobs specifically getting stuck in a running state while the scheduler attempts to spin up a new one, that is still a work in progress for v2, based on those linked issues
There’s currently no easy way to mitigate this, as the infrastructure layer will spin up a new pod while the scheduler still considers the job to be running
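One generic way to spot this mismatch from the outside is a heartbeat check. The sketch below is an assumption for illustration, not how Prefect implements it: `RunRecord`, `find_stale_runs`, and the two-minute timeout are all hypothetical. It flags runs that the scheduler still lists as running but that have not reported in recently, so they can be marked failed or rescheduled by hand.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RunRecord:
    run_id: str
    state: str                # e.g. "Running", "Success", "Failed"
    last_heartbeat: datetime  # last time the run's pod reported in


def find_stale_runs(runs, now, timeout=timedelta(minutes=2)):
    """Return ids of runs still marked Running whose heartbeat is older than timeout."""
    return [
        r.run_id
        for r in runs
        if r.state == "Running" and now - r.last_heartbeat > timeout
    ]
```

A periodic job could call `find_stale_runs` with the current time and decide what to do with each flagged run. The timeout needs to be longer than any legitimate gap between heartbeats, otherwise healthy runs get flagged too.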
a
So if I understand correctly, the Lazarus service in Prefect 1 attempts to restart the flow in the event of a pod failure. If checkpointing is enabled, it will continue from the last task; otherwise it will start again from the first task. In the case where it is unable to do so, we will be left with a flow stuck in a running state. With Prefect 2, the agent itself tries to restart the flow. If that fails, we hit the same problem as above, where the flow is stuck in a running state. Could you confirm whether my understanding is correct?