# ask-community
a
Hi, I've been using Prefect with a KubernetesRun config on GCP for a few months now and it's been working well. Recently we created a new node pool that uses A100 GPUs, and they can sometimes take a while to become available (e.g. `The zone 'projects/<project>/zones/<zone>' does not have enough resources available to fulfill the request`), and we've seen the node pool take >10m to finish scaling an instance up even when resources are available. My understanding is that the Lazarus process treats a flow that hasn't started running after 10m as a failure and triggers another flow run. However, this is the behavior I've been seeing: flow run A starts -> zone is out of resources -> flow run B is created by Lazarus after 10m -> flow run C is created by Lazarus after 10m -> after 30m, Lazarus marks the flow as failed (`A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.`) -> the pods for runs A-C are still stuck in a Pending state on Kubernetes and have to be manually deleted.
Given this use case, would you recommend disabling the Lazarus process altogether for these flows? The ideal behavior for us would be for the flow to wait until an instance can be scaled up, even if it takes a few hours; it would also be nice if we could specify a time limit. Also, is it expected for there to be "zombie" Kubernetes pods/jobs left over in a case like this, and are there any recommended ways to deal with them? I'm not sure what would happen if resources suddenly became available after the Lazarus process stopped all the flows but before we found and manually cleaned up the pods - would they still run even though the flow has been failed in Prefect? Ideally, once a flow run is failed we'd like any pending pods/jobs for that run to be deleted automatically, but I'm not sure if that's possible.
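(For reference, a KubernetesRun config pinned to a GKE A100 node pool might look roughly like the sketch below; the image name, node selector, and GPU count are illustrative assumptions, not the poster's actual config.)

```python
# Sketch: a Prefect 1.x KubernetesRun pinned to a GKE A100 node pool.
# Image name, selector value, and GPU count are assumptions for illustration.
from prefect import Flow
from prefect.run_configs import KubernetesRun

job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "nodeSelector": {
                    # Standard GKE accelerator label for A100 nodes
                    "cloud.google.com/gke-accelerator": "nvidia-tesla-a100"
                },
                "containers": [
                    {
                        "name": "flow",  # matches the agent's default container name
                        "resources": {"limits": {"nvidia.com/gpu": 1}},
                    }
                ],
            }
        }
    },
}

flow = Flow("gpu-flow")  # placeholder flow
flow.run_config = KubernetesRun(
    image="gcr.io/<project>/my-gpu-flow:latest",  # hypothetical image
    job_template=job_template,
)
```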
I have some pods left over from a flow run that failed this way due to resources not being available; I'll try to see what happens to them if/when an instance becomes available.
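(The stuck pods can be found with a phase field selector; a minimal sketch with the Kubernetes Python client, where the `default` namespace is an assumption about where the agent submits jobs:)

```python
# Sketch: list pods stuck in Pending in the namespace the agent submits jobs to.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pending = v1.list_namespaced_pod("default", field_selector="status.phase=Pending")
for pod in pending.items:
    print(pod.metadata.name, pod.status.phase)
```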
k
So this is the first time I've encountered this situation. I think, yes, you might be better off turning off Lazarus, and then maybe try using an Automation to cancel the flow run past a certain period of time. I think the pods staying in Pending is expected, because Prefect hasn't even been able to start the flow in this case. If the flow run is failed or cancelled, these won't run: if the flow ever starts, it will see the Failed/Cancelled state when it hits the API. Deletion of these pods is currently not possible; there would probably need to be something at the agent level to handle it, which we don't have a concept of yet. I think this is better off as a GitHub issue.
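(Automations are configured from the Cloud UI, but as a rough sketch of the same "cancel runs that have waited too long" idea in code, assuming Prefect 1.x's `Client`, configured API credentials, and the standard GraphQL schema; the 20-minute cutoff is arbitrary:)

```python
# Rough sketch: cancel flow runs that have sat in Submitted/Scheduled for too long,
# so they won't execute if a GPU node finally becomes available.
from datetime import datetime, timedelta, timezone

from prefect import Client

client = Client()
cutoff = datetime.now(timezone.utc) - timedelta(minutes=20)  # arbitrary threshold

result = client.graphql(
    {
        "query": {
            'flow_run(where: {state: {_in: ["Submitted", "Scheduled"]}})': [
                "id",
                "state",
                "scheduled_start_time",
            ]
        }
    }
)

for run in result.data.flow_run:
    started = datetime.fromisoformat(run.scheduled_start_time.replace("Z", "+00:00"))
    if started < cutoff:
        # Marks the run Cancelled; the flow will see this state if it ever starts
        client.cancel_flow_run(flow_run_id=run.id)
```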
a
Thanks Kevin. I monitored the pods stuck in `Pending` and saw that a few hours later one showed as `Completed` and the others had a `CreateContainerError` - then they were removed from the pod list. At the same time the node pool autoscaled several instances up, and a few minutes later they scaled back down. It looks like the pods moved from `Pending` to a finished state pretty quickly after the autoscaling happened, which makes sense if they hit the API and saw they didn't need to be run.
The best-case scenario would be for the node pool not to autoscale up at all if the Prefect run is cancelled; I think we'd need to delete the pending pod for that. I can make a GitHub issue for this 👍
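(In the meantime, a manual cleanup could look something like the sketch below. It assumes the Kubernetes agent labelled its jobs with `prefect.io/flow_run_id` - worth verifying on your agent version - and that jobs run in the `default` namespace; deleting the job with foreground propagation removes its pending pod as well, so the autoscaler has nothing left to scale up for.)

```python
# Sketch: delete the Kubernetes job (and its pending pod) for a failed/cancelled
# flow run, looked up by the label the Prefect Kubernetes agent applies.
from kubernetes import client, config

FLOW_RUN_ID = "<flow-run-id>"  # placeholder
NAMESPACE = "default"          # assumption; match your agent's namespace

config.load_kube_config()
batch = client.BatchV1Api()

jobs = batch.list_namespaced_job(
    NAMESPACE, label_selector=f"prefect.io/flow_run_id={FLOW_RUN_ID}"
)
for job in jobs.items:
    # Foreground propagation deletes the job's pods along with the job
    batch.delete_namespaced_job(
        job.metadata.name, NAMESPACE, propagation_policy="Foreground"
    )
```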
k
Thanks for the update!