Aric Huang
01/10/2022, 11:30 PM
We keep hitting "The zone 'projects/<project>/zones/<zone>' does not have enough resources available to fulfill the request", and we've seen it take >10m for the node pool to finish scaling up an instance even when resources are available. My understanding is that the Lazarus process will treat a flow that hasn't started running after more than 10m as a failure and trigger another flow run - however, this is the behavior I've been seeing:
Flow A run -> Zone is out of resources -> Flow B run by Lazarus after 10m -> Flow C run by Lazarus after 10m -> After 30m, Lazarus marks the flow as failed ("A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.") -> Pods for flows A-C are still stuck in a Pending state on Kubernetes and have to be manually deleted.
Given this use case, would you recommend disabling the Lazarus process altogether for these flows? The ideal behavior for us would be for the flow to wait until an instance can be scaled up, even if that takes a few hours. It would also be nice if we could specify a time limit (see the sketch below for one way that could look).
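(Editor's note: a minimal sketch of what a time limit could look like if it lives at the Kubernetes level rather than in Prefect, assuming Prefect 1.x's KubernetesRun run config and the agent's default container name "flow"; the template and the 6-hour value are illustrative assumptions, not something established in this thread. A Job-level activeDeadlineSeconds makes Kubernetes fail the Job after a fixed window, including time its pod spends in Pending waiting for the autoscaler.)

```python
from prefect import Flow
from prefect.run_configs import KubernetesRun

# Illustrative job template only; names and values here are assumptions, not from this thread.
job_template = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        # Kubernetes fails the Job once it has existed this long, even if its pod
        # is still Pending while waiting for the node pool to scale up.
        "activeDeadlineSeconds": 6 * 60 * 60,
        "backoffLimit": 0,
        "template": {
            "spec": {
                "restartPolicy": "Never",
                # "flow" mirrors the container name in the agent's default template
                # (assumption); the agent fills in the image and command at submission time.
                "containers": [{"name": "flow"}],
            }
        },
    },
}

with Flow("autoscaling-flow", run_config=KubernetesRun(job_template=job_template)) as flow:
    ...  # tasks go here
```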
Also, is it expected for there to be "zombie" Kubernetes pods/jobs left over in a case like this, and are there any recommended ways to deal with that? I'm not sure what would happen if resources suddenly became available after the Lazarus process stopped all the flows but before we found them and manually cleaned them up - would they still run even though the flow has been failed in Prefect? Ideally, once a flow is failed, we'd like any pending pods/jobs for that flow to be deleted automatically - not sure if that's possible.
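(Editor's note: a minimal cleanup sketch for the "zombie pod" case, not something from this thread. It assumes the Prefect Kubernetes agent labels its pods with "prefect.io/flow_run_id", that the jobs land in the "default" namespace, and that any pod still Pending an hour after creation belongs to a run Lazarus has already failed.)

```python
from datetime import datetime, timedelta, timezone

from kubernetes import client, config

# Assumptions (not confirmed in this thread): the Prefect Kubernetes agent labels its
# pods with "prefect.io/flow_run_id", jobs run in the "default" namespace, and any pod
# still Pending an hour after creation belongs to a flow run Lazarus already failed.
NAMESPACE = "default"
PENDING_GRACE = timedelta(hours=1)

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
core = client.CoreV1Api()
batch = client.BatchV1Api()

now = datetime.now(timezone.utc)
pods = core.list_namespaced_pod(NAMESPACE, label_selector="prefect.io/flow_run_id")

for pod in pods.items:
    age = now - pod.metadata.creation_timestamp
    if pod.status.phase == "Pending" and age > PENDING_GRACE:
        job_name = pod.metadata.labels.get("job-name")  # label added by the Job controller
        print(f"Deleting stuck pod {pod.metadata.name} (job={job_name}, age={age})")
        if job_name:
            # Deleting the owning Job with Background propagation also removes its pods.
            batch.delete_namespaced_job(job_name, NAMESPACE, propagation_policy="Background")
        else:
            core.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```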
Aric Huang
01/10/2022, 11:40 PM
Kevin Kho
Aric Huang
01/11/2022, 2:36 AM
One of the pods ended up Completed and the others had a CreateContainerError a few hours later - then they were removed from the pod list. At the same time, the node pool autoscaled several instances up, and a few mins later they scaled down. Looks like they moved from Pending to a finished state pretty quickly after the autoscaling happened, which makes sense if they hit the API and saw they didn't need to be run.
Aric Huang
01/11/2022, 2:37 AM
Kevin Kho