Prefect Community #ask-marvin


Ofek K

08/18/2024, 1:28 PM

@Marvin Hi guys, I am using Prefect Cloud, and my work pools represent spot node pools in GCP Kubernetes. I have a lot of flow runs that are "running forever" even though the pod no longer exists in the k8s cluster. Is there any way to handle this, for example to send a signal that a spot node has been taken and get a relevant exit code? Has anyone encountered this problem?

Marvin

08/18/2024, 1:29 PM

### Documentation Insights

There are several references to handling Kubernetes-related issues in the Prefect documentation, such as orchestrating Kubernetes pods and setting up Prefect Cloud connections. However, there isn't a specific mention of handling stuck flow runs due to lost pods.

- Integrate Kubernetes with Prefect: Reference
- Setting up Prefect Cloud API: Reference

### GitHub Insights

There is a relevant GitHub issue titled "An error occurred while monitoring Kubernetes flow run" where the agent loses track of the pod, and the flow run becomes stuck in the running state. The details mirror the problem you described.

- Issue Details: GitHub Issue #8243

### Recommendations

1. Monitor Pods Status: Implement a custom monitoring system to regularly check the status of your Kubernetes pods. If a pod is found missing, handle it programmatically.
2. Prefect API: Use the `prefect` CLI or Prefect's API to query the status of your flow runs and mark them as failed if the corresponding pod is no longer available.
3. Error Handling: Adapt the error handling mechanism in your flow to catch such events and terminate or retry gracefully.
4. Retry Logic: Implement retry logic within the Prefect flow to respond to such failures.

### Next Steps

Consider creating a watchdog service that interfaces with both your Kubernetes cluster and Prefect API to identify and clean up such orphaned flow runs. If you encounter persistent issues, opening a more detailed support request with Prefect might be necessary. Feel free to reach out if you need further assistance.
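A watchdog along the lines described above could look roughly like the sketch below. It is only an illustration under several assumptions, not a tested or official solution: it assumes the Prefect 2.x Python client and the `kubernetes` Python package are installed, and that each flow-run pod carries a label holding the flow run ID. The label key `prefect.io/flow-run-id` and the helper names (`find_orphaned_runs`, `live_pod_run_ids`, `mark_orphans_crashed`) are hypothetical, so check the labels on your own pods before relying on it.

```python
from typing import Iterable, List, Set
from uuid import UUID

# ASSUMPTION: flow-run pods carry this label with the flow run ID as its value.
# Verify with `kubectl get pods --show-labels` and adjust if yours differs.
FLOW_RUN_LABEL = "prefect.io/flow-run-id"


def find_orphaned_runs(
    running_run_ids: Iterable[str], live_run_ids: Iterable[str]
) -> List[str]:
    """Return IDs of flow runs reported as Running that have no live pod."""
    live = set(live_run_ids)
    return sorted(rid for rid in running_run_ids if rid not in live)


def live_pod_run_ids(namespace: str) -> Set[str]:
    """Collect flow run IDs from pods that are still Pending or Running."""
    from kubernetes import client, config  # imported lazily; needs `kubernetes`

    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    pods = client.CoreV1Api().list_namespaced_pod(
        namespace, label_selector=FLOW_RUN_LABEL  # key-only selector: label exists
    )
    return {
        pod.metadata.labels[FLOW_RUN_LABEL]
        for pod in pods.items
        if pod.status.phase in ("Pending", "Running")
    }


async def mark_orphans_crashed(namespace: str) -> List[str]:
    """Move Running flow runs without a live pod into a Crashed state."""
    from prefect import get_client  # imported lazily; needs Prefect 2.x
    from prefect.client.schemas.filters import (
        FlowRunFilter,
        FlowRunFilterState,
        FlowRunFilterStateType,
    )
    from prefect.client.schemas.objects import StateType
    from prefect.states import Crashed

    running_filter = FlowRunFilter(
        state=FlowRunFilterState(
            type=FlowRunFilterStateType(any_=[StateType.RUNNING])
        )
    )
    async with get_client() as prefect_client:
        # NOTE: this reads Running runs from every work pool and is subject to
        # the client's default page size; narrow the filter for real use.
        running = await prefect_client.read_flow_runs(flow_run_filter=running_filter)
        orphans = find_orphaned_runs(
            [str(run.id) for run in running], live_pod_run_ids(namespace)
        )
        for run_id in orphans:
            await prefect_client.set_flow_run_state(
                flow_run_id=UUID(run_id),
                state=Crashed(message="Pod no longer exists (possible spot preemption)"),
                force=True,  # skip orchestration rules that could reject the transition
            )
    return orphans
```

Something like `asyncio.run(mark_orphans_crashed("my-namespace"))` (namespace name is a placeholder) could be run on a schedule, for example from a Kubernetes CronJob on a non-spot node. Because the sketch compares all Running flow runs against pods in a single namespace, runs that execute elsewhere would be flagged incorrectly, so restrict the flow run filter to the relevant work pool or deployments first.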
