<@ULVA73B9P> I get randomly exit code 137 and flow...
# ask-marvin
o
@Marvin I randomly get exit code 137 and the flow is marked as crashed, but when I look at the metrics, the pod is not reaching its limits or the node limit
m
Exit code 137 means the process was killed with SIGKILL (128 + 9), which on Kubernetes most often comes from the Linux Out Of Memory (OOM) killer. If your metrics show the pod staying below its own limit and the node limit, this can be tricky to diagnose, but here are some areas to investigate:
1. Memory usage patterns: Even if the pod's average memory usage is below the limit, a short spike can trigger the OOM killer. Metrics are usually sampled at intervals, so a spike can fall between two samples and never show up on a dashboard. Monitoring memory usage in real time might help identify sudden spikes.
2. Node-level OOM: The node as a whole may be under memory pressure, causing the OOM killer to act on one of your pods even though that pod is below its own limit. Ensure that the node itself has enough memory available.
3. Cgroup limits: Kubernetes enforces container memory limits through cgroups. Ensure that your limits and requests are set appropriately.
4. Prefect's handling of retries and failures: If the crash is due to Prefect's handling of retries or state changes (as highlighted in the related GitHub issues), consider adding more logging and state monitoring.
5. Resource requests and limits: Make sure the resource requests and limits for your pods are set reasonably. If the memory request is too low, the pod can be scheduled onto a node without enough free memory, where the OOM killer may terminate it.
Here is an example of how you can set resource requests and limits in your deployment YAML:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: example-container
          image: example-image
          resources:
            requests:
              memory: "512Mi"
              cpu: "500m"
            limits:
              memory: "1024Mi"
              cpu: "1000m"
```
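On the exit code itself: by the usual shell and container-runtime convention, an exit code above 128 means the process was terminated by signal number `code - 128`. The sketch below decodes that convention in Python; `describe_exit_code` is a hypothetical helper name, not part of Prefect or Kubernetes. Note that SIGKILL only tells you the process was killed forcibly. It does not prove the OOM killer was the sender, so it is still worth checking the pod's last termination reason.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a process exit code into a short description.

    Assumes the common convention that codes above 128 encode
    "terminated by signal (code - 128)".
    """
    if code == 0:
        return "exited normally"
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```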
References:
- GitHub issue on flows with "Crashed" subflows and timeout: Issue #7683
- Example of flow errors and handling: State Change Hooks
Monitoring these aspects closely should help you identify and resolve the issue. If needed, consider reaching out to Prefect support for more specific guidance.
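One way to catch spikes that scraped metrics miss is to log the process's own peak memory from inside the flow. The following is a minimal sketch using Python's standard `resource` module (Unix only); `peak_rss_mib` and `log_peak` are hypothetical helper names, and you would call `log_peak` at whichever points in your flow you suspect. It reports only the current process, while the container limit covers every process in the container, so treat the number as a lower bound.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return the peak resident set size of this process in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in KiB on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_peak(label: str) -> float:
    """Print and return the peak RSS observed so far."""
    peak = peak_rss_mib()
    print(f"[{label}] peak RSS so far: {peak:.1f} MiB")
    return peak


if __name__ == "__main__":
    log_peak("start")
    # Simulate a short-lived spike: allocate and fill roughly 50 MiB.
    spike = b"x" * (50 * 1024 * 1024)
    del spike
    # The peak stays recorded even though the memory was released.
    log_peak("after spike")
```

Because the value is a high-water mark, it still reflects a spike after the memory has been freed, which is what makes it useful next to interval-based dashboards.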