<@ULVA73B9P> I am running a Kubernetes cluster to ...
# ask-marvin
a
@Marvin I am running a Kubernetes cluster to run Prefect jobs and running Prefect agent 2.19. Occasionally, flow runs get stuck in "Pending" state and never start running. Restarting the Prefect agent pod gets new flows running again, but flows that were stuck in "Pending" never start running again. In the agent logs, I see logs like this referencing the stuck runs:
22:58:48.048 | INFO    | prefect.agent - Submitting flow run '574e5e45-bc92-4d22-a8d3-f42c13fd70e8'
Normally, I think these logs should be followed by something like:
22:58:48.597 | INFO    | prefect.infrastructure.kubernetes-job - Job 'pi788-aludra-void-z2h6k': Pod has status 'Pending'.
22:58:48.679 | INFO    | prefect.agent - Completed submission of flow run '574e5e45-bc92-4d22-a8d3-f42c13fd70e8'
But for the stuck runs, there is no following log that says "Completed submission of flow run <id>". We tried upgrading the agent from 2.10 to 2.19, but that didn't fix the issue. CPU usage is not high (<1%) and memory usage is low (<400MB) on the agent pod. I also don't see any errors in the agent logs around the time the runs start getting stuck. What could cause this issue to happen?
m
It seems that your issue with Prefect flow runs getting stuck in the "Pending" state is not uncommon, and there are several potential causes and solutions to explore:
1. Infrastructure and Configuration: Ensure that your Kubernetes setup is correctly configured for Prefect. The Prefect agent should have proper access to the Kubernetes API, and your KubernetesJob infrastructure should be configured correctly. Check the KubernetesJob requirements for more details.
2. Concurrency and Resource Limits: As seen in this GitHub issue, concurrency limits and resource configurations can affect flow execution. Review your flow's concurrency settings and ensure that there are no conflicts or resource constraints that might be causing the issue.
3. Networking and API Issues: Network issues or API endpoint misconfigurations can lead to retries or failures in flow submission. In one GitHub issue, a retryable exception was encountered due to a protocol error, which might be similar to your situation. Ensure that the Prefect server or API endpoint is stable and accessible.
4. Agent and Worker Lifecycle: From another GitHub issue, it was noted that if a worker is destroyed and replaced, flows in process might not transition properly. Ensure that your agents and workers are stable and not being frequently replaced or restarted.
5. Logs and Debugging: Continue monitoring logs for any anomalies or patterns. The absence of "Completed submission" logs suggests a submission issue. Check for any network or API errors around the time of submission.
If these suggestions do not resolve the issue, consider reaching out to the Prefect community or support for more tailored assistance.
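To make the log check in point 5 concrete, here is a minimal sketch that scans agent log lines and lists flow runs that have a "Submitting flow run" entry but no matching "Completed submission of flow run" entry. It uses only the Python standard library and assumes the log format matches the lines quoted in the question. The function name `find_unfinished_submissions` and the second flow run ID in the sample are made up for illustration.

```python
import re

# Patterns based on the agent log lines quoted in the question.
SUBMITTING = re.compile(r"Submitting flow run '([0-9a-f-]+)'")
COMPLETED = re.compile(r"Completed submission of flow run '([0-9a-f-]+)'")


def find_unfinished_submissions(log_lines):
    """Map each flow run ID that was submitted but never completed
    to the timestamp of its 'Submitting' line."""
    unfinished = {}  # dicts preserve insertion order, so results follow log order
    for line in log_lines:
        match = SUBMITTING.search(line)
        if match:
            # Assumes the timestamp is the first ' | '-separated field.
            unfinished[match.group(1)] = line.split(" | ")[0].strip()
            continue
        match = COMPLETED.search(line)
        if match:
            unfinished.pop(match.group(1), None)
    return unfinished


# Sample input: the first run is taken from the question; the second ID and
# its timestamp are hypothetical and stand in for a stuck run.
sample = [
    "22:58:48.048 | INFO    | prefect.agent - Submitting flow run '574e5e45-bc92-4d22-a8d3-f42c13fd70e8'",
    "22:58:48.597 | INFO    | prefect.infrastructure.kubernetes-job - Job 'pi788-aludra-void-z2h6k': Pod has status 'Pending'.",
    "22:58:48.679 | INFO    | prefect.agent - Completed submission of flow run '574e5e45-bc92-4d22-a8d3-f42c13fd70e8'",
    "22:59:10.000 | INFO    | prefect.agent - Submitting flow run '00000000-0000-0000-0000-000000000001'",
]

stuck = find_unfinished_submissions(sample)
for run_id, timestamp in stuck.items():
    print(f"{timestamp}  submission never completed: {run_id}")
```

To use it on real logs, save the agent pod's log output to a file and pass the file's lines to the function. Comparing the timestamps of the unfinished submissions against the pod's restart times and any API errors may help narrow down when submissions start to hang.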