Abyaya Lamsal
03/24/2025, 5:41 PM
After upgrading from 2.14.16 to 2.20.16, I started seeing intermittent issues with a subset of flows. The failures appear random: not every flow run is affected. For reference, I am using a custom image. Logs attached below:
13:29:55.058 | INFO | prefect.flow_runs.worker - Worker 'KubernetesWorker a5d26a51-ff36-4697-8daf-f8aa3a0fea54' submitting flow run '855ead39-db97-4fa6-85b0-723ddd90b7c8'
13:29:55.236 | INFO | prefect.flow_runs.worker - Creating Kubernetes job...
13:29:55.314 | INFO | prefect.flow_runs.worker - Completed submission of flow run '855ead39-db97-4fa6-85b0-723ddd90b7c8'
13:29:55.349 | INFO | prefect.flow_runs.worker - Job 'adept-hog-hwcjq': Pod has status 'Pending'.
13:30:55.327 | ERROR | prefect.flow_runs.worker - Job 'adept-hog-hwcjq': Pod never started.
13:30:55.570 | INFO | prefect.flow_runs.worker - Pod event 'Scheduled' at 2025-03-18 13:29:55+00:00: Successfully assigned [OUR_NAMESPACE]/adept-hog-hwcjq-pqhbc to <INTERNAL_NODE>
13:30:55.571 | INFO | prefect.flow_runs.worker - Job event 'SuccessfulCreate' at 2025-03-18 13:29:55+00:00: Created pod: adept-hog-hwcjq-pqhbc
13:30:55.572 | INFO | prefect.flow_runs.worker - Pod event 'Pulling' at 2025-03-18 13:29:56+00:00: Pulling image "<CUSTOM_IMAGE>"
13:30:55.572 | INFO | prefect.flow_runs.worker - Pod event 'Pulled' at 2025-03-18 13:30:33+00:00: Successfully pulled image "<CUSTOM_IMAGE>" in 37.16s (37.16s including waiting). Image size: <SIZE> bytes.
13:30:55.716 | INFO | prefect.flow_runs.worker - Reported flow run '855ead39-db97-4fa6-85b0-723ddd90b7c8' as crashed: Flow run infrastructure exited with non-zero status code -1.
<NORMAL EXECUTION>
...
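To make the timing in the worker log above easier to see, here is a minimal Python sketch that parses a few of those lines and measures the gap between job creation and the "Pod never started" error. The helper names (`parse_line`, `seconds_between`) are hypothetical and written only for this illustration; they are not part of Prefect. On the lines quoted above, the error arrives about 60.09 seconds after job creation, while the pod events show the image pull finishing at 13:30:33, roughly 22 seconds before the error was logged.

```python
from datetime import datetime

# A few worker log lines copied verbatim from the excerpt above.
WORKER_LOG = """\
13:29:55.236 | INFO | prefect.flow_runs.worker - Creating Kubernetes job...
13:29:55.349 | INFO | prefect.flow_runs.worker - Job 'adept-hog-hwcjq': Pod has status 'Pending'.
13:30:55.327 | ERROR | prefect.flow_runs.worker - Job 'adept-hog-hwcjq': Pod never started.
"""


def parse_line(line):
    """Split 'HH:MM:SS.mmm | LEVEL | logger - message' into (time, level, message)."""
    stamp, level, rest = (part.strip() for part in line.split("|", 2))
    _, _, message = rest.partition(" - ")
    return datetime.strptime(stamp, "%H:%M:%S.%f"), level, message


def seconds_between(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker; None if either marker is missing."""
    start = None
    for line in log_text.splitlines():
        if not line.strip():
            continue
        timestamp, _level, message = parse_line(line)
        if start is None:
            if start_marker in message:
                start = timestamp
        elif end_marker in message:
            return (timestamp - start).total_seconds()
    return None


# Time from job creation to the worker giving up on the pod.
wait = seconds_between(WORKER_LOG, "Creating Kubernetes job", "Pod never started")

# Pod event times as reported in the log (2025-03-18, UTC).
pull_started = datetime(2025, 3, 18, 13, 29, 56)
pull_finished = datetime(2025, 3, 18, 13, 30, 33)
pull_seconds = (pull_finished - pull_started).total_seconds()

print(f"wait before 'Pod never started': {wait:.3f} s")
print(f"image pull duration from pod events: {pull_seconds:.0f} s")
```

The numbers only describe what the log shows. They do not by themselves explain why the pod was still considered not started after the pull had completed.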
The job eventually runs. The problem is that if I subscribe to any failure notification, I get bombarded with crash notifications at random, which is not very helpful. I would appreciate any pointers here. Here is a sample of the job logs:
Abyaya Lamsal
03/25/2025, 4:16 PM