# ask-community
Hello Community! I'm having an issue with Prefect wrongly concluding that a Flow has Crashed because of an initial, very short-lived unavailability of the Kubernetes node on which the Job containing the Flow should run.
16:18:15.289 | INFO    | prefect.flow_runs.worker - Worker 'KubernetesWorker 69f0259a-aa9d-492f-a735-9041ff25e12c' submitting flow run 'e4c0c8a9-c896-4145-b0e4-61b6259de0b2'
16:18:15.923 | INFO    | prefect.flow_runs.worker - Creating Kubernetes job...
16:18:16.166 | INFO    | prefect.flow_runs.worker - Job 'indigo-dinosaur-qs8dx': Pod has status 'Pending'.
16:18:16.174 | INFO    | prefect.flow_runs.worker - Completed submission of flow run 'e4c0c8a9-c896-4145-b0e4-61b6259de0b2'
16:19:16.072 | ERROR   | prefect.flow_runs.worker - Job 'indigo-dinosaur-qs8dx': Pod never started.
16:19:16.087 | INFO    | prefect.flow_runs.worker - Pod event 'FailedScheduling' at 2023-12-18 16:18:16+00:00: 0/9 nodes are available: 9 node(s) didn't match Pod's node affinity/selector. preemption: 0/9 nodes are available: 9 Preemption is not helpful for scheduling..
16:19:16.088 | INFO    | prefect.flow_runs.worker - Job event 'SuccessfulCreate' at 2023-12-18 16:18:16+00:00: Created pod: indigo-dinosaur-qs8dx-n6vbr
16:19:16.089 | INFO    | prefect.flow_runs.worker - Pod event 'Nominated' at 2023-12-18 16:18:17+00:00: Pod should schedule on: machine/default-dthxn
16:19:16.089 | INFO    | prefect.flow_runs.worker - Pod event 'Scheduled' at 2023-12-18 16:18:51+00:00: Successfully assigned prefect-narwhals/indigo-dinosaur-qs8dx-n6vbr to ip-10-144-37-210.us-west-2.compute.internal
16:19:16.090 | INFO    | prefect.flow_runs.worker - Pod event 'Pulling' at 2023-12-18 16:18:51+00:00: Pulling image "730998372749.dkr.ecr.us-west-2.amazonaws.com/tackle-application-customer-access:c2aa66a8ed04689cbfe58c2ae079cb80c738c1b8"
16:19:16.268 | INFO    | prefect.flow_runs.worker - Reported flow run 'e4c0c8a9-c896-4145-b0e4-61b6259de0b2' as crashed: Flow run infrastructure exited with non-zero status code -1.
The Flow does run without issue after the Node, provided by AWS as an On-Demand EC2 instance, becomes available and the Job's Pod can be scheduled. Is there a way to have Prefect be more tolerant of an initial FailedScheduling event, or wait longer before assigning a Final State to the Flow?
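For reference, the knob in question is the Kubernetes work pool's pod_watch_timeout_seconds job variable. Below is a minimal sketch of overriding it per deployment, assuming Prefect 2.x with flow.deploy and a Kubernetes work pool; the flow name, work pool name "k8s-pool", and image are placeholders, not values from this thread:

from prefect import flow


@flow(log_prints=True)
def customer_access_flow():
    print("flow body goes here")


if __name__ == "__main__":
    # Placeholder deployment: "k8s-pool" and the image are illustrative only.
    # pod_watch_timeout_seconds controls how long the Kubernetes worker waits
    # for the flow run's Pod to leave 'Pending' before reporting a crash
    # (default 60 seconds).
    customer_access_flow.deploy(
        name="customer-access",
        work_pool_name="k8s-pool",
        image="123456789012.dkr.ecr.us-west-2.amazonaws.com/example:latest",
        build=False,
        push=False,
        job_variables={"pod_watch_timeout_seconds": 120},
    )

The same default can also be changed once on the work pool's base job template instead of per deployment.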
I adjusted the Pod Watch Timeout setting from 60 to 120 seconds, and the issue appears to have gone away (two Flow deployments without issue so far). Now, however, the last Pod status the Prefect Worker receives for submitted Flows is Running, even though the parent Jobs of those Pods completed successfully each time, according to kubectl output.
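For what it's worth, the kubectl check can also be reproduced with the official kubernetes Python client to confirm the Job's terminal state; the job name and namespace below are copied from the log above and will differ for each run:

from kubernetes import client, config

# Use the same kubeconfig/context that kubectl uses.
config.load_kube_config()

batch = client.BatchV1Api()

# Job name and namespace taken from the worker log above; substitute your own.
job = batch.read_namespaced_job(
    name="indigo-dinosaur-qs8dx",
    namespace="prefect-narwhals",
)

# A non-empty `succeeded` count and a 'Complete' condition confirm the Job
# finished, even if the worker's last observed Pod status was 'Running'.
print("succeeded pods:", job.status.succeeded)
for condition in job.status.conditions or []:
    print(condition.type, condition.status)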