# prefect-kubernetes

Kevin Grismore

01/13/2023, 2:47 PM
Hi Prefect people! I have a flow that starts around 200 subflows (all a single deployment with different parameters) as kubernetes jobs on a single work queue with a concurrency of 10. I'm running into two issues: First, a few of the subflows never start and are in either a pending or crashed state with either no logs or a single log about downloading flow code. I've increased the timeout window to 10 minutes for pods to start but that doesn't seem to help. Second, some long-running flows have their state changed to crashed with exit status -1 even though they're still running. When they reach their next task, they crash with logs about that task being already finished. Both the agent and the flows are on 2.7.7. Been stuck on this for a few days and would love some advice!
Following up here, I realized that the `KubernetesJob` block I was using was created from an older version of Prefect, in which `Job Watch Timeout Seconds` had a different default value and description. I created a new block and left the default of `None`. I suspect this may have been related to the jobs appearing as crashed while still running, so I'm running again with the new block.
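For reference, the same setting can be pinned per deployment instead of relying on the block's stored default; a minimal sketch, assuming a Prefect 2.x `deployment.yaml` that references a `KubernetesJob` infrastructure block (field name as it appears in that block):

```yaml
# deployment.yaml snippet (illustrative): leave the job watch timeout
# unset so the agent does not time out while watching the job.
infra_overrides:
  job_watch_timeout_seconds: null
```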
Nope, crashed state while still running again
Here's all the info I've collected about this. The flow is running normally, but the agent logs stop showing activity from the flow for ~30 minutes. Then the "Job did not complete" error follows:
```
22:03:50.966 | INFO    | Task run 'query_es-07038a84-0' - Processed 13249000 records...
22:03:52.824 | INFO    | Task run 'query_es-07038a84-0' - Processed 13250000 records...
22:03:57.345 | INFO    | Task run 'query_es-07038a84-0' - Processed 13251000 records...
22:35:15.572 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'pcv-delivery-sync-and-transform-ktz7t': Job did not complete.
22:35:15.690 | INFO    | prefect.agent - Reported flow run '51ee9c5f-6dce-4f64-a660-52b6d069129b' as crashed: Flow run infrastructure exited with non-zero status code -1.
```
However, the flow logs (and logs from the pod the flow is running in) continue to report successfully:
```
22:03:50.966 | INFO    | Task run 'query_es-07038a84-0' - Processed 13249000 records...
22:03:52.824 | INFO    | Task run 'query_es-07038a84-0' - Processed 13250000 records...
22:03:57.345 | INFO    | Task run 'query_es-07038a84-0' - Processed 13251000 records...
22:03:59.067 | INFO    | Task run 'query_es-07038a84-0' - Processed 13252000 records...
22:03:59.725 | INFO    | Task run 'query_es-07038a84-0' - Processed 13253000 records...
22:04:01.140 | INFO    | Task run 'query_es-07038a84-0' - Processed 13254000 records...
```
and so on
Without really understanding what's going on here, it seems like the agent stops hearing events from the pod, and despite the timeout being set to `None`, it eventually concludes the flow has crashed or otherwise become unreachable.
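To illustrate the failure mode being described, here's a minimal stdlib-only sketch (not Prefect's actual code) of why a silently dropped watch can look like a crashed job. A Kubernetes watch is a long-lived stream the API server may drop without delivering a terminal event; `fake_watch` stands in for that stream, and the two watchers show the buggy versus resilient ways to handle the stream ending:

```python
def fake_watch(events, drop_after=3):
    # Simulate a Kubernetes watch stream that the API server silently
    # drops after `drop_after` events, with no terminal event delivered.
    for event in events[:drop_after]:
        yield event

# The job is actually still running, then succeeds.
JOB_EVENTS = ["running", "running", "running", "running", "succeeded"]

def naive_watch_job():
    """Buggy pattern: end-of-stream is read as 'Job did not complete'."""
    for event in fake_watch(JOB_EVENTS):
        if event == "succeeded":
            return "completed"
    # The stream expired mid-run, so the job is wrongly reported crashed.
    return "did not complete"

def resilient_watch_job():
    """Re-open the watch after an expiry instead of concluding failure."""
    offset = 0
    while offset < len(JOB_EVENTS):
        for event in fake_watch(JOB_EVENTS[offset:]):
            offset += 1
            if event == "succeeded":
                return "completed"
    return "did not complete"
```

Under this sketch, the naive watcher reports the still-running job as crashed while the reconnecting watcher sees it through to success, matching the symptom above.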
Eventually, the flow proceeds to the next task:
```
Created task run 'cloud_storage_upload_blob_from_file-9a1bf371-0' for task 'cloud_storage_upload_blob_from_file'
06:52:45 PM
Executing 'cloud_storage_upload_blob_from_file-9a1bf371-0' immediately...
06:52:45 PM
```
but the task reports as already finished:
```
Task run '9951b058-0cde-48b3-b7ff-abb0f82a1bc4' already finished.
06:52:46 PM
cloud_storage_upload_blob_from_file-9a1bf371-0
```
Then we get the `MissingResult` error:
```
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
```

Nate

01/17/2023, 10:53 PM
Hi @Kevin Grismore, if you're still having trouble with this (sorry about the delayed response, we had a long weekend as a company), do you know how long it takes for flow runs to crash? Wondering in particular if this could be related to an issue we've noticed with crashes after 4 hours of watching Kubernetes resources.

Kevin Grismore

01/22/2023, 3:38 PM
Yep, this looks like the one. Looking forward to any changes and thanks for staying on top of it! We have an alternative for running these jobs for now 🙂