# prefect-kubernetes

Kevin Grismore

01/13/2023, 2:47 PM
Hi Prefect people! I have a flow that starts around 200 subflows (all a single deployment with different parameters) as kubernetes jobs on a single work queue with a concurrency of 10. I'm running into two issues: First, a few of the subflows never start and are in either a pending or crashed state with either no logs or a single log about downloading flow code. I've increased the timeout window to 10 minutes for pods to start but that doesn't seem to help. Second, some long-running flows have their state changed to crashed with exit status -1 even though they're still running. When they reach their next task, they crash with logs about that task being already finished. Both the agent and the flows are on 2.7.7. Been stuck on this for a few days and would love some advice!
Following up here, I realized that the `KubernetesJob` block I was using was created from an older version of Prefect, in which `Job Watch Timeout Seconds` had a different default value and description. I created a new block and left the default of `None`. I suspect this may have been related to the jobs appearing as crashed while still running, so I'm running again with the new block.
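For reference, the same setting can be pinned per deployment instead of relying on the block's stored default; a minimal sketch, assuming a Prefect 2.x `deployment.yaml` that references a `KubernetesJob` infrastructure block (field name as it appears in that block):

```yaml
# deployment.yaml snippet (illustrative): leave the job watch timeout
# unset so the agent does not time out while watching the job.
infra_overrides:
  job_watch_timeout_seconds: null
```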
Nope, crashed state while still running again
Here's all the info I've collected about this. The flow is running normally, but the agent logs stop showing activity from the flow for ~30 minutes. Then the "Job did not complete" error follows:
```
22:03:50.966 | INFO    | Task run 'query_es-07038a84-0' - Processed 13249000 records...
22:03:52.824 | INFO    | Task run 'query_es-07038a84-0' - Processed 13250000 records...
22:03:57.345 | INFO    | Task run 'query_es-07038a84-0' - Processed 13251000 records...
22:35:15.572 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'pcv-delivery-sync-and-transform-ktz7t': Job did not complete.
22:35:15.690 | INFO    | prefect.agent - Reported flow run '51ee9c5f-6dce-4f64-a660-52b6d069129b' as crashed: Flow run infrastructure exited with non-zero status code -1.
```
However, the flow logs (and logs from the pod the flow is running in) continue to report successfully:
```
22:03:50.966 | INFO    | Task run 'query_es-07038a84-0' - Processed 13249000 records...
22:03:52.824 | INFO    | Task run 'query_es-07038a84-0' - Processed 13250000 records...
22:03:57.345 | INFO    | Task run 'query_es-07038a84-0' - Processed 13251000 records...
22:03:59.067 | INFO    | Task run 'query_es-07038a84-0' - Processed 13252000 records...
22:03:59.725 | INFO    | Task run 'query_es-07038a84-0' - Processed 13253000 records...
22:04:01.140 | INFO    | Task run 'query_es-07038a84-0' - Processed 13254000 records...
```
and so on
Without really understanding what's going on here, it seems like the agent stops hearing events from the pod, and despite the timeout being set to `None`, it eventually concludes the flow has crashed or otherwise become unreachable.
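To illustrate the failure mode being described, here's a minimal stdlib-only sketch (not Prefect's actual code) of why a silently dropped watch can look like a crashed job. A Kubernetes watch is a long-lived stream the API server may drop without delivering a terminal event; `fake_watch` stands in for that stream, and the two watchers show the buggy versus resilient ways to handle the stream ending:

```python
def fake_watch(events, drop_after=3):
    # Simulate a Kubernetes watch stream that the API server silently
    # drops after `drop_after` events, with no terminal event delivered.
    for event in events[:drop_after]:
        yield event

# The job is actually still running, then succeeds.
JOB_EVENTS = ["running", "running", "running", "running", "succeeded"]

def naive_watch_job():
    """Buggy pattern: end-of-stream is read as 'Job did not complete'."""
    for event in fake_watch(JOB_EVENTS):
        if event == "succeeded":
            return "completed"
    # The stream expired mid-run, so the job is wrongly reported crashed.
    return "did not complete"

def resilient_watch_job():
    """Re-open the watch after an expiry instead of concluding failure."""
    offset = 0
    while offset < len(JOB_EVENTS):
        for event in fake_watch(JOB_EVENTS[offset:]):
            offset += 1
            if event == "succeeded":
                return "completed"
    return "did not complete"
```

Under this sketch, the naive watcher reports the still-running job as crashed while the reconnecting watcher sees it through to success, matching the symptom above.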
Eventually, the flow proceeds to the next task:
```
Created task run 'cloud_storage_upload_blob_from_file-9a1bf371-0' for task 'cloud_storage_upload_blob_from_file'
06:52:45 PM
Executing 'cloud_storage_upload_blob_from_file-9a1bf371-0' immediately...
06:52:45 PM
```
but the task reports as already finished:
```
Task run '9951b058-0cde-48b3-b7ff-abb0f82a1bc4' already finished.
06:52:46 PM
cloud_storage_upload_blob_from_file-9a1bf371-0
```
Then we get the `MissingResult` error:
```
prefect.exceptions.MissingResult: State data is missing. Typically, this occurs when result persistence is disabled and the state has been retrieved from the API.
```

Nate

01/17/2023, 10:53 PM
Hi @Kevin Grismore, if you're still having trouble with this (sorry about the delayed response, we had a long weekend as a company), do you know how long it takes for flow runs to crash? Wondering in particular if this could be related to an issue we've noticed with crashes after 4 hours of watching Kubernetes resources.

Kevin Grismore

01/22/2023, 3:38 PM
Yep, this looks like the one. Looking forward to any changes and thanks for staying on top of it! We have an alternative for running these jobs for now 🙂