Jeff Charbeneau

08/01/2023, 4:51 PM
Hello, all. Since late last week, we have noticed Prefect flows running in k8s showing as "Crashed" in the Prefect Cloud UI. Using k9s to inspect the pods has revealed that the pods were in fact still running. Sometimes the flows have sprung suddenly from "Crashed" to "Running" and completed successfully. One that "Crashed" yesterday apparently still has the associated pod running. What might account for this? What do you recommend we do? Thanks in advance.
Nate

08/01/2023, 5:29 PM
hi @Jeff Charbeneau - do you have the pod logs from a run where pods were inappropriately marked as crashed? if you have the worker logs from that time, those would also be helpful. also, I've assumed you're using the k8s worker here and not the agent - is that true?
Jeff Charbeneau

08/01/2023, 5:34 PM
@Nate Thanks for being so responsive. Let me see if I can recreate this behavior and collect some pod logs. I'll also confirm whether we're using the Prefect k8s agent rather than the k8s worker, as I expect we are.
@Nate In k9s, I see a prefect-2-agent-* pod, leading me to believe we're using the Prefect k8s agent, and not the k8s worker. Below are some pod logs from a recent flow run that had the described behavior. In the Prefect UI, the flow changed from "Crashed" to "Running" around the time that this log message was printed.
19:21:05.042 | INFO    | Flow run 'phi652-alterf' - Downloading flow code from storage at None
Thanks in advance for your thoughts.
@Nate How much of a chance have you had to look at the logs?
Nate

08/03/2023, 4:06 PM
hey @Jeff Charbeneau - afaict the failure you shared occurs within the "c4_job_manager":
c4_job_manager.c4_error.C4Error: ('Job be7bajuioxg6gvfnhcufc6 failed, detail=, output={}, ', 'For logs, see <https://relativity.splunkcloud.com/en-US/app/search/search?q=search%20index%3Dr1_k8s_logs_prod%20source%3D*be7bajuioxg6gvfnhcufc6*&display.page.search.mode=verbose&dispatch.sample_ratio=1&earliest=-30m%40m&latest=now>')
can you explain what this is? is there some way it would revive a process that prefect has lost touch with (after a Crash of sorts)?
Jeff Charbeneau

08/03/2023, 4:31 PM
Thanks for taking a look @Nate. You're right, there is a failure in the flow. However, that failure occurs after the flow changes from "Crashed", where it starts, to "Running". Furthermore, we've got retries on that task and are able to recover from the failure. Again, the issue is that the flow starts in a "Crashed" state, the pod is still running, and then the flow state changes to "Running".
Nate

08/03/2023, 4:36 PM
hmm sorry if I'm just missing it, but where are you seeing the transition to Crashed in the logs? if it's not in the logs, is it just in the UI?
Jeff Charbeneau

08/03/2023, 4:58 PM
Great question, Nate. "Crashed" appears only in the Prefect UI, not in the logs.