
Nick Coy

11/07/2022, 9:12 PM
I have noticed that flow runs seem to get stuck in a running state. This seems to be happening more and more frequently. We are using Prefect 2.4.0 with K8s and GCS for infrastructure. For these flow runs I am seeing logs on the agent like this:
prefect.agent - An error occured while monitoring flow run <flow_run_id> The flow run will not be marked as failed, but an issue may have occurred.

Zanie

11/07/2022, 9:32 PM
Hi! Can you share the error from the agent logs? You may need to turn on DEBUG level logging.
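For reference, in Prefect 2 that usually means setting PREFECT_LOGGING_LEVEL=DEBUG in the agent's environment; a minimal sketch to confirm the effective level from Python, assuming the variable is set where the agent runs:
from prefect.settings import PREFECT_LOGGING_LEVEL

# Prints the logging level the agent would use; expect "DEBUG" once the env var is set.
print(PREFECT_LOGGING_LEVEL.value())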

Nick Coy

11/07/2022, 9:48 PM
Here are the logs for the flow run that got stuck. I don't believe I have DEBUG level logging enabled currently tho

Zanie

11/07/2022, 9:57 PM
It looks like the pod was missing when the agent attempted to check the status of it. Did the pod create successfully? Do you have a short pod TTL?

Nick Coy

11/07/2022, 9:59 PM
I think the pod created successfully since it was running tasks in the flow. I have
"ttlSecondsAfterFinished": 180

Zanie

11/07/2022, 10:06 PM
It’s weird that the cluster would report the pod as missing. Are those logs for the flow run itself afterwards? Do you have these logs in a human-readable format? 😄

Nick Coy

11/07/2022, 10:09 PM
To be clear, prefect_logs.csv contains the agent's logs for the flow run. Do you want to see the logs from the job?

Zanie

11/07/2022, 10:13 PM
Well the agent forwards logs from the flow run pod to local stdout but I’m surprised that the logs would get forwarded if the agent got a “NotFound” from Kubernetes when it tried to read the pod.
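Roughly the kind of read the agent does against the Kubernetes API; a sketch with the kubernetes Python client (not Prefect's actual code, pod name illustrative):
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core = client.CoreV1Api()
try:
    pod = core.read_namespaced_pod(name="prefect-job-abc123", namespace="default")
    print(pod.status.phase)
except ApiException as exc:
    if exc.status == 404:
        print("NotFound: the pod is already gone from the API server")
    else:
        raise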

Nick Coy

11/07/2022, 10:20 PM
Yeah, I'm not sure why this is occurring. The last log from the pod was that the last task was running:
08:32:42.927 | INFO    | Task run 'run_bigquery_query-18a8523b-1' - hl-production-343314.media.pinterest_metrics_core
I found this on another flow run that was stuck in Running; I'm not sure if this is helpful:
File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 227, in _submit_run_and_capture_errors
    result = await infrastructure.run(task_status=task_status)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 237, in run
    return await run_sync_in_worker_thread(self._watch_job, job_name)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 57, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 432, in _watch_job
    status_code=first_container_status.state.terminated.exit_code,
AttributeError: 'NoneType' object has no attribute 'exit_code'
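If I'm reading that right, the agent read the container status while state.terminated was still None, i.e. the container wasn't in a terminated state at that moment. A minimal illustration of that attribute access (made-up status object, not Prefect's code):
from kubernetes.client import V1ContainerState, V1ContainerStatus

status = V1ContainerStatus(
    name="prefect-job",
    image="example-image",
    image_id="",
    ready=False,
    restart_count=0,
    state=V1ContainerState(terminated=None),  # still running/waiting, not terminated
)
status.state.terminated.exit_code  # AttributeError: 'NoneType' object has no attribute 'exit_code'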

Kristian Andersen Hole

01/23/2023, 7:47 PM
I’m experiencing this a lot as well on K8s. My pods continue running and complete the tasks, but the agent somehow loses track of it at some point before completion and complains about:
An error occured while monitoring flow run {uuid}. The flow run will not be marked as failed, but an issue may have occurred.
This happens in about 1 in 20 runs. The flows then become stuck in the Running state, causing problems for other runs due to concurrency limits in our pipeline. Would be so grateful for any insight into this issue 🙂 Thanks
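(For anyone else hitting this: one stopgap that frees the concurrency slot is to force the stuck run out of Running by hand. A sketch, assuming the Prefect 2 client's set_flow_run_state and the Crashed state helper are available in your version; the flow run ID is illustrative:)
import asyncio
from uuid import UUID

from prefect.client import get_client
from prefect.orion.schemas.states import Crashed

async def release_stuck_run(flow_run_id: UUID) -> None:
    async with get_client() as client:
        # force=True skips orchestration rules that might otherwise reject the transition
        result = await client.set_flow_run_state(flow_run_id, state=Crashed(), force=True)
        print(result.status)

asyncio.run(release_stuck_run(UUID("00000000-0000-0000-0000-000000000000")))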

Zanie

01/23/2023, 8:44 PM
Is a traceback displayed for the error?
If not, I would recommend enabling debug level logs.

Kristian Andersen Hole

01/23/2023, 8:45 PM
traceback

Zanie

01/23/2023, 9:22 PM
I see, so while we are watching the job the pod is gone
Do you know what’s happening to the pod? 🙂

Kristian Andersen Hole

01/23/2023, 10:50 PM
The job pod continues executing the flow and tasks to completion (we stream logs to Elasticsearch from the pods). In the case above, flow completion happened 15 seconds after the agent reported the error, i.e. the pod continued executing the flow with no problems. There were also 40 seconds between the last log message the agent forwarded from the pod and the agent throwing the error. (I guess there is a timeout on agents losing contact with pods.)
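If it is a watch timeout, the knobs would be on the KubernetesJob block; the field names below exist in Prefect 2, but whether they explain the roughly 40 second gap here is a guess (values illustrative):
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    pod_watch_timeout_seconds=600,   # upper bound on waiting for the pod to start
    job_watch_timeout_seconds=3600,  # how long the agent's job watch runs (semantics vary slightly by version)
)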

Zanie

01/24/2023, 4:23 PM
Would you mind opening an issue with these details and the logs?
It seems unlikely that I’ll be able to resolve it for you here in Slack and we’ll need to dig into it further.
👍 1