# ask-community
n
I have noticed that flow runs seem to get stuck in a running state. This seems to be happening more and more frequently. We are using Prefect 2.4.0 and using K8s and GCS for infrastructure. I have found that for these flow runs I am seeing logs on the agent like this
prefect.agent - An error occured while monitoring flow run <flow_run_id> The flow run will not be marked as failed, but an issue may have occurred.
z
Hi! Can you share the error from the agent logs? You may need to turn on DEBUG level logging.
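For context, debug-level logs in Prefect 2.x are controlled by the PREFECT_LOGGING_LEVEL setting. A minimal sketch, assuming it is passed through the KubernetesJob infrastructure block (the block name below is hypothetical); the same variable can also be set on the agent's own container to get debug-level agent logs:
```
# Minimal sketch, assuming Prefect 2.x: set PREFECT_LOGGING_LEVEL=DEBUG inside
# the flow run pod via the KubernetesJob infrastructure block.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    env={"PREFECT_LOGGING_LEVEL": "DEBUG"},  # debug logs inside the flow run pod
    # ...image, namespace, etc. as already configured
)
k8s_job.save("k8s-debug-logging", overwrite=True)  # hypothetical block name
```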
n
Here are the logs for the flow run that got stuck. I don't believe I have DEBUG level logging enabled currently tho
z
It looks like the pod was missing when the agent attempted to check the status of it. Did the pod create successfully? Do you have a short pod TTL?
n
I think the pod created successfully since it was running tasks in the flow. I have
"ttlSecondsAfterFinished": 180
z
It’s weird that the cluster would report the pod as missing. Are those logs from the flow run itself afterwards? Do you have these logs in a human-readable format? 😄
n
To be clear, the prefect_logs.csv are the logs on the agent for the flow run. Do you want to see the logs from the job?
z
Well the agent forwards logs from the flow run pod to local stdout but I’m surprised that the logs would get forwarded if the agent got a “NotFound” from Kubernetes when it tried to read the pod.
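For illustration, this is roughly how a missing pod surfaces from the official Kubernetes Python client (not Prefect's actual agent code); the pod name and namespace are placeholders:
```
# Rough illustration: reading a deleted/missing pod raises an ApiException with
# status 404 ("NotFound") from the Kubernetes Python client.
from kubernetes import client, config
from kubernetes.client.exceptions import ApiException

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core_v1 = client.CoreV1Api()

try:
    pod = core_v1.read_namespaced_pod(name="my-flow-run-pod", namespace="default")
    print(pod.status.phase)
except ApiException as exc:
    if exc.status == 404:
        print("Pod not found -- the condition the agent appears to be hitting")
    else:
        raise
```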
n
yea, I'm not sure why this is occurring. The last log from the pod was that the last task was running
08:32:42.927 | INFO    | Task run 'run_bigquery_query-18a8523b-1' - hl-production-343314.media.pinterest_metrics_core
I found this on another flow run that was stuck running, I'm not sure if this is helpful
File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 227, in _submit_run_and_capture_errors
    result = await infrastructure.run(task_status=task_status)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 237, in run
    return await run_sync_in_worker_thread(self._watch_job, job_name)
  File "/usr/local/lib/python3.10/site-packages/prefect/utilities/asyncutils.py", line 57, in run_sync_in_worker_thread
    return await anyio.to_thread.run_sync(call, cancellable=True)
  File "/usr/local/lib/python3.10/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/usr/local/lib/python3.10/site-packages/prefect/infrastructure/kubernetes.py", line 432, in _watch_job
    status_code=first_container_status.state.terminated.exit_code,
AttributeError: 'NoneType' object has no attribute 'exit_code'
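The traceback suggests first_container_status.state.terminated was None, i.e. the Kubernetes API reported the container as still running or waiting (or returned a stale status) at the moment an exit code was read. A minimal sketch of guarding against that case with the Kubernetes Python client (not a patch to Prefect itself; pod name and namespace are placeholders):
```
# Sketch: only read exit_code when the container has actually terminated.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
core_v1 = client.CoreV1Api()

pod = core_v1.read_namespaced_pod(name="my-flow-run-pod", namespace="default")
first_container_status = pod.status.container_statuses[0]

terminated = first_container_status.state.terminated
if terminated is not None:
    print(f"Container exited with code {terminated.exit_code}")
else:
    # Still running or waiting: there is no exit code yet.
    print(f"Container state: {first_container_status.state}")
```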
k
I’m experiencing this a lot as well on K8s. My pods continue running and complete the tasks, but the agent somehow loses track of them at some point before completion and complains about:
An error occured while monitoring flow run {uuid}. The flow run will not be marked as failed, but an issue may have occurred.
This happens in about 1 in 20 runs. The flows then become stuck in the running state, causing problems for other runs due to concurrency limits in our pipeline. Would be so grateful for any insight into this issue 🙂 Thanks
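As a stopgap, runs that never leave Running can be moved out of that state manually so they stop holding concurrency slots. A sketch assuming the Prefect 2.x client API, with a placeholder flow run ID:
```
# Sketch, assuming the Prefect 2.x client API: force a stuck flow run into a
# Failed state so it releases concurrency slots. The flow run ID is a placeholder.
import asyncio

from prefect.client import get_client
from prefect.states import Failed


async def fail_stuck_run(flow_run_id: str) -> None:
    async with get_client() as client:
        await client.set_flow_run_state(
            flow_run_id,
            state=Failed(message="Marked failed manually; agent lost track of the pod"),
            force=True,  # bypass orchestration rules keeping the run in Running
        )


asyncio.run(fail_stuck_run("00000000-0000-0000-0000-000000000000"))
```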
z
Is a traceback displayed for the error?
If not, I would recommend enabling debug level logs.
k
traceback
z
I see, so while we are watching the job the pod is gone
Do you know what’s happening to the pod? 🙂
k
The job pod continues executing the flow and tasks to completion. (We stream logs from the pods to Elasticsearch.) In the case above, flow completion happened 15 seconds after the agent reported the error, i.e. the pod continued executing the flow with no problems. Also, there were 40 seconds between the last log message the agent logged from the pod and the agent throwing the error. (I guess there is a timeout on agents losing contact with pods.)
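On the timeout guess: the KubernetesJob infrastructure block in Prefect 2.x exposes watch timeouts that govern how long the agent waits on the Kubernetes watch streams; a sketch of raising them (the values are illustrative, not recommendations):
```
# Sketch, assuming Prefect 2.x KubernetesJob fields; values are illustrative.
from prefect.infrastructure import KubernetesJob

k8s_job = KubernetesJob(
    job_watch_timeout_seconds=600,  # how long to watch the job for completion
    pod_watch_timeout_seconds=600,  # how long to wait for the pod to start
)
```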
z
Would you mind opening an issue with these details and the logs?
It seems unlikely that I’ll be able to resolve it for you here in Slack and we’ll need to dig into it further.