# prefect-community
r
Hello, with several long running flows I get an error like:
prefect.infrastructure.process - Process 'meticulous-manatee' exited with status code: -9
Any guidance?
z
Is this on Linux? -9 is a weird status code. Do you have output from the process?
r
It is running in a Docker container on ECS that uses the prefect 2.0 base image. FWIW, the flow doesn’t show up as crashed/failed in the Docker UI
@Zanie Any thoughts on this?
z
I don’t know. I’d basically just be trying to google information about your situation and a status code of -9. That’s not a code we’ll ever set.
Can you get any logs or output from the process?
r
RE logs - there is no Python-related stack trace; the process logs normally and then just randomly dies with the message above. This occurs after a flow has been running for a long time (>1 hour with lots of logging; probably a log message every 20s on average). Note that the Prefect UI does not reflect the exit status with a “Crashed” label - the flow keeps showing as running while blocking any scheduled flows. This error might be related (?) https://prefect-community.slack.com/archives/C03D12VV4NN/p1659559088670819 which is why I am mentioning logging. Additionally, I am also seeing long-running flows (similar length/amount of logging) that die in the container with this error but are not reflected in the Prefect 2 UI (they continue to show as running):
01:44:36.945 | ERROR | prefect.agent - Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/96d5401d-c460-465a-873c-db373c1e0ca9/workspaces/3c95e6c2-fda6-4a21-8bf8-a35a3a5e2ba9/work_queues/f664d525-bc08-4740-a437-7c3a5d375bf8/get_runs'
Response: {'exception_message': 'Internal Server Error'}
For more information check: https://httpstatuses.com/500
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 128, in get_and_submit_flow_runs
    queue_runs = await self.client.get_runs_in_work_queue(
  File "/usr/local/lib/python3.10/site-packages/prefect/client.py", line 918, in get_runs_in_work_queue
    response = await self._client.post(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
(if that stack trace is missing information let me know; I am copying from ECS logs, which format it weirdly)
z
01:44:36.945 | ERROR | prefect.agent - Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/96d5401d-c460-465a-873c-db373c1e0ca9/workspaces/3c95e6c2-fda6-4a21-8bf8-a35a3a5e2ba9/work_queues/f664d525-bc08-4740-a437-7c3a5d375bf8/get_runs'
This log is coming from the agent failing to retrieve runs to submit. This doesn’t indicate failure of any of the runs it’s launched.
I found…
A “return code” of -9 indicates that the process was killed with SIGKILL. If you aren’t doing that yourself, the OOM killer is a likely culprit.
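For what it’s worth, here’s a quick sketch of how that mapping works, assuming a Linux-ish environment; the sleep child and the manual kill below just stand in for the OOM killer:

import signal
import subprocess

# When a child process is killed by a signal, its return code is the
# negated signal number: -9 therefore means SIGKILL, which is what the
# kernel OOM killer sends.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)  # stand-in for the OOM killer
proc.wait()

print(proc.returncode)                        # -9
print(signal.Signals(-proc.returncode).name)  # SIGKILL

And if the OOM killer is the culprit, dmesg on the host usually has a matching “Out of memory: Killed process …” entry.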
r
Thanks for the info on the OOM Stack Overflow post; I will investigate. On the Prefect error: will that (1) kill the Prefect agent and, (2) if so, do you have any thoughts on ways to restart an agent within an ECS container that dies like that? Normally I’d run stuff like this as a systemd service or similar, but Docker and systemd have not played nice in the past when I’ve tried that. I’m curious if there is a way to configure a healthcheck so that if the Prefect agent dies the container will be restarted.
z
Yeah that will kill the agent. We can update the agent to be robust to failures there.
In v1 we attached a healthcheck API to the agent for that very purpose. We have not done so yet in v2, but we could.
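In the meantime, something like this might work as a stopgap liveness check wired to your container’s health check (just a sketch; it assumes pgrep is available in the image and that the agent was launched with a command line containing “prefect agent”, so ECS can mark the container unhealthy and replace the task):

#!/usr/bin/env python3
"""Hypothetical liveness check: exit 0 if a `prefect agent` process is
running, nonzero otherwise."""
import subprocess
import sys

# `pgrep -f` matches against the full command line; a return code of 0
# means at least one matching process exists.
result = subprocess.run(["pgrep", "-f", "prefect agent"], capture_output=True)
sys.exit(0 if result.returncode == 0 else 1)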
Actually, nevermind re 1. The agent will not die on failures in the get_runs_in_work_queue call
It’ll log the error (as you see) but try again after its normal interval
e.g.
try:
    queue_runs = await self.client.get_runs_in_work_queue(
        id=work_queue.id, limit=10, scheduled_before=before
    )
    submittable_runs.extend(queue_runs)
except ObjectNotFound:
    self.logger.error(
        f"Work queue {work_queue.name!r} ({work_queue.id}) not found."
    )
except Exception as exc:
    self.logger.exception(exc)