# prefect-community
r
Hello, with several long running flows I get an error like:
prefect.infrastructure.process - Process 'meticulous-manatee' exited with status code: -9
Any guidance?
z
Is this on Linux? -9 is a weird status code. Do you have output from the process?
r
It is running in a Docker container on ECS that uses the prefect 2.0 base image. FWIW, the flow doesn’t show up as crashed/failed in the Docker UI
@Zanie Any thoughts on this?
z
I don’t know. I’d basically just be trying to google information about your situation and a status code of -9. That’s not a code we’ll ever set.
Can you get any logs or output from the process?
r
RE logs - there is no Python-related stack trace; the process logs normally and then just randomly dies with the message above. This occurs after a flow has been running for a long time (>1 hour with lots of logging; probably a log message every 20s on average). Note that the Prefect UI does not reflect the exit status with a “Crashed” label - the flow keeps showing as running while blocking any scheduled flows. This error might be related (?) https://prefect-community.slack.com/archives/C03D12VV4NN/p1659559088670819 which is why I am mentioning logging. Additionally, I am also seeing long-running flows (similar length/amount of logging) that die in the container with this error but are not reflected in the Prefect 2 UI (they continue to show as running):
01:44:36.945 | ERROR | prefect.agent - Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/96d5401d-c460-465a-873c-db373c1e0ca9/workspaces/3c95e6c2-fda6-4a21-8bf8-a35a3a5e2ba9/work_queues/f664d525-bc08-4740-a437-7c3a5d375bf8/get_runs'
Response: {'exception_message': 'Internal Server Error'}
For more information check: https://httpstatuses.com/500
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/prefect/agent.py", line 128, in get_and_submit_flow_runs
    queue_runs = await self.client.get_runs_in_work_queue(
  File "/usr/local/lib/python3.10/site-packages/prefect/client.py", line 918, in get_runs_in_work_queue
    response = await self._client.post(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/usr/local/lib/python3.10/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
(if that stack trace is missing information let me know; I am copying from ECS logs, which format it weirdly)
z
01:44:36.945 | ERROR | prefect.agent - Server error '500 Internal Server Error' for url 'https://api.prefect.cloud/api/accounts/96d5401d-c460-465a-873c-db373c1e0ca9/workspaces/3c95e6c2-fda6-4a21-8bf8-a35a3a5e2ba9/work_queues/f664d525-bc08-4740-a437-7c3a5d375bf8/get_runs'
This log is coming from the agent failing to retrieve runs to submit. This doesn’t indicate failure of any of the runs it’s launched.
I found…
A “return code” of -9 indicates that the process was killed with SIGKILL. If you aren’t doing that yourself, the OOM killer is a likely culprit.
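For what it’s worth, here’s a quick sketch of how that mapping works, assuming a Linux-ish environment; the sleep child and the manual kill below just stand in for the OOM killer:

import signal
import subprocess

# When a child process is killed by a signal, its return code is the
# negated signal number: -9 therefore means SIGKILL, which is what the
# kernel OOM killer sends.
proc = subprocess.Popen(["sleep", "60"])
proc.send_signal(signal.SIGKILL)  # stand-in for the OOM killer
proc.wait()

print(proc.returncode)                        # -9
print(signal.Signals(-proc.returncode).name)  # SIGKILL

And if the OOM killer is the culprit, dmesg on the host usually has a matching “Out of memory: Killed process …” entry.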
r
Thanks for the info on the OOM Stack Overflow post; I will investigate. On the Prefect error: will that (1) kill the Prefect agent and, (2) if so, do you have any thoughts on ways to restart an agent within an ECS container that dies like that? Normally I’d run stuff like this as a systemd service or similar, but Docker and systemd have not played nice in the past when I’ve tried that. I’m curious if there is a way to configure a healthcheck so that if the Prefect agent dies the container will be restarted.
z
Yeah that will kill the agent. We can update the agent to be robust to failures there.
In v1 we attached a healthcheck API to the agent for that very purpose. We have not done so yet in v2, but we could.
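In the meantime, something like this might work as a stopgap liveness check wired to your container’s health check (just a sketch; it assumes pgrep is available in the image and that the agent was launched with a command line containing “prefect agent”, so ECS can mark the container unhealthy and replace the task):

#!/usr/bin/env python3
"""Hypothetical liveness check: exit 0 if a `prefect agent` process is
running, nonzero otherwise."""
import subprocess
import sys

# `pgrep -f` matches against the full command line; a return code of 0
# means at least one matching process exists.
result = subprocess.run(["pgrep", "-f", "prefect agent"], capture_output=True)
sys.exit(0 if result.returncode == 0 else 1)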
Actually, nevermind re 1. The agent will not die on failures in the get_runs_in_work_queue call
It’ll log the error (as you see) but try again after its normal interval
e.g.
try:
    queue_runs = await self.client.get_runs_in_work_queue(
        id=work_queue.id, limit=10, scheduled_before=before
    )
    submittable_runs.extend(queue_runs)
except ObjectNotFound:
    self.logger.error(
        f"Work queue {work_queue.name!r} ({work_queue.id}) not found."
    )
except Exception as exc:
    self.logger.exception(exc)