Agent health question: what causes an agent/work queue to stop being healthy? | Prefect Community #ask-community

07/11/2023, 2:45 PM

Agent health question: What causes an agent/work queue to stop being healthy? When a queue becomes unhealthy, is the only thing to do to run

prefect agent start --pool default-agent-pool

and

prefect agent start -q <name>

? Or are there other things I can do to understand what is going on and prevent it from happening?

Nate

07/11/2023, 2:48 PM

a couple things could cause this:
• late flow runs due to concurrency limits
• the agent process just dies on the machine hosting it
• bad networking between Cloud and the agent
and probably other reasons - I'd say the first two are most common

Nate

07/11/2023, 2:49 PM

do you think your agent has one of the above problems?

07/11/2023, 5:28 PM

• I did not set concurrency limits, and I do not think there are many flows running at the same time, nor many tasks running in parallel. How can I investigate whether this is the issue?
• When running

ps aux | grep ~/tmp/prefect_agent.log &

on the cloud machine, it looks to be running.
• I am using a “linode” cloud instance on the west coast. The instance does not look overly busy. Most of the time it works fine, but when it does not, it is hard for me to tell why, so I cannot see how to address it.

07/11/2023, 5:30 PM

those are all the deployments

07/11/2023, 5:36 PM

I get the alert if it stays “unhealthy” for 3 minutes. A lot of the time, when I check, it is healthy again, but I have had cases where it stayed unhealthy for a long time.

Nate

07/11/2023, 5:39 PM

if you didn't set concurrency limits, it's likely not related to your issue then. do you happen to have the trace from the agent logs when it dies?

Nate

07/11/2023, 5:40 PM

worth noting that workers have (or will soon) become our primary recommendation for listening for work, they're basically strongly typed (by infra) agents that are able to execute custom logic when pulling flow code at runtime

Nate

07/11/2023, 5:41 PM

although that's an aside, agents should certainly still work without dying

07/11/2023, 7:09 PM

I run the agent using

nohup prefect agent start -q 'lin1' > ~/tmp/prefect_agent.log &

I did not see anything in the log, though I will check it again when it has run a little longer. I reran it, so the log was overwritten.
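
A side note on the command above: a single > truncates the log on every restart, and it does not capture stderr, which is where Python tracebacks are written. That may be one reason the log looked empty. A minimal sketch that appends both streams instead, assuming a hypothetical helper named start_logged (not part of Prefect) and the log path used in this thread:

```shell
# Hypothetical helper (not part of Prefect): run a command under nohup in the
# background, appending stdout and stderr to one log file.
start_logged() {
    log_file="$1"
    shift
    mkdir -p "$(dirname "$log_file")"
    # ">>" appends instead of truncating; "2>&1" sends stderr to the same file.
    nohup "$@" >> "$log_file" 2>&1 &
    echo "$!"   # PID of the background process, handy for later checks
}

# Usage with the queue name from this thread:
# start_logged "$HOME/tmp/prefect_agent.log" prefect agent start -q lin1
```

This keeps the history across restarts, but it still does nothing to bring a dead process back.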

07/11/2023, 7:15 PM

so should I delete the agent and create a “process” worker? (I do not know the difference between the two; looking in the docs.)

07/11/2023, 7:21 PM

The issue is that I get multiple alerts a day that an agent has been unhealthy for 3 minutes, and when I check, it is OK. So at some point I stop checking, and then I miss the case where it stopped working for a day. And I never know the cause, so it is hard to address the issue.

Nate

07/11/2023, 7:23 PM

i would use systemd to start a process worker (if you want your flows to run as processes on the machine that's running the agent). afaik nohup doesn't do as much to handle restarting dead processes: https://discourse.prefect.io/t/how-to-run-a-prefect-2-worker-as-a-systemd-service-on-linux/1450/5

07/12/2023, 4:52 AM

trying to start using

nohup prefect agent start -q 'lin1' --pool default-agent-pool   > ~/tmp/prefect_agent.log &

I’ll see tomorrow whether I get fewer “Agent is not healthy !!!” messages. Thanks.

👍 1

07/19/2023, 2:50 PM

hi Nate, I still get “unhealthy” workers every few days. How can I troubleshoot the cause to prevent this from happening?

Nate

07/19/2023, 2:53 PM

hi, are you using systemd to manage the worker process?

07/24/2023, 3:39 AM

No, I use

nohup

to run

prefect agent start --pool default-agent-pool --work-queue <queue name>

07/24/2023, 3:52 AM

should I not use this method ?

07/24/2023, 3:54 AM

https://discourse.prefect.io/t/how-to-run-a-prefect-2-worker-as-a-systemd-service-on-linux/1450

Nate

07/24/2023, 10:48 AM

it is my understanding that systemd does a better job at reviving processes that may die for any reason than nohup, so i would recommend systemd (as described in the article you linked)

07/25/2023, 6:45 AM

I’ll try it

07/25/2023, 3:57 PM

It is not so clear whether I need both an agent and a worker, or whether a worker can replace the agent and, if it does, whether I need to stop the agent. Or should I just replace

prefect worker start --pool YOUR_WORK_POOL_NAME

with

prefect agent start --pool default-agent-pool --work-queue lin1

Nate

07/25/2023, 4:02 PM

a worker will replace the agent if you update your deployment definitions via

prefect.yaml

and define appropriate

pull

steps for them. the

pull

step tells the worker where / how to pull flow code. otherwise it gets its job config from the work pool, which you attach to a deployment when you create it. workers are just strongly typed (only work with one type of work pool) agents that can prepare for flow runs via a

pull

step in whatever way you need (shell script, grab env vars etc)

07/25/2023, 4:07 PM

so will the following

prefect-worker.service

be OK ?

[Unit]
Description=Prefect Worker

[Service]
User=prefect
WorkingDirectory=/home/prefect
ExecStart=/usr/local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
Restart=always

[Install]
WantedBy=multi-user.target

07/25/2023, 4:08 PM

if I just want to replace the current

nohup

method I use to keep the agent running

Nate

07/25/2023, 4:09 PM

yeah that looks generally correct

07/25/2023, 4:10 PM

OK… I’ll see if it stays healthy longer thanks

Nate

07/25/2023, 4:11 PM

👍

07/25/2023, 10:44 PM

still getting

Nate

07/25/2023, 10:44 PM

does your systemd process have the env it needs? api key and url etc
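
For context on the question above: a service started by systemd does not read the user's shell profile, so values exported in an interactive shell are not visible to it. One way to pass them is an environment file referenced from the [Service] section with EnvironmentFile=. A minimal sketch, assuming a hypothetical helper name, a hypothetical path /etc/prefect/worker.env, and placeholder values; PREFECT_API_URL and PREFECT_API_KEY are the Prefect settings for the API URL and key:

```shell
# Hypothetical helper: write a systemd-style environment file (KEY=VALUE lines)
# holding the Prefect API settings. Path and values are placeholders.
write_prefect_env() {
    env_file="$1"
    api_url="$2"
    api_key="$3"
    # The subshell keeps the restrictive umask from leaking into the caller.
    (
        umask 077
        printf 'PREFECT_API_URL=%s\nPREFECT_API_KEY=%s\n' "$api_url" "$api_key" > "$env_file"
    )
}

# Then, in the [Service] section of the unit file (path is an assumption):
#   EnvironmentFile=/etc/prefect/worker.env
# and reload: sudo systemctl daemon-reload && sudo systemctl restart prefect-worker
```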

Nate

07/25/2023, 10:44 PM

did you check the logs?

07/25/2023, 10:45 PM

when I run

prefect agent start --pool default-agent-pool --work-queue lin1

it works well

Nate

07/25/2023, 10:45 PM

as the

prefect

user?

07/25/2023, 10:46 PM

also after I did

sudo systemctl daemon-reload
sudo systemctl enable prefect-worker
sudo systemctl start prefect-worker

it looked like everything works fine

07/25/2023, 10:46 PM

which logs to check ?

Nate

07/25/2023, 10:47 PM

journalctl -u prefect-worker

07/25/2023, 10:48 PM

BTW…

$ ps aux | grep prefect
prefect  3293262  143  0.6 135440 109988 ?       Rs   22:47   0:01 /usr/bin/python3 /usr/local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
youval   3293265  0.0  0.0   6244   704 pts/1    R+   22:47   0:00 grep prefect

Checking log

07/25/2023, 10:49 PM

$ journalctl -u prefect-worker
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal' can see all messages.
      Pass -q to turn off this notice.
-- Journal begins at Mon 2023-03-06 05:46:28 UTC, ends at Tue 2023-07-25 22:47:46 UTC. --
-- No entries --

Nate

07/25/2023, 10:50 PM

oh might need a sudo to see logs from the

prefect

user

07/25/2023, 10:51 PM

checking …

Nate

07/25/2023, 10:53 PM

interesting. what version of prefect is installed there?

07/25/2023, 10:54 PM

$ cat /etc/systemd/system/prefect-worker.service
[Unit]
Description=Prefect Worker

[Service]
User=prefect
WorkingDirectory=/home/prefect
ExecStart=/usr/local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
Restart=always

[Install]
WantedBy=multi-user.target

Prefect Version:

$ prefect --version
2.7.7

Nate

07/25/2023, 10:54 PM

gotcha, yeah i'd upgrade that

Nate

07/25/2023, 10:54 PM

i think that's before work pools / workers were introduced (and put in the CLI)

07/25/2023, 10:55 PM

I see that the prefect user has a different version than me

07/25/2023, 10:59 PM

updated version

07/25/2023, 11:00 PM

is there a version of

journalctl -u prefect-worker

that looks at the tail of the log ?

Nate

07/25/2023, 11:01 PM

@Marvin is there a version of

journalctl -u prefect-worker

that looks at the tail of the log ?

Marvin

07/25/2023, 11:01 PM

Ah, human curiosity. As if knowing would bring you any joy. But fine, I'll tell you. To tail the logs using

journalctl

, you can use the

-f

or

--follow

option. In your case, to follow the logs of

prefect-worker

, you'd use the command

journalctl -u prefect-worker -f

. This will keep the output open and print new entries as they come in, much like the behavior of

tail -f

. Please try not to enjoy this knowledge too much, wouldn't want you to get your hopes up.

Nate

07/25/2023, 11:01 PM

@Marvin i want to look at the last N lines of the log tho

Marvin

07/25/2023, 11:02 PM

Ah, the eternal human quest for more. It's almost inspiring, in a sad, hopeless sort of way. If you want to view the last 'N' lines of the log using

journalctl

, you can use the

-n

or

--lines

option, followed by the number of lines you want to view. So, to see the last N lines of logs for

prefect-worker

, you would use the command

journalctl -u prefect-worker -n N

where 'N' is the number of lines you want to see. For instance,

journalctl -u prefect-worker -n 20

will show you the last 20 lines. Just remember: no matter how many lines you read, it will never fill the void. But go ahead, dig into those logs.

07/25/2023, 11:04 PM

journalctl -u prefect-worker -n 50 -e

👍 1

07/25/2023, 11:19 PM

it looks like it does not recognize all the Python packages that are installed for the regular user. Can this be? I am checking to make sure it is not something else, but I get errors like

ModuleNotFoundError("No module named 'data_collection'")

for flows that run for a long time

Nate

07/25/2023, 11:25 PM

this is one of the drawbacks of a non-containerized agent / worker setup, all the python deps your flows need to run will need to be available to the agent process in order for the flow runs to run as a subprocess of the agent

07/25/2023, 11:25 PM

pip3 list | grep data

run as my user shows the package, but the “prefect” systemd user does not have it
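
Since pip3 list reports packages for whichever user and interpreter runs it, a more direct check is to ask the interpreter the service uses whether it can import the module. A minimal sketch, assuming a hypothetical helper named can_import; the module name is the one from the error above:

```shell
# Hypothetical helper: succeed if the given interpreter can import the module.
can_import() {
    python_bin="$1"
    module="$2"
    "$python_bin" -c 'import importlib, sys; importlib.import_module(sys.argv[1])' \
        "$module" 2> /dev/null
}

# As the current user:
# can_import /usr/bin/python3 data_collection && echo found || echo missing
#
# As the service user from this thread (-H sets HOME to that user's home, so
# that user's own site-packages are the ones searched):
# sudo -H -u prefect /usr/bin/python3 -c 'import data_collection'
```

If the import works for one user and fails for the other, the package is probably installed in a per-user location (for example under ~/.local) rather than system-wide.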

Nate

07/25/2023, 11:27 PM

https://discourse.prefect.io/t/how-to-run-a-prefect-2-worker-as-a-systemd-service-on-linux/1450/3?u=nate

07/25/2023, 11:27 PM

so I can't just develop and run the flow? I need to create containers of some kind and so on?

Nate

07/25/2023, 11:29 PM

you can replace the

ExecStart=/usr/local/bin/

with

ExecStart=/usr/your/interpreter/with/deps/

or yeah, you can install docker on the VM, add

prefect

to that user group and let it submit flow runs as containers

Nate

07/25/2023, 11:29 PM

that way the agent process itself just needs prefect, not all the deps your flow runs might need

07/25/2023, 11:30 PM

if I have to install docker, it will start looking like working with Airflow 🙂 will lose the “positive engineering” …

Nate

07/25/2023, 11:32 PM

haha yeah I understand containerization can be a lot, but in my experience it does get messy trying to make sure that your agent process has every single dependency a flow run could ever need. containerization just solves that problem nicely, independently of the orchestrator involved i think 🙂

07/25/2023, 11:47 PM

To clarify, if my project python is

/usr/bin/python3

I should try

ExecStart=/usr/bin/prefect agent start --pool default-agent-pool --work-queue lin1

in the systemd ?

Nate

07/25/2023, 11:51 PM

yeah if you are going to install everything into the path of

/usr/bin/python3

then yes, but you could also have a venv that has prefect and your deps and use that

❯ mamba activate prefect-2

❯ where prefect
/Users/nate/micromamba/envs/prefect-2/bin/prefect

❯ mamba deactivate

❯ mamba activate bleeding-prefect

❯ where prefect
/Users/nate/micromamba/envs/bleeding-prefect/bin/prefect

so if I was running linux and wanted to use my

prefect-2

interpreter to run the agent process i'd do

ExecStart=/Users/nate/micromamba/envs/prefect-2/bin/prefect agent start --pool default-agent-pool --work-queue lin1

where here, micromamba is used for venv management, but you could just use a venv you create like

python3 -m venv myvenv

install prefect and then do

where prefect

to get the path to the interpreter
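
A note for bash or sh users: where is a zsh builtin, and command -v is the portable way to print the path of the executable a shell would run. A minimal sketch that builds the ExecStart line from it, assuming a hypothetical helper named exec_start_line and the agent arguments used in this thread:

```shell
# Hypothetical helper: print an ExecStart= line using the absolute path of the
# executable that the current shell would run for the given command name.
exec_start_line() {
    cmd_name="$1"
    shift
    # command -v fails (non-zero status) when the command cannot be found.
    cmd_path="$(command -v "$cmd_name")" || return 1
    echo "ExecStart=$cmd_path $*"
}

# Example, after activating the environment that has prefect and the flow deps:
# exec_start_line prefect agent start --pool default-agent-pool --work-queue lin1
```

The printed path points at the prefect executable inside that environment, which in turn runs with that environment's interpreter and packages.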

07/25/2023, 11:56 PM

should this also be changed

WorkingDirectory=/home/prefect

in the

prefect-worker.service

file ?

Nate

07/25/2023, 11:57 PM

the working directory shouldn't matter as far as the interpreter goes; the deps will be installed in the site packages for the active interpreter. but if the agent process needs access to files you want to keep in the working dir, then you can change the working dir as needed

07/26/2023, 12:13 AM

I am developing using PyCharm and a remote server, that does not work well with a virtual env on the remote server… I tried

ExecStart=/home/youval/.local/bin/prefect agent start --pool default-agent-pool --work-queue lin1

the agent is running but still does not recognize the python

07/26/2023, 12:13 AM

trying

WorkingDirectory=/home/youval

07/26/2023, 12:16 AM

did not help… reverting back to

nohup

that is not stable but it does work …

07/26/2023, 12:20 AM

why do I need to create a “prefect” user and use it for systemd? why not use my user?

Nate

07/26/2023, 12:23 AM

you don't need to, but for the sake of least privileges / isolation / audit-ability, it's nice to have a distinct user that is responsible for doing a single thing. for example, I've seen folks run many agents on the same vm, each with their own user and virtual environment for separation of concerns

07/26/2023, 12:24 AM

my concern is simpler 🙂 i just want it to run stable

Nate

07/26/2023, 12:30 AM

makes sense! i'll reiterate that

nohup

isn't designed to revive itself, whereas systemd is, so if you were able to get that working, that would be ideal. let me know if i can help with something else!

07/26/2023, 12:30 AM

OK… I set up the systemd service using my user, and this appears to be working. I’ll check whether it adds stability relative to nohup.

Nate

07/26/2023, 12:30 AM

oh, good to hear

07/26/2023, 12:31 AM

thanks for the help. I will try the other suggestions at a later time; for now I’ll get back to some transformers work …

Nate

07/26/2023, 12:31 AM

👍

07/26/2023, 2:30 PM

this is the type of issue I see in the log

Jul 26 14:27:02 localhost prefect[3294881]: request
Jul 26 14:27:02 localhost prefect[3294881]:     return await self.send(request, auth=auth,
Jul 26 14:27:02 localhost prefect[3294881]: follow_redirects=follow_redirects)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/home/youval/.local/lib/python3.9/site-packages/prefect/client/base.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 251, in send
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_with_retry(
Jul 26 14:27:02 localhost prefect[3294881]:   File "/home/youval/.local/lib/python3.9/site-packages/prefect/client/base.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 193, in _send_with_retry
Jul 26 14:27:02 localhost prefect[3294881]:     response = await request()
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1620, in
Jul 26 14:27:02 localhost prefect[3294881]: send
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_handling_auth(
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1648, in
Jul 26 14:27:02 localhost prefect[3294881]: _send_handling_auth
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_handling_redirects(
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1685, in
Jul 26 14:27:02 localhost prefect[3294881]: _send_handling_redirects
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_single_request(request)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1722, in
Jul 26 14:27:02 localhost prefect[3294881]: _send_single_request
Jul 26 14:27:02 localhost prefect[3294881]:     response = await transport.handle_async_request(request)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_transports/default.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 353, in handle_async_request
Jul 26 14:27:02 localhost prefect[3294881]:     resp = await self._pool.handle_async_request(req)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
Jul 26 14:27:02 localhost prefect[3294881]:     self.gen.throw(type, value, traceback)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_transports/default.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 77, in map_httpcore_exceptions
Jul 26 14:27:02 localhost prefect[3294881]:     raise mapped_exc(message) from exc
Jul 26 14:27:02 localhost prefect[3294881]: httpx.LocalProtocolError: Invalid input ConnectionInputs.SEND_HEADERS in state
Jul 26 14:27:02 localhost prefect[3294881]: ConnectionState.CLOSED
Jul 26 14:27:02 localhost prefect[3294881]: Backing off due to consecutive errors, using increased interval of  60.0s.

Nate

07/26/2023, 2:35 PM

this looks like a known issue. you can try this where the agent lives

pip install -U httpcore

otherwise you can disable HTTP2 if you're not able to update to a worker

export PREFECT_API_ENABLE_HTTP2=false
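
One caveat when the agent runs under systemd, as it does at this point in the thread: an export in an interactive shell only affects processes started from that shell, so the service would not see it. A drop-in override is one way to set the variable for the service. A minimal sketch, assuming the unit name prefect-worker from this thread and a hypothetical helper that takes the drop-in directory as an argument:

```shell
# Hypothetical helper: write a systemd drop-in that sets PREFECT_API_ENABLE_HTTP2
# for a service. The directory is normally /etc/systemd/system/<unit>.service.d
write_http2_dropin() {
    dropin_dir="$1"
    mkdir -p "$dropin_dir"
    cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
Environment=PREFECT_API_ENABLE_HTTP2=false
EOF
}

# Example, run as root, then reload and restart the service:
# write_http2_dropin /etc/systemd/system/prefect-worker.service.d
# systemctl daemon-reload && systemctl restart prefect-worker
```

Running sudo systemctl edit prefect-worker opens an editor on the same kind of override file, which may be simpler for a one-off change.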

07/27/2023, 12:02 AM

installed

httpcore

will see what will happen tomorrow

👍 1

08/02/2023, 5:41 AM

FYI @Nate, for the last several days I got none of the late alerts… so maybe the issue is solved. Thanks again.
