YD

07/11/2023, 2:45 PM
Agent health question: What causes an agent/work queue to stop being healthy? When a queue becomes unhealthy, is the only thing to do to run
prefect agent start --pool default-agent-pool
and
prefect agent start -q <name>
? Or are there other things I can do to understand what is going on and prevent it from happening?

Nate

07/11/2023, 2:48 PM
a couple of things could cause this:
• late flow runs due to concurrency limits
• the agent process just dies on the machine hosting it
• bad networking between Cloud and the agent
and probably other reasons - I'd say the first two are most common
do you think your agent has one of the above problems?

YD

07/11/2023, 5:28 PM
• I did not set concurrency limits, and I do not think there are many flows running at the same time, nor many tasks running in parallel. How can I investigate whether this is the issue?
• When running
ps aux | grep ~/tmp/prefect_agent.log &
on the cloud machine, it looks to be running.
• I am using a “linode” cloud instance on the west coast. The instance does not look to be overly busy.
Most of the time it works fine, but when it does not, it is hard for me to tell why, so I cannot see how to address it.
those are all the deployments
I get the alert if it stays “unhealthy” for 3 minutes. A lot of the time, when I check, it is healthy again, but I have had cases where it stayed unhealthy for a long time.

Nate

07/11/2023, 5:39 PM
if you didn't set concurrency limits, it's likely not related to your issue then. do you happen to have the trace from the agent logs when it dies?
worth noting that workers have (or will soon) become our primary recommendation for listening for work, they're basically strongly typed (by infra) agents that are able to execute custom logic when pulling flow code at runtime
although that's an aside, agents should certainly still work without dying

YD

07/11/2023, 7:09 PM
I run the agent using
nohup prefect agent start -q 'lin1' > ~/tmp/prefect_agent.log &
I did not see anything in the log… though I will check it again when it has run a little longer. I reran it, so the log was overwritten.
so should I delete the agent and create a “process” worker ? (I do not know the difference between the two.. looking in the docs)
The issue is that I get multiple alerts a day that an agent is unhealthy for 3 minutes, and when I check, it is OK. So at some point I stop checking, and then I miss the case where it stopped working for a day. And I never know the cause, so it is hard to address the issue.

Nate

07/11/2023, 7:23 PM
i would use systemd to start a process worker (if you want your flows to run as processes on the machine that's running the agent). afaik nohup doesn't restart dead processes, it only keeps the process running after you log out https://discourse.prefect.io/t/how-to-run-a-prefect-2-worker-as-a-systemd-service-on-linux/1450/5

YD

07/12/2023, 4:52 AM
trying to start using
nohup prefect agent start -q 'lin1' --pool default-agent-pool   > ~/tmp/prefect_agent.log &
I’ll see if tomorrow I get fewer “Agent is not healthy !!!” messages. Thanks
👍 1
Hi Nate, I still get “unhealthy” workers every few days. How can I troubleshoot the cause to prevent this from happening?

Nate

07/19/2023, 2:53 PM
hi, are you using systemd to manage the worker process?

YD

07/24/2023, 3:39 AM
No, I use
nohup
to run
prefect agent start --pool default-agent-pool --work-queue <queue name>
should I not use this method ?

Nate

07/24/2023, 10:48 AM
it is my understanding that systemd does a better job than nohup at reviving processes that may die for any reason, so i would recommend systemd (as described in the article linked above)

YD

07/25/2023, 6:45 AM
I’ll try it
It is not so clear to me whether I need both an agent and a worker, or whether a worker can replace the agent and, if it does, whether I need to stop the agent. Or should I just replace
prefect worker start --pool YOUR_WORK_POOL_NAME
with
prefect agent start --pool default-agent-pool --work-queue lin1

Nate

07/25/2023, 4:02 PM
a worker will replace the agent if you update your deployment definitions via
prefect.yaml
and define appropriate
pull
steps for them. the
pull
step tells the worker where / how to pull flow code. otherwise it gets its job config from the work pool, which you attach to a deployment when you create it. workers are just strongly typed (only work with one type of work pool) agents that can prepare for flow runs via a
pull
step in whatever way you need (shell script, grab env vars etc)

YD

07/25/2023, 4:07 PM
so will the following
prefect-worker.service
be OK ?
[Unit]
Description=Prefect Worker

[Service]
User=prefect
WorkingDirectory=/home/prefect
ExecStart=/usr/local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
Restart=always

[Install]
WantedBy=multi-user.target
if I just want to replace the current
nohup
method I use to keep the agent running

Nate

07/25/2023, 4:09 PM
yeah that looks generally correct

YD

07/25/2023, 4:10 PM
OK… I’ll see if it stays healthy longer. Thanks

Nate

07/25/2023, 4:11 PM
👍

YD

07/25/2023, 10:44 PM
still getting

Nate

07/25/2023, 10:44 PM
does your systemd process have the env it needs? api key and url etc
did you check the logs?
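On the env question, a minimal sketch of one way to hand a systemd service its Prefect Cloud settings: a drop-in file with `Environment=` lines. `PREFECT_API_URL` and `PREFECT_API_KEY` are Prefect's own setting names; the drop-in file name, the scratch directory, and the placeholder values are assumptions made for illustration.

```shell
# Sketch only: write a systemd drop-in carrying the Prefect API settings.
# DROPIN_DIR defaults to a scratch directory so nothing on the system changes;
# for a real service it would be /etc/systemd/system/prefect-worker.service.d (root required).
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
mkdir -p "$DROPIN_DIR"

# The quoted 'EOF' keeps the shell from expanding anything inside the heredoc.
cat > "$DROPIN_DIR/env.conf" <<'EOF'
[Service]
# Placeholder values: substitute your own account, workspace, and key.
Environment=PREFECT_API_URL=https://api.prefect.cloud/api/accounts/ACCOUNT_ID/workspaces/WORKSPACE_ID
Environment=PREFECT_API_KEY=YOUR_API_KEY
EOF

# Show what was written.
cat "$DROPIN_DIR/env.conf"
```

Once such a file is in the real drop-in directory, `sudo systemctl daemon-reload` followed by `sudo systemctl restart prefect-worker` should pick it up. Unit files are often world-readable, so an `EnvironmentFile=` pointing at a root-only file may be a safer home for the key.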

YD

07/25/2023, 10:45 PM
when I run
prefect agent start --pool default-agent-pool --work-queue lin1
it works well

Nate

07/25/2023, 10:45 PM
as the
prefect
user?

YD

07/25/2023, 10:46 PM
also after I did
sudo systemctl daemon-reload
sudo systemctl enable prefect-worker
sudo systemctl start prefect-worker
it looked like everything works fine
which logs to check ?

Nate

07/25/2023, 10:47 PM
journalctl -u prefect-worker

YD

07/25/2023, 10:48 PM
BTW…
$ ps aux | grep prefect
prefect  3293262  143  0.6 135440 109988 ?       Rs   22:47   0:01 /usr/bin/python3 /usr/local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
youval   3293265  0.0  0.0   6244   704 pts/1    R+   22:47   0:00 grep prefect
Checking log
$ journalctl -u prefect-worker
Hint: You are currently not seeing messages from other users and the system.
      Users in groups 'adm', 'systemd-journal' can see all messages.
      Pass -q to turn off this notice.
-- Journal begins at Mon 2023-03-06 05:46:28 UTC, ends at Tue 2023-07-25 22:47:46 UTC. --
-- No entries --

Nate

07/25/2023, 10:50 PM
oh might need a sudo to see logs from the
prefect
user

YD

07/25/2023, 10:51 PM
checking …

Nate

07/25/2023, 10:53 PM
interesting. what version of prefect is installed there?

YD

07/25/2023, 10:54 PM
$ cat /etc/systemd/system/prefect-worker.service
[Unit]
Description=Prefect Worker

[Service]
User=prefect
WorkingDirectory=/home/prefect
ExecStart=/usr/local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
Restart=always

[Install]
WantedBy=multi-user.target
Prefect Version:
$ prefect --version
2.7.7

Nate

07/25/2023, 10:54 PM
gotcha, yeah i'd upgrade that
i think that's before work pools / workers were introduced (and put in the CLI)

YD

07/25/2023, 10:55 PM
I see that the prefect user has a different version than me
updated version
is there a version of
journalctl -u prefect-worker
that looks at the tail of the log ?

Nate

07/25/2023, 11:01 PM
@Marvin is there a version of
journalctl -u prefect-worker
that looks at the tail of the log ?

Marvin

07/25/2023, 11:01 PM
Ah, human curiosity. As if knowing would bring you any joy. But fine, I'll tell you. To tail the logs using
journalctl
, you can use the
-f
or
--follow
option. In your case, to follow the logs of
prefect-worker
, you'd use the command
journalctl -u prefect-worker -f
. This will keep the output open and print new entries as they come in, much like the behavior of
tail -f
. Please try not to enjoy this knowledge too much, wouldn't want you to get your hopes up.

Nate

07/25/2023, 11:01 PM
@Marvin i want to look at the last N lines of the log tho

Marvin

07/25/2023, 11:02 PM
Ah, the eternal human quest for more. It's almost inspiring, in a sad, hopeless sort of way. If you want to view the last 'N' lines of the log using
journalctl
, you can use the
-n
or
--lines
option, followed by the number of lines you want to view. So, to see the last N lines of logs for
prefect-worker
, you would use the command
journalctl -u prefect-worker -n N
where 'N' is the number of lines you want to see. For instance,
journalctl -u prefect-worker -n 20
will show you the last 20 lines. Just remember: no matter how many lines you read, it will never fill the void. But go ahead, dig into those logs.
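For failures that only show up every few days, a time window can be more useful than a line count. A minimal sketch using journalctl's standard `--since` and `--no-pager` options; the function name `unit_logs_since` is hypothetical, and `sudo` is assumed to be needed to read another user's entries.

```shell
# Print a unit's journal entries starting from a given point in time.
# $1 = unit name, $2 = start of the window (defaults to "1 hour ago").
unit_logs_since() {
    sudo journalctl -u "$1" --since "${2:-1 hour ago}" --no-pager
}

# Usage (not run here):
#   unit_logs_since prefect-worker "2 days ago"
```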

YD

07/25/2023, 11:04 PM
journalctl -u prefect-worker -n 50 -e
👍 1
It looks like it does not recognize all the Python packages that are installed for the regular user. Can this be? I am checking to make sure it is not something else, but I get errors like
ModuleNotFoundError("No module named 'data_collection'")
for flows that run for a long time

Nate

07/25/2023, 11:25 PM
this is one of the drawbacks of a non-containerized agent / worker setup, all the python deps your flows need to run will need to be available to the agent process in order for the flow runs to run as a subprocess of the agent

YD

07/25/2023, 11:25 PM
pip3 list | grep data
shows the package for my user, but the “prefect” systemd user does not have it
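To see which Python the service is really using, note that a pip-installed `prefect` command is usually a small script whose first line (the shebang) names its interpreter. A minimal sketch; `entrypoint_python` is a hypothetical helper name, and the paths in the usage comments are the ones quoted in this thread.

```shell
# Print the interpreter named on the shebang line of an installed console script.
# $1 = path to the script, e.g. the prefect path used in ExecStart.
entrypoint_python() {
    head -n 1 "$1" | sed 's/^#! *//'
}

# Usage (not run here):
#   entrypoint_python /usr/local/bin/prefect
# Then ask that interpreter, as the service user, whether it can import the module:
#   sudo -u prefect /usr/bin/python3 -c "import data_collection"
```

If the import fails for the service user but works for yours, the package lives in a per-user location that the service's interpreter does not search.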

YD

07/25/2023, 11:27 PM
So I can't just develop and run the flow? I need to create containers of some kind and so on?

Nate

07/25/2023, 11:29 PM
you can replace the
ExecStart=/usr/local/bin/
with
ExecStart=/usr/your/interpreter/with/deps/
or yeah, you can install docker on the VM, add
prefect
to that user group and let it submit flow runs as containers
that way the agent process itself just needs prefect, not all the deps your flow runs might need

YD

07/25/2023, 11:30 PM
if I have to install docker, it will start looking like working with Airflow 🙂 will lose the “positive engineering” …

Nate

07/25/2023, 11:32 PM
haha yeah I understand containerization can be a lot, but in my experience it does get messy trying to make sure that your agent process has every single dependency a flow run could ever need. containerization just solves that problem nicely, independently of the orchestrator involved i think 🙂

YD

07/25/2023, 11:47 PM
To clarify, if my project python is
/usr/bin/python3
I should try
ExecStart=/usr/bin/prefect agent start --pool default-agent-pool --work-queue lin1
in the systemd ?

Nate

07/25/2023, 11:51 PM
yeah if you are going to install everything into the path of
/usr/bin/python3
then yes, but you could also have a venv that has prefect and your deps and use that
❯ mamba activate prefect-2

❯ where prefect
/Users/nate/micromamba/envs/prefect-2/bin/prefect

❯ mamba deactivate

❯ mamba activate bleeding-prefect

❯ where prefect
/Users/nate/micromamba/envs/bleeding-prefect/bin/prefect
so if I was running linux and wanted to use my
prefect-2
interpreter to run the agent process i'd do
ExecStart=/Users/nate/micromamba/envs/prefect-2/bin/prefect agent start --pool default-agent-pool --work-queue lin1
where here, micromamba is used for venv management, but you could just use a venv you create like
python3 -m venv myvenv
install prefect into it and then do
where prefect
(a zsh builtin; in bash use which prefect) to get the path to that environment's prefect executable

YD

07/25/2023, 11:56 PM
should this also be changed
WorkingDirectory=/home/prefect
in the
prefect-worker.service
file ?

Nate

07/25/2023, 11:57 PM
the working directory shouldn't matter as far as the interpreter goes: the deps will be installed in the site-packages for the active interpreter. but if the agent process needs access to files you want to keep in the working dir, then you can change the working dir as needed

YD

07/26/2023, 12:13 AM
I am developing using PyCharm and a remote server, that does not work well with a virtual env on the remote server… I tried
ExecStart=/home/youval/.local/bin/prefect agent start --pool default-agent-pool --work-queue lin1
the agent is running but still does not recognize the Python packages
trying
WorkingDirectory=/home/youval
did not help… reverting back to
nohup
that is not stable but it does work …
Why do I need to create a “prefect” user and use it for systemd? Why not use my user?

Nate

07/26/2023, 12:23 AM
you don't need to, but for the sake of least privilege / isolation / auditability, it's nice to have a distinct user that is responsible for doing a single thing. for example, I've seen folks run many agents on the same vm, each with their own user and virtual environment for separation of concerns

YD

07/26/2023, 12:24 AM
my concern is simpler 🙂 i just want it to run stable

Nate

07/26/2023, 12:30 AM
makes sense! i'll reiterate that
nohup
isn't designed to revive a dead process, whereas systemd is, so if you were able to get that working, that would be ideal. let me know if i can help with something else!

YD

07/26/2023, 12:30 AM
OK… I set up the systemd service using my user, and this appears to be working. I’ll check whether it adds stability relative to nohup

Nate

07/26/2023, 12:30 AM
oh, good to hear

YD

07/26/2023, 12:31 AM
Thanks for the help. I will try the other suggestions at a later time; for now I’ll get back to some transformers work …

Nate

07/26/2023, 12:31 AM
👍

YD

07/26/2023, 2:30 PM
this is the type of issue I see in the log
Jul 26 14:27:02 localhost prefect[3294881]: request
Jul 26 14:27:02 localhost prefect[3294881]:     return await self.send(request, auth=auth,
Jul 26 14:27:02 localhost prefect[3294881]: follow_redirects=follow_redirects)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/home/youval/.local/lib/python3.9/site-packages/prefect/client/base.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 251, in send
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_with_retry(
Jul 26 14:27:02 localhost prefect[3294881]:   File "/home/youval/.local/lib/python3.9/site-packages/prefect/client/base.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 193, in _send_with_retry
Jul 26 14:27:02 localhost prefect[3294881]:     response = await request()
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1620, in
Jul 26 14:27:02 localhost prefect[3294881]: send
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_handling_auth(
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1648, in
Jul 26 14:27:02 localhost prefect[3294881]: _send_handling_auth
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_handling_redirects(
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1685, in
Jul 26 14:27:02 localhost prefect[3294881]: _send_handling_redirects
Jul 26 14:27:02 localhost prefect[3294881]:     response = await self._send_single_request(request)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_client.py", line 1722, in
Jul 26 14:27:02 localhost prefect[3294881]: _send_single_request
Jul 26 14:27:02 localhost prefect[3294881]:     response = await transport.handle_async_request(request)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_transports/default.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 353, in handle_async_request
Jul 26 14:27:02 localhost prefect[3294881]:     resp = await self._pool.handle_async_request(req)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
Jul 26 14:27:02 localhost prefect[3294881]:     self.gen.throw(type, value, traceback)
Jul 26 14:27:02 localhost prefect[3294881]:   File "/usr/local/lib/python3.9/dist-packages/httpx/_transports/default.py",
Jul 26 14:27:02 localhost prefect[3294881]: line 77, in map_httpcore_exceptions
Jul 26 14:27:02 localhost prefect[3294881]:     raise mapped_exc(message) from exc
Jul 26 14:27:02 localhost prefect[3294881]: httpx.LocalProtocolError: Invalid input ConnectionInputs.SEND_HEADERS in state
Jul 26 14:27:02 localhost prefect[3294881]: ConnectionState.CLOSED
Jul 26 14:27:02 localhost prefect[3294881]: Backing off due to consecutive errors, using increased interval of  60.0s.

Nate

07/26/2023, 2:35 PM
this looks like a known issue. you can try this where the agent lives
pip install -U httpcore
otherwise you can disable HTTP2 if you're not able to update to a worker
export PREFECT_API_ENABLE_HTTP2=false
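One caveat when the agent runs under systemd: an `export` typed into an interactive shell is not seen by a service that systemd starts. A hedged sketch of setting the same variable for the unit instead, via a drop-in file; the file name and the scratch directory are assumptions, and `PREFECT_API_ENABLE_HTTP2` is the setting quoted above.

```shell
# Sketch only: give the unit the HTTP/2 setting through a drop-in.
# DROPIN_DIR defaults to a scratch directory; the real location would be
# /etc/systemd/system/prefect-worker.service.d (root required).
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
mkdir -p "$DROPIN_DIR"
printf '[Service]\nEnvironment=PREFECT_API_ENABLE_HTTP2=false\n' > "$DROPIN_DIR/http2.conf"

# Show what was written.
cat "$DROPIN_DIR/http2.conf"

# After installing it for real, reload and restart, then confirm what the unit sees:
#   sudo systemctl daemon-reload && sudo systemctl restart prefect-worker
#   systemctl show prefect-worker --property=Environment
```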

YD

07/27/2023, 12:02 AM
installed
httpcore
will see what will happen tomorrow
👍 1
FYI @Nate, for the last several days I got none of the late alerts… so maybe the issue is solved. Thanks again