p
Hey Folks, how can I check the health of a Prefect Agent in 2.0? It is running inside a Docker container and I’m trying to figure out a reliable health check locally instead of pinging the Orion server.
k
Hi @Prem Viswanathan, you should be able to see the agent health check on the work queue in the UI.
p
Is there a programmatic way to do it? Ideally from within the container where the agent is running.
So I’m assuming this isn’t really an option? How does the Orion server actually check if the agent is healthy?
z
The Orion server gets pings from the agent
In v1, we have the agent host a little server that has a /health endpoint; we could do the same here. If the agent process is running, it should be healthy, though. If it’s running and unhealthy, that’s a bug.
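For illustration only, a small health server of the kind described for v1 could be sketched with Python's standard library. This is not Prefect's implementation or API; the names `HealthHandler` and `start_health_server`, the default port 8080, and the `ok` response body are all assumptions made up for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: GET /health returns 200, anything else 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b"ok"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging so frequent probes do not flood the logs.
        pass


def start_health_server(host="127.0.0.1", port=8080):
    """Serve /health from a daemon thread and return the server object.

    The thread only answers while the hosting process is alive, which
    matches the idea that a running agent process counts as healthy.
    """
    server = ThreadingHTTPServer((host, port), HealthHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A process that wraps the agent could call `start_health_server()` once at startup; a probe would then request `http://127.0.0.1:8080/health` from inside the container.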
p
okay, so with v2, does the /health endpoint option still exist? Can we enable it?
z
No, we haven’t added one. I don’t see the point of a health endpoint if the process exits when unhealthy. What’s your use-case?
p
We run a certain number of Agents as a service on ECS; they pick flows from a queue and execute them. I’m trying to figure out if I need an explicit local health check within the Agent Task to trigger the removal of that task container and the scale-up of a replacement agent container.
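As a rough sketch of what an explicit local check could look like, one option is a heartbeat file: something alongside the agent touches a file on every polling cycle, and the container health-check command reports unhealthy when that file goes stale. Nothing in this thread says the Prefect 2.0 agent writes such a file, so the heartbeat writer, the path `/tmp/agent-heartbeat`, and the 60-second threshold are all assumptions.

```python
import os
import time

# Hypothetical values; the agent does not write this file on its own.
HEARTBEAT_PATH = "/tmp/agent-heartbeat"
MAX_AGE_SECONDS = 60


def touch_heartbeat(path=HEARTBEAT_PATH):
    """Create the heartbeat file if needed and update its modification time."""
    with open(path, "a"):
        os.utime(path, None)


def heartbeat_is_fresh(path=HEARTBEAT_PATH, max_age_seconds=MAX_AGE_SECONDS, now=None):
    """Return True if the heartbeat file was modified recently enough."""
    now = time.time() if now is None else now
    try:
        age = now - os.path.getmtime(path)
    except OSError:
        # A missing or unreadable file is treated as unhealthy.
        return False
    return age <= max_age_seconds


# A health-check script would end with something like:
#   sys.exit(0 if heartbeat_is_fresh() else 1)
# since container health checks treat exit code 0 as healthy.
```

The staleness check covers the case where the process is still alive but too busy to poll for new runs, which a plain "is the process running" check would miss.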
z
Are the agents running flows locally or on external infrastructure?
p
locally.
z
Ah, it does seem possible for the agent to be “unhealthy” then: if it is running a bunch of work, it may be resource-starved and unable to query for more runs.
We’ve got some changes coming to this interface soon. I don’t think we can promise this quickly, but it’s in the works. cc @Jeremiah
j
Yes, we’ll be supercharging agents in the very near term. I would think this enhancement will be straightforward on top of those changes
p
Got it. Thanks for the input, folks. So sounds like another typical workflow is the agent acting like “an orchestrator” - taking work off the queue and running the flow on a different compute unit?
z
Yeah that’s far more common for production usage so you can allocate resources per flow run.
p
Yeah, with ECS tasks, that part takes too long to trigger the run, hence we’re trying the workers-on-standby approach.
z
Are your runs ad hoc or scheduled?
p
ad hoc
z
Ah that’s trickier. For scheduled runs, we can submit them early.
For ad hoc runs, if latency is important, it sounds like you’ve got the correct solution.
If you open an enhancement request on GitHub, we can track a change to support this.
p
Got it; appreciate the input. To confirm, my enhancement request would be a feature request to enable tracking agent health locally, right?
z
👍
I could see us returning information on consumed concurrency slots too, allowing you to know when to scale up.
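If consumed concurrency slots were exposed, a scale-up decision could be as simple as a utilization threshold. This is only a sketch of that idea: the function `should_scale_up`, its parameters, and the 0.8 default are hypothetical, and no such data is returned by the agent today according to this thread.

```python
def should_scale_up(consumed_slots, total_slots, threshold=0.8):
    """Return True when slot utilization reaches the given threshold.

    consumed_slots and total_slots are assumed to come from whatever
    health information the agent might report in the future.
    """
    if total_slots <= 0:
        raise ValueError("total_slots must be positive")
    return consumed_slots / total_slots >= threshold
```

For example, `should_scale_up(9, 10)` would signal that another standby agent is worth starting, while `should_scale_up(2, 10)` would not.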