# ask-community
h
Occasionally I'll have an agent that stops responding for some reason. After I bounce it, it seems to take a while (like 20 minutes) before it can start picking up flows? Has anyone seen this?
k
Hi @Hugo Shi! Are you referring to the Lazarus process? Lazarus is a multi-tenant service that checks for Flow Runs with a heartbeat that is 10 or more minutes stale. The Lazarus process runs every 10 minutes, meaning it may take up to 20 minutes for a Flow Run to be recognized.
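As a rough sketch of where the "up to 20 minutes" figure comes from: the two 10-minute numbers are taken from the message above, while the fixed check grid (t = 0, 10, 20, ... minutes) and the function name `detection_delay` are assumptions made purely for illustration, not a description of how Lazarus is actually scheduled.

```python
import math

STALE_AFTER_MIN = 10  # heartbeat age at which a Flow Run counts as stale (from the thread)
CHECK_EVERY_MIN = 10  # how often the check runs (from the thread)


def detection_delay(last_heartbeat_min: float) -> float:
    """Minutes from the last heartbeat to the first check that sees it as stale.

    Assumes (hypothetically) that checks fire at t = 0, 10, 20, ... minutes.
    """
    earliest_stale = last_heartbeat_min + STALE_AFTER_MIN
    # First check on the grid at or after the moment the heartbeat turns stale.
    first_check = math.ceil(earliest_stale / CHECK_EVERY_MIN) * CHECK_EVERY_MIN
    return first_check - last_heartbeat_min


# Sweep the last-heartbeat time across one check interval in 0.1-minute steps.
delays = [detection_delay(t / 10) for t in range(0, 100)]
print(f"best case: {min(delays):.1f} min, worst case: {max(delays):.1f} min")
```

Under these assumptions the delay is 10 minutes when the heartbeat stops exactly on a check and approaches 20 minutes when it stops just after one.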
h
No, not the Lazarus - the agent itself. Whenever this happens, I need to bounce the agent, and after that it's 20-30 minutes before the agent can pick up any flows. It feels like my cloud scheduler is "busy"?
n
Hi @Hugo Shi - can you provide some more information about your agent and where you're running it? In particular, I'm curious what agent type it is and what sort of resources the machine running it has.
h
Yes! I'm running 2 agents. One is a kubernetes-agent (but that has some Saturn-specific machinery, so it's easier if we talk about the other one). I'm also running a Local Agent. Both appear to have this issue.
(or had this issue)
what I usually observe is that both of my agents stop responding at the same time (this was around 5am Pacific time yesterday). After that, restarting the agent resolves the problem in about 30 minutes, but during those 30 minutes the agents won't pick up any flows
and at the 30-ish minute mark, there is a flurry of activity in the logs of picking up the flows that have been backed up
I'm not sure if this is related - but since we're developing a Prefect Cloud integration in our product, we have automatically running integration tests that create flows and agents in our Prefect Cloud account, so I'm not sure if that creates additional load on our cloud scheduler?
n
Hm got it - and how are you restarting your agent? With a sigterm and prefect agent start?
h
It's actually running in k8s even though it's not a k8s agent - I'm stopping the pod, and then restarting it
my cmd line is like this:
prefect agent local start --label local-ia -p /home/jovyan/git-repos/saturn-internal-analytics
n
ah understood - is it possible it takes some time for your pod to stop and start again, independent of the agent start command?
h
no I don't believe so, because the agent is logging, it's just that it doesn't pick up flows until 30 minutes later (and they are stuck in "scheduled" in the Prefect Cloud UI). In addition, the Prefect Cloud UI recognizes my agents as active
n
Hm interesting; this isn't something I've encountered before but I do have some suspicions. Can you confirm the system time on your agent pod?
h
lemme check
Tue Apr 13 17:26:27 UTC 2021
you're thinking clock drift?
n
Hm that was a thought but now I'm less sure. The thing I can't explain is why the work would be released all at once; if there was clock drift I'd expect that the work was still released at the correct cadence, just offset by the drift
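To make that reasoning concrete, here is a toy sketch. It is not Prefect code: the schedule, the 30-minute figure, and every function name are hypothetical. It only contrasts the two patterns: a constant clock offset shifts each release but keeps the spacing, while a stall that clears at one moment releases the whole backlog together.

```python
def release_times_with_drift(scheduled, drift):
    """Each run is released at its scheduled time plus a constant clock offset."""
    return [t + drift for t in scheduled]


def release_times_after_stall(scheduled, stall_end):
    """Nothing is released before stall_end; overdue runs all go out at stall_end."""
    return [max(t, stall_end) for t in scheduled]


def gaps(times):
    """Spacing between consecutive releases."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


scheduled = [0, 5, 10, 15, 20, 25]  # minutes, hypothetical 5-minute cadence

drifted = release_times_with_drift(scheduled, drift=30)
stalled = release_times_after_stall(scheduled, stall_end=30)

print("drift gaps:", gaps(drifted))  # cadence preserved, just offset
print("stall gaps:", gaps(stalled))  # backlog released in one burst
```

The flurry of activity at the 30-minute mark described earlier in the thread matches the second pattern, which is why drift alone seems like a poor fit.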