
Prefect Community

Occasionally I'll have an agent that stops responding for some reason. After I bounce it, it seems to take a while (like 20 minutes) before it can start picking up flows. Has anyone seen this?

Hi <@UTQJ5786P>! Are you referring to the Lazarus process?

Lazarus is a multi-tenant service that checks for Flow Runs with a heartbeat that is 10 or more minutes stale. The Lazarus process runs every 10 minutes, meaning it may take up to 20 minutes after the last heartbeat for a stale Flow Run to be recognized.
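To make the "up to 20 minutes" arithmetic concrete, here is a minimal sketch of a periodic staleness check. It only uses the two numbers quoted above (10-minute threshold, 10-minute interval); the function name and the assumption that checks land on exact multiples of the interval are hypothetical and not taken from Prefect itself.

```python
import math


def minutes_until_detected(last_heartbeat, stale_after=10, check_every=10):
    """Minutes between the last heartbeat and the first check that sees it as stale.

    Assumes (for illustration only) that the checker runs at
    t = 0, check_every, 2 * check_every, ... on the same clock as the heartbeat.
    """
    # The run counts as stale once the heartbeat is `stale_after` minutes old.
    stale_at = last_heartbeat + stale_after
    # The first scheduled check at or after that moment notices it.
    first_check = math.ceil(stale_at / check_every) * check_every
    return first_check - last_heartbeat


# Best case: the heartbeat stops exactly on a check boundary.
print(minutes_until_detected(0))   # 10
# Near-worst case: the heartbeat stops just after a check has run.
print(minutes_until_detected(1))   # 19
```

Under these assumptions the delay always falls between 10 and just under 20 minutes, depending on where the last heartbeat lands relative to the check schedule.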

No, not the Lazarus - the agent itself. Whenever this happens, I need to bounce the agent, and after that it's 20-30 minutes before the agent can pick up any flows. It *feels* like my cloud scheduler is "busy"?

Hi <@UTQJ5786P> - can you provide some more information about your agent and where you're running it? In particular, I'm curious what agent type you're using and what sorts of resources the machine running it has.

Yes! I'm running 2 agents. One is a kubernetes-agent (but that has some Saturn-specific machinery, so it's easier if we talk about the other one). I'm also running a Local Agent, and both appear to have this issue.

what I usually observe is that both of my agents stop responding at the same time (this was around 5am Pacific time yesterday), and after that, restarting the agent will resolve the problem in about 30 minutes, but during those 30 minutes, the agents won't pick up any flows

and at the 30-ish minute mark, there is a flurry of activity in the logs of picking up the flows that have been backed up

I'm not sure if this is related - but since we're developing a Prefect cloud integration in our product, we have automatically running integration tests that create flows and agents in our prefect cloud account, so I'm not sure if that creates additional load on our cloud scheduler?

Hm got it - and how are you restarting your agent? With a `sigterm` and `prefect agent start` ?

It's actually running in k8s even though it's not a k8s agent - I'm stopping the pod, and then restarting it

my cmd line is like this:

```
prefect agent local start --label local-ia -p /home/jovyan/git-repos/saturn-internal-analytics
```

ah understood - is it possible it takes some time for your pod to stop and start again, independent of the agent start command?

no i don't believe so, because the agent is logging, it's just that it doesn't pick up flows until 30 minutes later (and they are stuck in "scheduled" in the prefect cloud ui). in addition, the prefect cloud UI recognizes my agents as active

Hm interesting; this isn't something I've encountered before but I do have some suspicions. Can you confirm the system time on your agent pod?

Hm that was a thought but now I'm less sure. The thing I can't explain is why the work would be released all at once; if there was clock drift I'd expect that the work was still released at the correct cadence, just offset by the drift
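The clock-drift reasoning above can be sketched with a toy model. This assumes, purely for illustration, that an agent whose clock runs behind picks up each run late by exactly the drift; the function names and the numbers (runs every 5 minutes, 30 minutes of drift) are hypothetical and not a description of how Prefect agents actually poll for work.

```python
def pickup_times(scheduled_minutes, drift_minutes):
    """True-time pickup of each run if the agent's clock lags by drift_minutes."""
    return [s + drift_minutes for s in scheduled_minutes]


def gaps(times):
    """Spacing between consecutive pickups."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


scheduled = [0, 5, 10, 15, 20]  # hypothetical runs, one every 5 minutes

# Pure drift: every run is late by the same amount, so the cadence survives.
drifted = pickup_times(scheduled, drift_minutes=30)
print(gaps(drifted))  # [5, 5, 5, 5]

# Roughly what was reported instead: the backlog is released in one burst.
burst = [30 for _ in scheduled]
print(gaps(burst))    # [0, 0, 0, 0]
```

In this toy model drift shifts the schedule without collapsing it, which is why a single flurry of pickups at the 30-minute mark points away from simple clock drift.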