# prefect-cloud
k
Hi, good morning. I want to check whether there was an issue with Prefect 2 Cloud yesterday at around 7 pm London time. All my agents lost connection to the server and terminated the agent process.
c
I second this. I woke up to 2 of my work queue agents being down completely, and the third one had everything in Pending status. Something must have happened.
b
Hey team! After checking the status page, there was a period of UI latency starting at 18:59 UTC.
k
Can you suggest any good way to keep agents alive when the server is down?
b
It has since been resolved. Are you still seeing any odd behaviour?
k
It's working again, but I missed the window when the data could be loaded because the agents were down.
c
Same here. Agents went down for hours. Woke up to 2 of them Unhealthy and had to get into the servers and restart the work queues manually.
g
Daemonizing agents is always the suggested path to ensure that the service restarts in the event of any failure. Here is an example of using systemd to daemonize
c
FYI I have all prefect agents daemonized and they did not restart after they went into Unhealthy status. Whatever happened, daemonizing didn’t help with that.
g
Can you explain a bit more? If the agents restarted successfully then what about the work queues would need to change?
c
They did NOT restart successfully. I had to manually go into the servers and restart the services myself.
They went into Unhealthy status overnight and stayed that way. All flow runs scheduled during that time went into Late status or stayed Pending forever.
b
We'll take this information back to our engineering team. I agree that this is something that needs to be looked at in more detail.
g
There was an issue where agents and workers would stop polling after repeated HTTP errors but wouldn't fully crash, so daemonization or Kubernetes wouldn't know to restart the service. That was fixed in this PR which was released with 2.10.5. If these were agents on an older version, I suspect that may be the root cause.
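To illustrate why a process that stops polling without exiting defeats a supervisor, here is a minimal Python sketch of the general "fail loudly" pattern. It is not Prefect's code and not the fix from the PR; `poll_until_unhealthy`, `TooManyFailures`, the failure threshold, and the polling interval are all hypothetical names and values chosen for the example. The idea is only that after several consecutive polling failures the process exits with a non-zero status, which is the signal systemd or Kubernetes needs in order to restart it.

```python
import sys
import time
from typing import Callable, Optional


class TooManyFailures(RuntimeError):
    """Raised when polling has failed too many times in a row (hypothetical)."""


def poll_until_unhealthy(
    poll_once: Callable[[], None],
    max_consecutive_failures: int = 3,
    interval_seconds: float = 10.0,
    max_iterations: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call poll_once repeatedly; raise once failures pile up back to back.

    Returns the number of polls performed if max_iterations is reached.
    """
    failures = 0
    iterations = 0
    while max_iterations is None or iterations < max_iterations:
        iterations += 1
        try:
            poll_once()
            failures = 0  # any success resets the streak
        except Exception as exc:
            failures += 1
            if failures >= max_consecutive_failures:
                raise TooManyFailures(
                    f"{failures} consecutive polling failures"
                ) from exc
        sleep(interval_seconds)
    return iterations


def run(poll_once: Callable[[], None]) -> None:
    """Entry point sketch: exit non-zero so a supervisor can restart us."""
    try:
        poll_until_unhealthy(poll_once)
    except TooManyFailures as exc:
        print(f"Giving up: {exc}", file=sys.stderr)
        sys.exit(1)
```

With a loop like this, a supervisor configured to restart on failure sees the non-zero exit and brings the process back, whereas a loop that swallows errors and goes quiet looks healthy from the outside.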
c
Noted. I will go ahead and update Prefect on all my work queue agents to the latest version to account for that fix.