Kirill Egorov

04/28/2023, 5:47 AM
Good morning. Was there an issue with Prefect 2 Cloud yesterday at around 7 pm London time? All my agents lost connection to the server and the agent processes terminated.

Carlos Cueto

04/28/2023, 2:04 PM
I second this. I woke up to 2 of my work queue agents being down completely, and the third one had everything in Pending status. Something must have happened.

Bianca Hoch

04/28/2023, 2:35 PM
Hey team! After checking the status page, there was a period of UI latency starting at 18:59 UTC.

Kirill Egorov

04/28/2023, 2:36 PM
Can you suggest any good way to keep agents alive when the server is down?

Bianca Hoch

04/28/2023, 2:36 PM
It has since been resolved. Are you still seeing any odd behaviour?

Kirill Egorov

04/28/2023, 2:37 PM
It's working again, but I missed the window in which the data could be loaded because the agents were down.

Carlos Cueto

04/28/2023, 2:38 PM
Same here. Agents went down for hours. Woke up to 2 of them Unhealthy and had to get into the servers and restart the work queues manually.

George Coyne

04/28/2023, 2:54 PM
Daemonizing agents is always the suggested path to ensure that the service restarts in the event of any failure. Here is an example of using systemd to daemonize.
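For anyone who cannot use systemd, the restart-on-exit idea can be sketched in plain Python. This is only an illustration, not a replacement for a real service manager: the `supervise` function and its parameters are made up for this example, and the work queue name `default` in the usage comment is an assumption.

```python
import subprocess
import time


def supervise(cmd, max_restarts=None, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Run `cmd` and restart it every time it exits.

    Hypothetical helper for illustration. Returns the list of exit codes
    once `max_restarts` restarts have been used up (never returns if
    `max_restarts` is None).
    """
    exit_codes = []
    delay = base_delay
    while True:
        # Blocks until the child process exits, for whatever reason.
        exit_codes.append(subprocess.call(cmd))
        if max_restarts is not None and len(exit_codes) > max_restarts:
            return exit_codes
        # Exponential backoff so a dead server is not hammered with reconnects.
        sleep(delay)
        delay = min(delay * 2, max_delay)


# Possible usage (queue name "default" is an assumption):
# supervise(["prefect", "agent", "start", "-q", "default"])
```

Like systemd, a loop of this kind only helps when the agent process actually exits. A process that stays alive but stops doing work will not be restarted by it.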

Carlos Cueto

04/28/2023, 2:55 PM
FYI I have all prefect agents daemonized and they did not restart after they went into Unhealthy status. Whatever happened, daemonizing didn’t help with that.

George Coyne

04/28/2023, 3:09 PM
Can you explain a bit more? If the agents restarted successfully then what about the work queues would need to change?

Carlos Cueto

04/28/2023, 3:11 PM
They did NOT restart successfully. I had to manually go into the servers and restart the services myself.
They went into Unhealthy status overnight and stayed that way. All flow runs scheduled during that time went into Late status or stayed Pending forever

Bianca Hoch

04/28/2023, 3:40 PM
We'll take this information back to our engineering team. I agree that this is something that needs to be looked at in more detail.

George Coyne

04/28/2023, 4:31 PM
There was an issue where agents and workers would stop polling on subsequent HTTP errors but wouldn’t fully crash, so daemonization or Kubernetes wouldn’t know to restart the service. That was fixed in this PR, which was released with 2.10.5. If these were agents on an older version, I suspect that may be the root cause.
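To illustrate the failure mode in general terms, here is a minimal sketch of a polling loop that gives up loudly after repeated errors instead of going quiet. It is not the code from the PR and does not use Prefect's API: `poll_until_unhealthy`, `PollingGaveUp`, and the threshold of 3 are all hypothetical names and values chosen for the example.

```python
import time


class PollingGaveUp(RuntimeError):
    """Raised when too many polls fail in a row (hypothetical name)."""


def poll_until_unhealthy(poll_once, max_consecutive_failures=3, interval=0.0, sleep=time.sleep):
    """Call `poll_once` forever; raise after N consecutive network errors.

    Letting the exception propagate ends the process with a non-zero exit
    code, which is the signal systemd or Kubernetes needs to restart it.
    """
    failures = 0
    while True:
        try:
            poll_once()
            failures = 0  # any success resets the streak
        except OSError as exc:  # ConnectionError, TimeoutError, urllib's URLError, ...
            failures += 1
            if failures >= max_consecutive_failures:
                raise PollingGaveUp(f"{failures} consecutive failures") from exc
        sleep(interval)
```

The design point is that the process should either keep polling or exit. A loop that swallows errors and silently stops polling looks healthy to a supervisor while doing no work.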
👍 1

Carlos Cueto

04/28/2023, 4:37 PM
Noted. I will go ahead and update Prefect on all my work queue agents to the latest version to account for that fix.