# prefect-community
Vincent Chéry
Hi! I'm having a problem that's been impacting our system in production for months now. In an apparently random fashion, from once a month to once every few days, a request from a flow runner to the server API times out, and that's the start of a major sh*t storm which usually ends with having to manually restart the machine. (Details in the next messages...)
We run a self-hosted Prefect Server (0.15.11) and run flows on the same machine using DockerRun run configs. Prefect Server is started with the `--expose` flag so runners can communicate with the server API through `host.docker.internal`. We run ~5 to 6 flows every minute and most of the time it works like a charm, with stable resource usage at ~30% CPU and 15 GB of available memory out of 24 GB, but from time to time we see huge spikes in CPU and memory usage on the machine, which can lead to a complete breakdown and a machine restart.
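For context, a flow registration looks roughly like this (the flow name, image and project are made up, and the explicit env override is only there to make the `host.docker.internal` dependency visible; normally the Docker agent injects the API address into flow-run containers on its own):
```python
# Minimal sketch of our setup: a flow run as a Docker container that
# talks to the server API through host.docker.internal.
from prefect import Flow, task
from prefect.run_configs import DockerRun

@task
def do_work():
    return "ok"

with Flow("example-flow") as flow:  # name is illustrative
    do_work()

flow.run_config = DockerRun(
    image="our-flow-image:latest",  # illustrative image
    # Normally the Docker agent sets this itself; shown here only to
    # make the dependency on host.docker.internal explicit.
    env={"PREFECT__CLOUD__API": "http://host.docker.internal:4200"},
)

flow.register(project_name="production")  # illustrative project name
```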
During the night between yesterday and today there were 3 spikes in memory usage, starting at 23:44, 00:34 and 01:35. Last week I started logging `docker stats` to see which Docker containers were causing these spikes, and in this occurrence of intensive resource usage I found that the problem comes from a particular flow run, "thankful-corgi".
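Something like this is enough for that logging (a minimal sketch; the interval and output path are arbitrary choices):
```python
# Periodically append a snapshot of `docker stats` to a log file.
import subprocess
import time
from datetime import datetime

LOG_FILE = "/var/log/docker-stats.log"  # arbitrary path
INTERVAL_SECONDS = 30                   # arbitrary interval

while True:
    # --no-stream prints a single snapshot instead of refreshing forever
    snapshot = subprocess.run(
        [
            "docker", "stats", "--no-stream",
            "--format", "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}",
        ],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    with open(LOG_FILE, "a") as f:
        f.write(f"--- {datetime.now().isoformat()} ---\n{snapshot}\n")
    time.sleep(INTERVAL_SECONDS)
```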
Looking at the logs of this flow run in our UI, I see that the flow run was running fine until:
23:33:21: Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Then
23:44:27: Rescheduled by a Lazarus process. This is attempt 1.
This rescheduled flow run failed again due to an API timeout:
23:58:57: Failed to send heartbeat with exception: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Which was rescheduled again:
00:34:28: Rescheduled by a Lazarus process. This is attempt 2.
Which failed again due to an API timeout:
00:34:37: Failed to send heartbeat with exception: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Which was rescheduled a third and last time and failed again:
01:05:14: Rescheduled by a Lazarus process. This is attempt 3.
01:35:15: A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.
I don't understand why this happens, and in particular:
• Why does the restarted flow run eat so much CPU and memory? The task executed in the flow run is not resource-intensive in itself.
• Why, once a runner container starts to have communication problems with the server, does it keep failing to reach the server, while other flow runs running in parallel at the same time can communicate with the server API just fine?
At this point the options I'm considering are:
• upgrade to 1.3 (https://github.com/PrefectHQ/prefect/pull/5825 might help)
• deactivate the Lazarus process (rough sketch just below), although I'm not sure this is a good idea as it could lead to other problems?
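For the Lazarus option, I'd expect something along these lines to work, assuming the per-flow mutation is really called disable_flow_lazarus_process (that's the name I've seen for the flow setting, but please check your server's GraphQL schema; if I remember correctly the same toggle also exists in the flow's Settings page in the UI):
```python
# Sketch: disable the Lazarus process for a single flow through the
# GraphQL API. The mutation name and input shape are assumptions taken
# from the flow-settings docs; verify them against your server schema.
from prefect import Client

client = Client()  # uses the configured server endpoint

flow_id = "<flow-id-from-the-UI>"  # placeholder

result = client.graphql(
    f"""
    mutation {{
      disable_flow_lazarus_process(input: {{ flow_id: "{flow_id}" }}) {{
        success
      }}
    }}
    """
)
print(result)
```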
Any ideas about why this happens and how else I could solve it are welcome!
I forgot:
```
$ prefect diagnostics
{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Linux-4.19.0-21-amd64-x86_64-with-glibc2.2.5",
    "prefect_backend": "server",
    "prefect_version": "0.15.11",
    "python_version": "3.8.13"
  }
}
```
Benson
Hi @Vincent Chéry, I ran into this very issue recently. Were you able to diagnose what's wrong? Thank you.
Vincent Chéry
Hi Benson, happy to see I'm not the only one haha! I did not identify the cause of the initial read timeout that randomly occurs. Two things I suspect could play a role are:
• too many services running on the same machine: if a flow run takes all resources for some time, Prefect Server cannot handle the requests and the request hits the 15-second timeout
• some weird random Docker networking issue related to the use of `host.docker.internal`, as reported in this issue: https://github.com/docker/for-win/issues/8861
I'm currently in the process of moving the Prefect Server instance to a separate VM to get around these two potential problems, hoping it will solve the problem. In the meantime, a quick check like the one sketched below can help tell the two apart.
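Something like this, run from inside one of the flow-run containers, is what I have in mind (a minimal sketch; the query body doesn't matter, since any HTTP response at all, even an error status, proves basic connectivity, while a hang reproduces the ReadTimeout):
```python
# Time a request to the server API from inside a flow-run container.
# Assumption: the API is reachable at http://host.docker.internal:4200,
# as in the error messages above. Any HTTP response (even 400) proves
# network connectivity; a ReadTimeout reproduces the failure mode.
import time
import requests

API_URL = "http://host.docker.internal:4200/graphql"

start = time.monotonic()
try:
    response = requests.post(API_URL, json={"query": "{ hello }"}, timeout=15)
    elapsed = time.monotonic() - start
    print(f"Got HTTP {response.status_code} in {elapsed:.2f}s")
except requests.exceptions.ReadTimeout:
    print("Read timed out after 15s, same as the flow-run errors")
except requests.exceptions.ConnectionError as exc:
    print(f"Could not connect at all: {exc}")
```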