Vincent Chéry
08/10/2022, 11:36 AM
… --expose flag so runners can communicate with the server API through host.docker.internal.
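For context, here is a minimal sketch of that connectivity from inside a runner container, assuming the default API port 4200, the Prefect 1.x Client's api_server argument, and the server's standard hello health-check query:

# Sketch: confirm a container can reach the Prefect Server API via
# host.docker.internal (assumes the server listens on the default port 4200
# and answers the standard `hello` health-check query).
from prefect import Client

client = Client(api_server="http://host.docker.internal:4200")
print(client.graphql("query { hello }"))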
We run ~5 to 6 flows every minute, and most of the time it works like a charm, with stable resource usage at ~30% CPU and 15 GB of available memory out of 24 GB. But from time to time we see huge spikes in CPU and memory usage on the machine, which can lead to a complete breakdown and a machine restart.
I used docker stats to see which Docker containers were causing these spikes (a sketch of the same check via the Docker SDK follows the log excerpt below), and in this occurrence of intensive resource usage I found the problem comes from a particular flow run, "thankful-corgi":
23:33:21 : Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Then
23:44:27 : Rescheduled by a Lazarus process. This is attempt 1.
This rescheduled flow run failed again due to an API timeout:
23:58:57 : Failed to send heartbeat with exception: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Which was rescheduled again:
00:34:28 Rescheduled by a Lazarus process. This is attempt 2.
Which failed again due to an API timeout:
00:34:37 : Failed to send heartbeat with exception: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Which was rescheduled a third and last time and failed again:
01:05:14 Rescheduled by a Lazarus process. This is attempt 3.
01:35:15 : A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.
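As mentioned above, docker stats was used to find the offending container; here is a sketch of the same check via the Docker SDK for Python (assumes the docker package is installed and the daemon is reachable), roughly what docker stats --no-stream prints:

# Take one CPU/memory snapshot per running container to spot the one
# behind a spike (rough equivalent of `docker stats --no-stream`).
import docker

client = docker.from_env()
for container in client.containers.list():
    stats = container.stats(stream=False)  # single snapshot, no streaming
    mem_mib = stats["memory_stats"].get("usage", 0) / 1024 ** 2
    cpu = stats["cpu_stats"]["cpu_usage"]["total_usage"]
    precpu = stats["precpu_stats"].get("cpu_usage", {}).get("total_usage", 0)
    sys_delta = (stats["cpu_stats"].get("system_cpu_usage", 0)
                 - stats["precpu_stats"].get("system_cpu_usage", 0))
    # Rough CPU percentage over the sampling interval
    cpu_pct = 100.0 * (cpu - precpu) / sys_delta if sys_delta > 0 else 0.0
    print(f"{container.name}: cpu ~{cpu_pct:.1f}%, mem ~{mem_mib:.0f} MiB")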
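The repeated "Failed to send heartbeat" errors come from the flow runner's heartbeat loop, and those missed heartbeats are what typically lead Lazarus to reschedule the run. A hedged sketch of one knob Prefect 1.x exposes for this, assuming flows use a DockerRun run config and that the heartbeat-mode setting is available in 0.15.x (the flow name below is purely illustrative):

from prefect import Flow
from prefect.run_configs import DockerRun

with Flow("example-flow") as flow:  # hypothetical flow, for illustration only
    ...

# Run the flow heartbeat in a thread instead of a subprocess, a setting
# often suggested when heartbeats time out against the API (assumption:
# PREFECT__CLOUD__HEARTBEAT_MODE is honored by this Prefect version).
flow.run_config = DockerRun(
    env={"PREFECT__CLOUD__HEARTBEAT_MODE": "thread"},
)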
$ prefect diagnostics
{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Linux-4.19.0-21-amd64-x86_64-with-glibc2.2.5",
    "prefect_backend": "server",
    "prefect_version": "0.15.11",
    "python_version": "3.8.13"
  }
}
Benson Mwangi
08/26/2022, 10:15 PM
Vincent Chéry
08/29/2022, 11:45 AM
… host.docker.internal as reported in this issue: https://github.com/docker/for-win/issues/8861
I'm currently in the process of moving the Prefect Server instance to a separate VM to get around these two potential problems, hoping it will solve the issue.