# prefect-community
Vincent Chéry
Hi! I'm having a problem that's been impacting our system in production for months now. In an apparently random fashion, from once a month to once every few days, a request from a flow runner to the server API times out, and that's the start of a major sh*t storm which usually ends with having to manually restart the machine. (Details in the next messages...)
We run a self-hosted Prefect Server (0.15.11) and run flows on the same machine using DockerRun run configs. Prefect Server is started with the `--expose` flag so runners can communicate with the server API through `host.docker.internal`. We run ~5 to 6 flows every minute and most of the time it works like a charm, with stable resource usage at ~30% CPU and 15 GB of available memory out of 24 GB, but from time to time we see huge spikes in CPU and memory usage on the machine, which can lead to a complete breakdown and a machine restart.
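For context, a flow registration looks roughly like this (the flow name, image and project are made up, and the explicit env override is only there to make the `host.docker.internal` dependency visible; normally the Docker agent injects the API address into flow-run containers on its own):
```python
# Minimal sketch of our setup: a flow run as a Docker container that
# talks to the server API through host.docker.internal.
from prefect import Flow, task
from prefect.run_configs import DockerRun

@task
def do_work():
    return "ok"

with Flow("example-flow") as flow:  # name is illustrative
    do_work()

flow.run_config = DockerRun(
    image="our-flow-image:latest",  # illustrative image
    # Normally the Docker agent sets this itself; shown here only to
    # make the dependency on host.docker.internal explicit.
    env={"PREFECT__CLOUD__API": "http://host.docker.internal:4200"},
)

flow.register(project_name="production")  # illustrative project name
```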
During the night between yesterday and today there were 3 spikes in memory usage, starting at 23:44, 00:34 and 01:35. Last week I started logging `docker stats` to see which Docker containers were causing these spikes, and in this occurrence of intensive resource usage I found that the problem comes from a particular flow run, "thankful-corgi".
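Something like this is enough for that logging (a minimal sketch; the interval and output path are arbitrary choices):
```python
# Periodically append a snapshot of `docker stats` to a log file.
import subprocess
import time
from datetime import datetime

LOG_FILE = "/var/log/docker-stats.log"  # arbitrary path
INTERVAL_SECONDS = 30                   # arbitrary interval

while True:
    # --no-stream prints a single snapshot instead of refreshing forever
    snapshot = subprocess.run(
        [
            "docker", "stats", "--no-stream",
            "--format", "{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}",
        ],
        capture_output=True,
        text=True,
        check=False,
    ).stdout
    with open(LOG_FILE, "a") as f:
        f.write(f"--- {datetime.now().isoformat()} ---\n{snapshot}\n")
    time.sleep(INTERVAL_SECONDS)
```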
Looking at the logs of this flow run in our UI, I see that the flow run was running fine until:
23:33:21: Failed to set task state with error: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Then
23:44:27: Rescheduled by a Lazarus process. This is attempt 1.
This rescheduled flow run failed again due to an API timeout:
23:58:57: Failed to send heartbeat with exception: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Which was rescheduled again:
00:34:28: Rescheduled by a Lazarus process. This is attempt 2.
Which failed again due to an API timeout:
00:34:37: Failed to send heartbeat with exception: ReadTimeout(ReadTimeoutError("HTTPConnectionPool(host='host.docker.internal', port=4200): Read timed out. (read timeout=15)"))
Which was rescheduled a third and last time and failed again:
01:05:14: Rescheduled by a Lazarus process. This is attempt 3.
01:35:15: A Lazarus process attempted to reschedule this run 3 times without success. Marking as failed.
I don't understand why this happens, and in particular:
• Why does the restarted flow run eat so much CPU and memory? The task executed in the flow run is not resource-intensive in itself.
• Why, once a runner container starts to have communication problems with the server, does it keep failing to reach the server, while other flow runs running in parallel at the same time can communicate with the server API just fine?
At this point the options I'm considering are:
• upgrade to 1.3 (https://github.com/PrefectHQ/prefect/pull/5825 might help)
• deactivate the Lazarus process (rough sketch just below), although I'm not sure this is a good idea as it could lead to other problems?
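For the Lazarus option, I'd expect something along these lines to work, assuming the per-flow mutation is really called disable_flow_lazarus_process (that's the name I've seen for the flow setting, but please check your server's GraphQL schema; if I remember correctly the same toggle also exists in the flow's Settings page in the UI):
```python
# Sketch: disable the Lazarus process for a single flow through the
# GraphQL API. The mutation name and input shape are assumptions taken
# from the flow-settings docs; verify them against your server schema.
from prefect import Client

client = Client()  # uses the configured server endpoint

flow_id = "<flow-id-from-the-UI>"  # placeholder

result = client.graphql(
    f"""
    mutation {{
      disable_flow_lazarus_process(input: {{ flow_id: "{flow_id}" }}) {{
        success
      }}
    }}
    """
)
print(result)
```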
Any ideas about why this happens and how else I could solve it are welcome!
I forgot:
```
$ prefect diagnostics
{
  "config_overrides": {},
  "env_vars": [],
  "system_information": {
    "platform": "Linux-4.19.0-21-amd64-x86_64-with-glibc2.2.5",
    "prefect_backend": "server",
    "prefect_version": "0.15.11",
    "python_version": "3.8.13"
  }
}
```
Benson
Hi @Vincent Chéry, I ran into this very issue recently. Were you able to diagnose what's wrong? Thank you.
Vincent Chéry
Hi Benson, happy to see I'm not the only one haha! I did not identify the cause of the initial read timeout that randomly occurs. Two things I suspect could play a role are:
• too many services running on the same machine: if a flow run takes all resources for some time, Prefect Server cannot handle the requests and the request hits the 15-second timeout
• some weird random Docker networking issue related to the use of `host.docker.internal`, as reported in this issue: https://github.com/docker/for-win/issues/8861
I'm currently in the process of moving the Prefect Server instance to a separate VM to get around these two potential problems, hoping it will solve the problem. In the meantime, a quick check like the one sketched below can help tell the two apart.
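Something like this, run from inside one of the flow-run containers, is what I have in mind (a minimal sketch; the query body doesn't matter, since any HTTP response at all, even an error status, proves basic connectivity, while a hang reproduces the ReadTimeout):
```python
# Time a request to the server API from inside a flow-run container.
# Assumption: the API is reachable at http://host.docker.internal:4200,
# as in the error messages above. Any HTTP response (even 400) proves
# network connectivity; a ReadTimeout reproduces the failure mode.
import time
import requests

API_URL = "http://host.docker.internal:4200/graphql"

start = time.monotonic()
try:
    response = requests.post(API_URL, json={"query": "{ hello }"}, timeout=15)
    elapsed = time.monotonic() - start
    print(f"Got HTTP {response.status_code} in {elapsed:.2f}s")
except requests.exceptions.ReadTimeout:
    print("Read timed out after 15s, same as the flow-run errors")
except requests.exceptions.ConnectionError as exc:
    print(f"Could not connect at all: {exc}")
```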