Hi
@Prefect! I have observed two similar issues with our self-hosted Prefect Server and am hoping to get some feedback from the community.
Prefect Server setup: We are running Prefect Server v2.9.0 in Google Kubernetes Engine (limits of 8 CPU and 16 GB memory, intentionally generous to rule out resources as a cause of the issues), with a Postgres GCP Cloud SQL instance that stores state (4 CPU, 16 GB memory, 10 GB SSD storage). All Prefect Deployments are configured as Kubernetes Jobs.
In general, this setup works as expected. We can run many jobs at the same time, observe their progress, schedule them, cancel them, re-run them, etc.
However, we have experienced two issues in the last two weeks that have brought down our Prefect Server (running jobs fail, UI becomes unresponsive).
Issues:
1. One of our deployments/jobs creates a very large number of logs (tens of thousands). When that job runs, the logs page in the UI becomes very slow to load. After a couple of refreshes, Prefect Server resource usage (CPU and memory) spikes, Postgres metrics spike (CPU usage, ingress/egress bytes, etc.), the UI becomes unresponsive, and currently running jobs fail because they lose their connection to Prefect Server.
Resolution: I can avoid this issue by not viewing the logs for that job in the UI. Or, of course, by not writing so many logs in the first place.
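As a stopgap on the "don't write so many logs" side, per-item loop logs can be sampled before they ever reach the API. This is a minimal sketch using only the standard library's logging filters; the logger name and sampling rate are illustrative, not anything Prefect-specific (in a real flow you would attach the filter to the logger returned by get_run_logger):

```python
import logging

class SampleFilter(logging.Filter):
    """Let through only every Nth record, so a tight loop
    emitting tens of thousands of messages produces a handful."""
    def __init__(self, every: int = 1000):
        super().__init__()
        self.every = every
        self.count = 0

    def filter(self, record: logging.LogRecord) -> bool:
        self.count += 1
        # Pass the 1st, (N+1)th, (2N+1)th, ... record
        return self.count % self.every == 1

logger = logging.getLogger("my_flow")  # hypothetical logger name
logger.setLevel(logging.INFO)
handler = logging.StreamHandler()
handler.addFilter(SampleFilter(every=1000))
logger.addHandler(handler)

for i in range(10_000):
    logger.info("processed item %d", i)  # only ~10 records actually emitted
```

This doesn't fix the server-side load from logs that already exist, but it keeps a single chatty job from flooding the log table in the first place.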
2. Another deployment/job returns a large data set (around 1 GB) from one flow, which is consumed by the next flow. When this stage of the job runs, the UI becomes unresponsive, Prefect Server throws Connection Timeout errors, Prefect Server CPU and memory spike, Postgres metrics spike, and ultimately Prefect Server goes down.
Resolution: The only way I have been able to resolve this issue is to restore the Postgres DB to a previous state. Once restored, if we remove the passing of data from one flow to the next, the issue is resolved.
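One pattern that avoids storing the large result as flow-run state is to write the data somewhere external (for a GCP setup, a GCS bucket would be the natural choice) and pass only a reference between flows. A minimal sketch, using a local temp directory to stay self-contained; the function names are hypothetical stand-ins for what would be @flow-decorated functions in Prefect:

```python
import pickle
import tempfile
from pathlib import Path

def produce(tmpdir: Path) -> str:
    """Build the large result and persist it externally,
    returning only a small path/URI as the 'result'."""
    data = list(range(1_000_000))  # stands in for the ~1 GB dataset
    out = tmpdir / "dataset.pkl"
    out.write_bytes(pickle.dumps(data))
    return str(out)  # only this short string would be stored as state

def consume(path: str) -> int:
    """Downstream flow loads the data from the reference."""
    data = pickle.loads(Path(path).read_bytes())
    return len(data)

with tempfile.TemporaryDirectory() as d:
    path = produce(Path(d))
    n = consume(path)
```

The trade-off is that the flows must agree on a storage location and serialization format, but the server and its Postgres backend only ever see the path string, not the 1 GB payload.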
Although both of these issues are avoidable (don't create so many logs, don't pass large data sets between flows), this seems to point to an application issue with how Prefect Server communicates with the Postgres DB. Has anyone else observed similar behavior? Or have suggestions for how to be more resilient to single jobs taking down the entire Prefect Server?