Scott Walsh

04/20/2023, 3:03 AM
Hi @Prefect! I have observed two similar issues with our self-hosted Prefect Server and am hoping to get some feedback from the community.

Prefect Server setup: We are running Prefect Server v2.9.0 in Google Kubernetes Engine (limits of 8 CPU and 16 GB memory, excessive so we can rule out resources as a cause), with a Postgres GCP Cloud SQL instance storing state (4 CPU, 16 GB memory, 10 GB solid-state storage). All Prefect deployments are configured as Kubernetes jobs. In general this setup works as expected: we can run many jobs at the same time, observe their progress, schedule them, cancel them, re-run them, etc. However, in the last two weeks we have hit two issues that brought down our Prefect Server (running jobs fail, the UI becomes unresponsive).

Issues:
1. One of our deployments/jobs creates a very large number of logs (let's say tens of thousands). When that job runs, the logs in the UI become very slow to load. After a couple of refreshes, Prefect Server resource usage (CPU and memory) spikes, the Postgres metrics spike (CPU usage, ingress/egress bytes, etc.), the UI becomes unresponsive, and currently running jobs fail because they lose their connection to Prefect Server. Resolution: I can avoid this by not looking at the logs for that job. Or, of course, I can just not write so many logs.
2. Another deployment/job returns a large dataset from one flow which is consumed by the next flow (let's say around 1 GB in size). When this stage of the job runs, the UI becomes unresponsive, Prefect Server throws Connection Timeout errors, Prefect Server CPU and memory spike, Postgres metrics spike, and ultimately Prefect Server goes down. Resolution: The only way I have been able to recover is to restore the Postgres DB to a previous state. Once restored, if we remove the passing of data from one flow to the next, the issue goes away.

Although both of these issues are avoidable (don't create so many logs, don't pass large datasets between flows), this seems to point to an application issue with how Prefect Server communicates with the Postgres DB. Has anyone else observed similar behavior? Or have suggestions for making the server resilient enough that a single job can't take down the entire Prefect Server?
👀 1
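[Editor's note: for issue 1, a minimal sketch of one way to keep a chatty job from flooding the API, assuming Prefect 2.x. Only records emitted through `get_run_logger()` (or through loggers listed in the `PREFECT_LOGGING_EXTRA_LOGGERS` setting) are shipped to the server, so per-record chatter can go through a plain stdlib logger instead. The flow and logger names here are illustrative, not from the thread:]

```python
import logging

from prefect import flow, get_run_logger

# Plain stdlib logger: its records stay in the pod's stdout/stderr and are
# NOT sent to the Prefect API unless its name is added to the
# PREFECT_LOGGING_EXTRA_LOGGERS setting.
local_logger = logging.getLogger("noisy_job")  # illustrative name


@flow
def process_records(records: list) -> None:
    run_logger = get_run_logger()  # everything logged here lands in the server DB
    run_logger.info("processing %d records", len(records))
    for i, record in enumerate(records):
        # Per-record chatter stays local, keeping the server-side log table small.
        local_logger.debug("processed record %d: %r", i, record)
    run_logger.info("finished %d records", len(records))
```

[Keeping the run logger at INFO and pushing per-record detail down to DEBUG on a local logger achieves the same reduction without changing what the server stores for the rest of the run.]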

Deceivious

04/21/2023, 8:23 AM
Just curious, what is the size of the data being returned in issue #2? I would like to try and avoid this. 😄

Scott Walsh

04/21/2023, 2:48 PM
It was around 1 GB. I'm interested to see whether you experience the same issue if you try this with a large dataset.

Deceivious

04/21/2023, 2:49 PM
That's pretty huge to be caching :D My caches are in KBs.

Scott Walsh

04/21/2023, 2:50 PM
Ya, as I'm writing that I'm wondering if it was actually that big. Will need to check.

Deceivious

04/21/2023, 2:51 PM
Unsure, but I run flows in a distributed environment, i.e. Kubernetes on remote machines, so a shared disk isn't a thing. A 1 GB network transfer seems huge.
I have noticed similar issues with the server throwing lots of errors while the UI tries to fetch logs but never succeeds. But I haven't traced the issue back to the source.
👍 1

Scott Walsh

04/21/2023, 2:56 PM
So this was running in a single pod, and a subflow was returning data that was used by the next subflow, all within a single parent flow. And I think I was wrong about the 1 GB size, my mistake; I'm sure it was much smaller than that.
Because the UI lets you see a flow's/subflow's parameters, I am assuming it was trying to store that whole dataset and then display it in the UI.

Deceivious

04/21/2023, 3:04 PM
That's highly likely if you are passing the huge dataset result from task one to task two as a parameter. Maybe pass the file name where you save the data instead.
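[Editor's note: a minimal sketch of that pass-a-path pattern, assuming Prefect 2.x and that both subflows run in the same pod (as in Scott's case) or can reach a shared mounted volume or object store. `DATA_DIR`, the parquet serialization, and the flow names are all illustrative, not from the thread:]

```python
from pathlib import Path

import pandas as pd  # illustrative; any serialization format works
from prefect import flow

# Illustrative location both subflows can reach (local disk in a single pod,
# or a mounted volume / object store in a distributed setup).
DATA_DIR = Path("/tmp/prefect-data")


@flow
def extract() -> str:
    df = pd.DataFrame({"value": range(1_000_000)})
    DATA_DIR.mkdir(parents=True, exist_ok=True)
    out = DATA_DIR / "extract.parquet"
    df.to_parquet(out)
    # Only this short path string is recorded as the subflow's result,
    # instead of the full dataset.
    return str(out)


@flow
def transform(path: str) -> int:
    df = pd.read_parquet(path)  # reload the large dataset where it's needed
    return len(df)


@flow
def parent() -> None:
    path = extract()   # subflow returns a tiny string
    transform(path)    # subflow receives a tiny string parameter


if __name__ == "__main__":
    parent()
```

[This keeps both the subflow result and the downstream subflow's parameters down to a few bytes in the server's database, so the UI only ever has to render the path rather than the dataset itself.]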