
Louis Burtz

04/12/2021, 2:56 AM
Hello Prefect team, is there a way to log/monitor CPU and RAM usage of an entire Flow? Or maybe monitor the usage of an Agent, or even better, the total usage on the platform that is running the Agent?
Rationale 1: I recently had a memory leak in a task. After several hundred tasks, my LocalAgent would fail with just the error message
[2021-04-11 14:35:43,952] INFO - agent | Process PID 398412 returned non-zero exit code
and the ongoing task would stay in the Running state forever, the host of the LocalAgent gently idling, with lots of free RAM. Restarting the LocalAgent and using the UI to manually set the Flow state to Pending and then Resume would fix everything, and the flow would continue until the next crash. My logging at the task time scale did not reveal a significant increase in RAM usage for each individual task -> hence the thinking to get this logging at the Flow or Agent time scales!
Rationale 2: monitoring at Flow or Agent time scales would give insights into host utilization, parallelization efficiency, etc.
Thank you in advance for any pointers or recipes to address this use case! I could find many memory-related threads on Slack but nothing that matched the Flow-level monitoring mentioned above.
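A minimal sketch of the host-level logging described above, assuming a Linux host where `/proc/meminfo` is available. It runs outside Prefect, so it keeps recording even if the LocalAgent process dies. The function names, the CSV file name, and the 60-second interval are hypothetical choices for illustration, not part of Prefect.

```python
import time


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def used_fraction(meminfo):
    """Fraction of RAM in use, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) / total


def sample_forever(path="/proc/meminfo", interval_s=60.0, log_path="host_memory.csv"):
    """Append one CSV row per interval: timestamp, total kB, available kB, used fraction.

    Hypothetical helper: start it in its own terminal or service, not inside a task.
    """
    while True:
        with open(path) as f:
            info = parse_meminfo(f.read())
        with open(log_path, "a") as out:
            out.write(
                f"{time.time():.0f},{info['MemTotal']},"
                f"{info['MemAvailable']},{used_fraction(info):.4f}\n"
            )
        time.sleep(interval_s)
```

Because the sampler is a separate process, a slow leak that is invisible per task should still show up as a steady rise in the used fraction over a whole Flow run.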

davzucky

04/12/2021, 4:08 AM
We are doing that using Prometheus. I don't think this should be part of Prefect. Did you look at that?
👍 3

Louis Burtz

04/12/2021, 5:34 AM
Thanks a lot for the quick pointer, I haven't used Prometheus yet! Do you know of a recipe / example usage of Prometheus with Prefect Agents or Flows? Do you create a Prefect task that kicks off the Prometheus logging at the start/end of each Flow? Or do you kick off Prometheus when starting Agents? How do you do error handling with e.g. Flows that don't succeed / Agents that crash? Sorry for the many questions, I'm trying to figure out, architecture-wise, at which level Prometheus would make the most sense within the Prefect workflow.

Mariia Kerimova

04/12/2021, 4:00 PM
Hello Louis! I would highly recommend this Helm chart for deploying Prometheus and other monitoring components. Prometheus doesn't provide logging, but scrapes metrics from your pods. If you are going to use that Helm chart, you can use Grafana to visualize those metrics (Grafana actually comes with a few default dashboards that you can use). You will be able to see CPU/memory, etc. for your pods 🙂 I personally like this default dashboard: (but I built a few custom too)
👍 2
Don't hesitate to ask questions about deploying this chart 🙂

Louis Burtz

04/13/2021, 12:50 AM
Thanks a lot! It feels unique to get fast pointers to tried and tested solutions! My current use case is not using Kubernetes (yet?); I'm instead focused on automating ML research workflows on a single RTX3090 workstation (and loving scheduling work that gets done hands-free over nights and weekends thanks to Prefect). I'll dig deeper into Prometheus/Helm/Grafana and the like to find a solution with low overhead and flexibility. Out of curiosity: from Prefect's point of view, would it make sense for the Agents to send back some basic system monitoring stats together with the current 'heartbeat' messages? Some metrics that would be available in the Prefect UI, e.g. in the Agent section? Or would that be opening a Pandora's box that isn't aligned with Prefect's core functionalities?

Mariia Kerimova

04/13/2021, 11:36 AM
Oh, for some reason I assumed that you're using Kubernetes. You can still run Prometheus and Grafana on your machine without Kubernetes, though. Adding metrics to the UI is actually on our roadmap; it will definitely provide more visibility.
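For the non-Kubernetes setup mentioned here, a rough sketch of what Prometheus scrapes: a plain HTTP endpoint returning gauges in the Prometheus text exposition format. This uses only the Python standard library and assumes a Unix host (for `os.getloadavg`). The metric names, the `/metrics` handler, and port 8000 are illustrative assumptions, not anything Prefect ships.

```python
import os
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(gauges):
    """Render {name: (help_text, value)} in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {float(value)}")
    return "\n".join(lines) + "\n"


def collect():
    """Collect a few host-level gauges (hypothetical metric names, Unix only)."""
    load1, _load5, _load15 = os.getloadavg()
    return {
        "host_load1": ("1-minute load average", load1),
        "host_cpu_count": ("Number of logical CPUs", os.cpu_count() or 0),
    }


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve /metrics until interrupted; point a Prometheus scrape job at this port."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Calling `serve()` on the workstation and adding that host and port as a scrape target would let Grafana chart the values alongside Flow run times. In practice, an existing exporter such as Prometheus's node_exporter covers far more host metrics than this sketch.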

Louis Burtz

04/14/2021, 12:46 AM
Thanks for that insight too!
👍 1