# ask-marvin
@Marvin how to make the best use of prefect-prometheus exporter, prometheus and grafana for better monitoring of my deployments
Marvin (thought for 311.1 seconds):
Great question: you can get a lot of value by combining Prefect metrics with Prometheus and Grafana. Here's a practical blueprint to get you from zero to "useful dashboards and alerts" quickly. I'll assume Prefect 3.x; if you're on 2.x, let me know and I'll tweak the steps.

High-level approach
- Runtime metrics from your Prefect processes (workers and flow-run processes): expose a Prometheus `/metrics` endpoint using Prefect's client metrics server.
- Orchestration insights (success rate, duration, lateness), visualized in Grafana using PromQL queries against those metrics.
- Alerts for failures, latency, and backlogs via Prometheus alerting rules (and optionally routed into your on-call tool).

1) Expose a Prometheus /metrics endpoint from Prefect
There are two ways to get an endpoint the Prometheus server can scrape:

A. Start Prefect's client metrics server in your process
- This spins up a lightweight HTTP server (default path `/metrics`) inside the process that runs your flow or worker. Code pattern (start it once, early in your runtime):
```python
from prefect import flow
from prefect.services import start_client_metrics_server

# start a single instance per process, before any runs begin
start_client_metrics_server()


@flow
def my_flow():
    # your flow code
    ...


if __name__ == "__main__":
    my_flow()
```
- Defaults: the server takes no required parameters and binds to a conventional fixed port; you can configure the port via settings if needed (see below).
- Recommended: start this in your worker container entrypoint (or at the top of your flow module) so the endpoint is consistently available for Prometheus scraping.

B. Enable via settings (if you prefer config over code)
- Prefect 3 exposes client metrics settings behind a settings model (e.g., enablement and port). If you don't want to add an explicit `start_client_metrics_server()` call, you can control the behavior via settings/environment variables and start the server wherever makes sense.
- Common environment variables:
  - `PREFECT_CLIENT_METRICS_ENABLED=true`
  - `PREFECT_CLIENT_METRICS_PORT=4201` (example)
- If settings aren't picked up in your environment, fall back to calling `start_client_metrics_server()` in code as shown above.

References:
- Prefect 3 docs
- Prefect 3 settings reference
- Client metrics settings (for enable/port details)

2) Tell Prometheus how to scrape the metrics
Pick the service discovery strategy that matches your deployment:

A. Kubernetes (recommended)
- If you start the metrics server in your Prefect worker/runner pods, add the standard Prometheus scrape annotations to the pods or Service. Example pod annotations:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "4201"
    prometheus.io/path: "/metrics"
```
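These annotations only take effect if your Prometheus is configured to honor them (the prometheus-community Helm chart defaults do). If yours isn't, an annotation-driven scrape job along the following lines is the usual pattern; this is a sketch, and the job name is arbitrary:

```yaml
scrape_configs:
  - job_name: "prefect-client-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # honor a custom metrics path if one is annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # carry useful pod metadata onto the scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```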
B. Static targets (VMs, Docker hosts, etc.)
- Add your workers/hosts as scrape targets. Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: "prefect-client"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "worker-1.example.com:4201"
          - "worker-2.example.com:4201"
```
References:
- Prometheus scrape configuration
- Kubernetes service discovery

3) Build useful Grafana dashboards
The exact metric names you'll see depend on the Prefect version and what the client metrics server exposes in your environment. After you enable the endpoint, visit `http://<host>:<port>/metrics` to inspect metric names/labels, then wire up these common panels:
- Flow runs started per minute (rate): start with a counter like `<your_metric>_flow_runs_started_total`, for example:
  `sum(rate(prefect_flow_runs_started_total[5m]))`
- Flow run success rate:
  `sum(rate(prefect_flow_runs_completed_total[5m])) / sum(rate(prefect_flow_runs_started_total[5m]))`
- Flow run duration P95: if a duration histogram is present (e.g., `prefect_flow_run_duration_seconds_bucket`):
  `histogram_quantile(0.95, sum(rate(prefect_flow_run_duration_seconds_bucket[5m])) by (le))`
- Late or backlogged runs: look for a lateness gauge/counter, for example:
  `sum(rate(prefect_flow_runs_late_total[5m]))`
  or, if it's a gauge:
  `sum(prefect_flow_runs_late)`
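If these queries get heavy on busy dashboards, you can precompute them with Prometheus recording rules. A sketch follows; the `prefect_*` metric names are the assumed names used above, so check them against your actual `/metrics` output first:

```yaml
groups:
  - name: prefect-recording
    interval: 1m
    rules:
      # precomputed flow-run success ratio over 5m
      - record: prefect:flow_run_success_ratio:rate5m
        expr: |
          sum(rate(prefect_flow_runs_completed_total[5m]))
            /
          sum(rate(prefect_flow_runs_started_total[5m]))
      # precomputed p95 flow-run duration over 5m
      - record: prefect:flow_run_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(prefect_flow_run_duration_seconds_bucket[5m])) by (le)
          )
```

Dashboards and alerts can then query the `prefect:*` series directly, which keeps panel load times flat as run volume grows.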
Tip: add labels like `deployment_name`, `work_pool_name`, and `flow_name` to your panels to slice per deployment.

References:
- Grafana + Prometheus guide

4) Alerts that actually help
Start with a small set of high-signal alerts:
- High failure rate over 15 minutes
```yaml
expr: (sum(rate(prefect_flow_runs_failed_total[15m])) / sum(rate(prefect_flow_runs_started_total[15m]))) > 0.2
for: 10m
labels:
  severity: warning
annotations:
  description: "Flow-run failure rate > 20% in the last 15m"
```
- Persistent lateness
```yaml
expr: sum(prefect_flow_runs_late) > 0
for: 15m
labels:
  severity: warning
annotations:
  description: "There are late flow runs for 15m+"
```
- Worker missing/metrics scrape gap
```yaml
expr: absent(up{job="prefect-client"} == 1)
for: 10m
labels:
  severity: critical
annotations:
  description: "No metrics from Prefect workers for 10m"
```
5) Production tips and gotchas
- Start metrics once per process: if you run both a worker and multiple flow-run containers, run a metrics server in each pod/container that you want scraped. In Kubernetes, each pod's port is isolated, so you won't get conflicts.
- Keep labels sane: avoid high-cardinality labels (e.g., per-run IDs) in your dashboards/alerts. Stick to `deployment_name`, `flow_name`, `work_pool_name`, `workspace`, `environment`.
- Use recording rules: precompute rates and quantiles for duration histograms to keep dashboards snappy (see the sketch in section 3).
- Secure the endpoint: restrict access to `/metrics` to your Prometheus server (network policies, sidecar auth if needed).
- Combine infra metrics: blend Prefect metrics with CPU/memory metrics (Kube State Metrics, cAdvisor/Node Exporter) to correlate performance with run behavior.

A note on "prefect-prometheus exporter"
- If you're referring to a specific exporter service that queries the Prefect API and exposes cross-workspace aggregated metrics, point me to the repo you're using. Prefect 3's built-in client metrics server is the most direct path to Prometheus; a separate exporter can be useful if you want one service to centrally compute org-level metrics by calling the Prefect API.
- Happy to help you wire that up too; just share the exporter you're targeting.

Questions so I can tailor this to you
1) Where are you running Prefect (Kubernetes, Docker, VMs)? Cloud or self-hosted?
2) Do you want metrics per worker/pod and per flow-run container, or just fleet-level aggregates?
3) When you say "prefect-prometheus exporter," are you using a specific repo/chart already? If so, which one?

Once I have those details, I can give you concrete manifests (K8s annotations, Prometheus scrape config, and a ready-to-import Grafana dashboard JSON) that match your setup.