# ask-marvin
@Marvin how to make the best use of prefect-prometheus exporter, prometheus and grafana for better monitoring of my deployments
Marvin (thought for 311.1 seconds):
Great question: you can get a lot of value by combining Prefect metrics with Prometheus and Grafana. Here's a practical blueprint to get you from zero to "useful dashboards and alerts" quickly. I'll assume Prefect 3.x; if you're on 2.x, let me know and I'll tweak the steps.

High-level approach
- Runtime metrics from your Prefect processes (workers and flow-run processes): expose a Prometheus `/metrics` endpoint using Prefect's client metrics server.
- Orchestration insights (success rate, duration, lateness), visualized in Grafana using PromQL queries against those metrics.
- Alerts for failures, latency, and backlogs via Prometheus alerting rules (and optionally routed into your on-call tool).

1) Expose a Prometheus /metrics endpoint from Prefect
There are two ways to get an endpoint the Prometheus server can scrape:

A. Start Prefect's client metrics server in your process
- This spins up a lightweight HTTP server (default path `/metrics`) inside the process that runs your flow or worker. Code pattern (start it once, early in your runtime):
```python
from prefect import flow
from prefect.services import start_client_metrics_server

# start a single instance per process, before any runs begin
start_client_metrics_server()


@flow
def my_flow():
    # your flow code
    ...


if __name__ == "__main__":
    my_flow()
```
- Defaults: the server takes no required parameters and binds to a conventional fixed port; you can configure the port via settings if needed (see below).
- Recommended: start this in your worker container entrypoint (or at the top of your flow module) so the endpoint is consistently available for Prometheus scraping.

B. Enable via settings (if you prefer config over code)
- Prefect 3 exposes client metrics settings behind a settings model (e.g., enablement and port). If you don't want to add an explicit `start_client_metrics_server()` call, you can control the behavior via settings/environment variables and start the server wherever makes sense.
- Common environment variables:
  - `PREFECT_CLIENT_METRICS_ENABLED=true`
  - `PREFECT_CLIENT_METRICS_PORT=4201` (example)
- If settings aren't picked up in your environment, fall back to calling `start_client_metrics_server()` in code as shown above.

References:
- Prefect 3 docs
- Prefect 3 settings reference
- Client metrics settings (for enable/port details)

2) Tell Prometheus how to scrape the metrics
Pick the service discovery strategy that matches your deployment:

A. Kubernetes (recommended)
- If you start the metrics server in your Prefect worker/runner pods, add the standard Prometheus scrape annotations to the pods or Service. Example pod annotations:
```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "4201"
    prometheus.io/path: "/metrics"
```
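These annotations only take effect if your Prometheus is configured to honor them (the prometheus-community Helm chart defaults do). If yours isn't, an annotation-driven scrape job along the following lines is the usual pattern; this is a sketch, and the job name is arbitrary:

```yaml
scrape_configs:
  - job_name: "prefect-client-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # only scrape pods that opt in via the prometheus.io/scrape annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # honor a custom metrics path if one is annotated
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      # rewrite the scrape address to use the annotated port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # carry useful pod metadata onto the scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```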
B. Static targets (VMs, Docker hosts, etc.)
- Add your workers/hosts as scrape targets. Prometheus scrape config:
```yaml
scrape_configs:
  - job_name: "prefect-client"
    metrics_path: /metrics
    static_configs:
      - targets:
          - "worker-1.example.com:4201"
          - "worker-2.example.com:4201"
```
References:
- Prometheus scrape configuration
- Kubernetes service discovery

3) Build useful Grafana dashboards
The exact metric names you'll see depend on the Prefect version and what the client metrics server exposes in your environment. After you enable the endpoint, visit `http://<host>:<port>/metrics` to inspect metric names/labels, then wire up these common panels:
- Flow runs started per minute (rate): start with a counter like `<your_metric>_flow_runs_started_total`, for example:
  `sum(rate(prefect_flow_runs_started_total[5m]))`
- Flow run success rate:
  `sum(rate(prefect_flow_runs_completed_total[5m])) / sum(rate(prefect_flow_runs_started_total[5m]))`
- Flow run duration P95: if a duration histogram is present (e.g., `prefect_flow_run_duration_seconds_bucket`):
  `histogram_quantile(0.95, sum(rate(prefect_flow_run_duration_seconds_bucket[5m])) by (le))`
- Late or backlogged runs: look for a lateness gauge/counter, for example:
  `sum(rate(prefect_flow_runs_late_total[5m]))`
  or, if it's a gauge:
  `sum(prefect_flow_runs_late)`
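If these queries get heavy on busy dashboards, you can precompute them with Prometheus recording rules. A sketch follows; the `prefect_*` metric names are the assumed names used above, so check them against your actual `/metrics` output first:

```yaml
groups:
  - name: prefect-recording
    interval: 1m
    rules:
      # precomputed flow-run success ratio over 5m
      - record: prefect:flow_run_success_ratio:rate5m
        expr: |
          sum(rate(prefect_flow_runs_completed_total[5m]))
            /
          sum(rate(prefect_flow_runs_started_total[5m]))
      # precomputed p95 flow-run duration over 5m
      - record: prefect:flow_run_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum(rate(prefect_flow_run_duration_seconds_bucket[5m])) by (le)
          )
```

Dashboards and alerts can then query the `prefect:*` series directly, which keeps panel load times flat as run volume grows.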
Tip: add labels like `deployment_name`, `work_pool_name`, and `flow_name` to your panels to slice per deployment.

References:
- Grafana + Prometheus guide

4) Alerts that actually help
Start with a small set of high-signal alerts:
- High failure rate over 15 minutes
```yaml
expr: (sum(rate(prefect_flow_runs_failed_total[15m])) / sum(rate(prefect_flow_runs_started_total[15m]))) > 0.2
for: 10m
labels:
  severity: warning
annotations:
  description: "Flow-run failure rate > 20% in the last 15m"
```
- Persistent lateness
```yaml
expr: sum(prefect_flow_runs_late) > 0
for: 15m
labels:
  severity: warning
annotations:
  description: "There are late flow runs for 15m+"
```
- Worker missing/metrics scrape gap
```yaml
expr: absent(up{job="prefect-client"} == 1)
for: 10m
labels:
  severity: critical
annotations:
  description: "No metrics from Prefect workers for 10m"
```
5) Production tips and gotchas
- Start metrics once per process: if you run both a worker and multiple flow-run containers, run a metrics server in each pod/container that you want scraped. In Kubernetes, each pod's port is isolated, so you won't get conflicts.
- Keep labels sane: avoid high-cardinality labels (e.g., per-run IDs) in your dashboards/alerts. Stick to `deployment_name`, `flow_name`, `work_pool_name`, `workspace`, `environment`.
- Use recording rules: precompute rates and quantiles for duration histograms to keep dashboards snappy (see the sketch in section 3).
- Secure the endpoint: restrict access to `/metrics` to your Prometheus server (network policies, sidecar auth if needed).
- Combine infra metrics: blend Prefect metrics with CPU/memory metrics (Kube State Metrics, cAdvisor/Node Exporter) to correlate performance with run behavior.

A note on "prefect-prometheus exporter"
- If you're referring to a specific exporter service that queries the Prefect API and exposes cross-workspace aggregated metrics, point me to the repo you're using. Prefect 3's built-in client metrics server is the most direct path to Prometheus; a separate exporter can be useful if you want one service to centrally compute org-level metrics by calling the Prefect API.
- Happy to help you wire that up too; just share the exporter you're targeting.

Questions so I can tailor this to you
1) Where are you running Prefect (Kubernetes, Docker, VMs)? Cloud or self-hosted?
2) Do you want metrics per worker/pod and per flow-run container, or just fleet-level aggregates?
3) When you say "prefect-prometheus exporter," are you using a specific repo/chart already? If so, which one?

Once I have those details, I can give you concrete manifests (K8s annotations, Prometheus scrape config, and a ready-to-import Grafana dashboard JSON) that match your setup.