Daniel Chan
07/01/2025, 2:29 AMMarvin
07/01/2025, 2:29 AMMarvin
07/01/2025, 2:30 AM
You still need to call start_client_metrics_server() even when PREFECT_CLIENT_METRICS_ENABLED is set to true. Here's why:
1. PREFECT_CLIENT_METRICS_ENABLED
only enables the collection of metrics
2. start_client_metrics_server()
is required to actually expose those metrics via an HTTP endpoint that Prometheus can scrape
Here's a typical setup in your flow code:
python
from prefect import flow
from prefect.utilities.services import start_client_metrics_server
# Start the metrics server
start_client_metrics_server()
@flow
def my_flow():
    # Your flow code here
    pass
Make sure you've also set:
- PREFECT_CLIENT_METRICS_ENABLED=true
in your environment
- PREFECT_CLIENT_METRICS_PORT
(optional, defaults to 9090) to specify the port where metrics will be exposed
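For local testing, a minimal sketch (an illustration, not required setup - it assumes the environment variables are applied before Prefect resolves its settings) that keeps the configuration and the server start in one file could look like:
python
import os

# Set before importing anything from Prefect so the settings are picked up.
# In Kubernetes you would normally set these on the flow run job spec instead.
os.environ.setdefault("PREFECT_CLIENT_METRICS_ENABLED", "true")
# PREFECT_CLIENT_METRICS_PORT can also be set here if you need a non-default port.

from prefect import flow
from prefect.utilities.services import start_client_metrics_server

start_client_metrics_server()

@flow
def my_flow():
    pass

if __name__ == "__main__":
    my_flow()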
Once this is set up, your PodMonitor will be able to scrape the metrics from the exposed endpoint. The metrics will be available at http://<pod-ip>:<metrics-port>/metrics.
This separation of enabling metrics collection and starting the server allows for more flexibility in how and when you expose metrics, which can be particularly useful in different deployment scenarios.
Daniel Chan
07/01/2025, 2:32 AMMarvin
07/01/2025, 2:32 AMMarvin
07/01/2025, 2:33 AM
The PREFECT_CLIENT_METRICS_ENABLED setting should be configured in your flow/job configuration, not the worker configuration. Here's why:
1. The metrics are collected for the specific flow run (job) execution, tracking things like flow run duration, task run states, etc.
2. The worker itself doesn't generate these metrics - it just orchestrates the execution of flows
3. Each flow run operates as its own process with its own metrics server
Therefore, you should:
1. Set PREFECT_CLIENT_METRICS_ENABLED=true
in your flow's deployment configuration or job environment variables
2. Include the start_client_metrics_server()
call in your flow code
3. Configure your PodMonitor to target the pods running your flows, not the worker pods
For example, in a Kubernetes deployment configuration:
yaml
apiVersion: prefect.io/v1alpha1
kind: PrefectDeployment
metadata:
  name: my-flow
spec:
  flow:
    name: my-flow
    env:
      - name: PREFECT_CLIENT_METRICS_ENABLED
        value: "true"
      - name: PREFECT_CLIENT_METRICS_PORT
        value: "9090" # optional, defaults to 9090
This ensures that the metrics are collected and exposed for each individual flow run, giving you granular monitoring capabilities for your workflows.
Daniel Chan
07/01/2025, 2:34 AMMarvin
07/01/2025, 2:34 AMMarvin
07/01/2025, 2:34 AM
The default for PREFECT_CLIENT_METRICS_PORT is 4201, not 9090 as I incorrectly stated. I apologize for providing incorrect information.
To correct my earlier response: if you don't explicitly set PREFECT_CLIENT_METRICS_PORT, Prefect will expose the metrics endpoint on port 4201. If you need to change this (for example, if there's a port conflict), you can set it to a different value in your configuration.
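To confirm which port the metrics server is actually listening on, a quick check (a sketch assuming the flow run process is reachable on localhost, e.g. during local testing) is to request the endpoint directly:
python
import urllib.request

# 4201 is the default PREFECT_CLIENT_METRICS_PORT; change this if you override the setting.
with urllib.request.urlopen("http://localhost:4201/metrics", timeout=5) as resp:
    body = resp.read().decode("utf-8")

# Print the first few lines to confirm Prometheus-format metrics are being served.
print("\n".join(body.splitlines()[:10]))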
Thank you for the correction. It's important to be precise with these configuration details to avoid confusion or deployment issues.
Daniel Chan
07/01/2025, 2:34 AMMarvin
07/01/2025, 2:34 AMMarvin
07/01/2025, 2:35 AM
PREFECT_LOGGING_ORION_ENABLED is actually a Prefect 2.x setting and is not used in Prefect 3.x. In Prefect 3.x, the equivalent functionality is controlled by PREFECT_LOGGING_TO_API_ENABLED
, which is enabled by default.
For your metrics collection use case, you only need:
1. PREFECT_CLIENT_METRICS_ENABLED=true
2. start_client_metrics_server()
in your flow code
3. Optionally, PREFECT_CLIENT_METRICS_PORT
if you want to change the default port (4201)
You don't need to configure any additional logging settings unless you have specific requirements for how your flow logs are handled. The default logging configuration in Prefect 3.x will automatically send logs to the Prefect API.Daniel Chan
07/01/2025, 2:38 AMMarvin
07/01/2025, 2:38 AMMarvin
07/01/2025, 2:38 AM
yaml
# NetworkPolicy (if you're using network policies)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-prometheus-scraping
spec:
  podSelector:
    matchLabels:
      # match your Prefect flow pod labels
      app: your-prefect-flow
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # adjust based on where your Prometheus is running
              kubernetes.io/metadata.name: monitoring
      ports:
        - protocol: TCP
          port: 4201
---
# PodMonitor
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: prefect-flow-monitor
spec:
  selector:
    matchLabels:
      # match your Prefect flow pod labels
      app: your-prefect-flow
  podMetricsEndpoints:
    - port: metrics # or the name you've given to the port in your pod spec
      path: /metrics
Make sure your Prefect flow pod specification includes the port:
yaml
spec:
  containers:
    - name: flow
      ports:
        - name: metrics
          containerPort: 4201
Since this is all internal Kubernetes networking, you don't need to modify any external firewall rules unless:
1. You're trying to scrape metrics from outside the cluster
2. Your Prometheus instance is running outside the cluster
3. You have very restrictive NetworkPolicies in place
In those cases, you would need to adjust the rules accordingly, but for standard in-cluster monitoring, the above configurations should be sufficient.Daniel Chan
07/01/2025, 2:51 AMMarvin
07/01/2025, 2:51 AMMarvin
07/01/2025, 2:53 AM
Once you've added start_client_metrics_server() to your flow code, you only need to:
1. Ensure the Prefect job/flow run pod exposes port 4201
2. Configure your PodMonitor to watch the namespace where your Prefect jobs run
You don't need to expose port 4201 anywhere else in your workflow. The metrics are generated and exposed only by the flow run pods themselves, not by any other Prefect components (like workers or the Prefect server).
The typical setup would look like this:
1. In your flow code:
python
from prefect import flow
from prefect.utilities.services import start_client_metrics_server
start_client_metrics_server()
@flow
def my_flow():
    # your flow code here
    pass
2. In your PodMonitor configuration (pointing at your flow run pods):
yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: prefect-flow-monitor
spec:
  selector:
    matchLabels:
      # whatever labels identify your Prefect flow run pods
      prefect.io/flow-run-name: ""
  namespaceSelector:
    matchNames:
      - your-prefect-jobs-namespace
  podMetricsEndpoints:
    - port: metrics # or the actual port number 4201
      path: /metrics
That's all you need - the Prometheus operator will handle the rest of the metrics collection process through Kubernetes' internal networking.
Daniel Chan
07/01/2025, 3:56 AMMarvin
07/01/2025, 3:56 AMMarvin
07/01/2025, 3:58 AM
For work queue metrics like late_runs_count, you would need to use the prometheus-prefect-exporter or a similar solution that scrapes the Prefect API. Here's why:
1. The client metrics server (enabled via start_client_metrics_server()
) only exposes metrics about the specific flow run instance it's running in
2. Work queue metrics, including late run counts, are API-level metrics that require querying the Prefect API
3. The prefect_info_work_queues
metric with its late_runs_count
label is specifically provided by the prometheus-prefect-exporter
This creates two distinct types of metrics collection in your Prefect setup:
1. Flow Run Metrics (via start_client_metrics_server()
):
- Individual flow run performance
- Task states
- Runtime metrics
- Exposed directly on port 4201
2. API/System Metrics (via prometheus-prefect-exporter):
- Work queue statistics
- Late run counts
- Global system metrics
- Requires API scraping
If work queue metrics are important for your monitoring needs, you'll need to set up both:
- PodMonitor for flow run metrics
- prometheus-prefect-exporter for API-level metrics like work queue statistics
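If you want to spot-check what the exporter reports for work queues before wiring up dashboards, a small sketch (the host and port are assumptions - 8000 is the exporter's default METRICS_PORT) is to filter its /metrics output:
python
import urllib.request

EXPORTER_URL = "http://localhost:8000/metrics"  # adjust to wherever the exporter runs

with urllib.request.urlopen(EXPORTER_URL, timeout=10) as resp:
    text = resp.read().decode("utf-8")

# Show only the work queue series, which carry labels such as late_runs_count.
for line in text.splitlines():
    if line.startswith("prefect_info_work_queues"):
        print(line)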
This way you get a complete view of both the individual flow runs and the system-wide metrics like work queue performance.
Daniel Chan
07/01/2025, 4:26 AMMarvin
07/01/2025, 4:26 AMMarvin
07/01/2025, 4:27 AM
The fact that end_time is showing as None in the Prometheus metrics despite flows being completed in the UI does seem unexpected. The prefect_info_flow_runs metric should populate end_time for completed flow runs, just as it populates start_time.
Since you're seeing this discrepancy between the UI (showing completion) and the Prometheus metrics (showing None for end_time), there are a few things we should verify:
1. Are you using the latest version of the prometheus-prefect-exporter?
2. Can you verify that the flow runs are actually showing end times in the Prefect API directly? (You can check this via the API endpoint /flow_runs/{id} - see the sketch after this list)
3. Are there any errors in the prometheus-prefect-exporter logs?
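For point 2, here's a small sketch of checking the raw API record for a single flow run (the base URL comes from PREFECT_API_URL; add an auth header if your server requires one, and the flow run ID is a placeholder):
python
import json
import os
import urllib.request

api_url = os.environ.get("PREFECT_API_URL", "http://localhost:4200/api")
flow_run_id = "<your-flow-run-id>"  # hypothetical placeholder

# GET /flow_runs/{id} returns the stored flow run, including start_time and end_time.
with urllib.request.urlopen(f"{api_url}/flow_runs/{flow_run_id}", timeout=10) as resp:
    flow_run = json.load(resp)

print(flow_run["state_type"], flow_run["start_time"], flow_run["end_time"])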
This might be a bug in either:
- The prometheus-prefect-exporter's handling of end times
- The way the exporter is querying the Prefect API
- The API response formatting for end times
Would you be able to:
1. Check the version of your prometheus-prefect-exporter
2. Share a sample of the raw Prometheus metrics output for a completed flow run?
This would help determine if this is a known issue or if we should report it as a new bug.Daniel Chan
07/01/2025, 4:30 AMMarvin
07/01/2025, 4:30 AMMarvin
07/01/2025, 4:30 AM
There are a couple of ways to look at the raw metrics output:
1. Directly from the exporter with curl:
bash
curl http://<your-exporter-host>:9090/metrics | grep prefect_info_flow_runs
2. Through Prometheus UI:
- Go to your Prometheus UI
- Use the Query tab
- Enter the query: prefect_info_flow_runs
- Click "Execute"
- Switch to the "Table" view to see all labels and values
If you're running in Kubernetes, you can also get the metrics directly from the exporter pod:
bash
# First find the pod name
kubectl get pods -n <namespace> | grep prometheus-prefect-exporter
# Then curl the metrics endpoint from the pod
kubectl exec -n <namespace> <exporter-pod-name> -- curl localhost:9090/metrics | grep prefect_info_flow_runs
The output should look something like this:
prefect_info_flow_runs{deployment_id="...",flow_id="...",flow_name="...",flow_run_id="...",flow_run_name="...",start_time="...",end_time="...",state="..."} 1
This will help us verify if the end_time field is consistently showing as None and under what conditions. Since you're using version 1.7.0, which is the latest version, this might indicate a potential issue that needs to be reported.
Daniel Chan
07/01/2025, 4:40 AM
curl http://localhost:8000/metrics | grep prefect_info_flow_runs
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 13172 100 13172 0 0 16389 0 --:--:-- --:--:-- --:--:-- 16403
# HELP prefect_info_flow_runs Prefect flow runs info
# TYPE prefect_info_flow_runs gauge
prefect_info_flow_runs{created="2025-07-01T04:36:22.340281Z",deployment_id="25a4caac-2326-4c53-9037-4ce3fbd67427",deployment_name="nucleus_ingestion",end_time="2025-07-01T04:38:08.870199Z",flow_id="14f42903-d3e9-4ffb-bab0-0d14d5af08bb",flow_name="Nucleus Ingestion",flow_run_id="ec94907b-ad24-4da6-a0ef-a85cd3323cb9",flow_run_name="dashing-hummingbird",run_count="1",start_time="2025-07-01T04:36:57.346805Z",state_id="df5e3c7d-76bf-413d-9667-259204462ffa",state_name="Completed",total_run_time="71.523394",work_queue_name="sideflows"} 0.0
prefect_info_flow_runs{created="2025-07-01T04:36:23.722145Z",deployment_id="25a4caac-2326-4c53-9037-4ce3fbd67427",deployment_name="nucleus_ingestion",end_time="2025-07-01T04:38:15.269128Z",flow_id="14f42903-d3e9-4ffb-bab0-0d14d5af08bb",flow_name="Nucleus Ingestion",flow_run_id="e26bb1ac-bf6a-4cf7-be10-9df7b106515b",flow_run_name="tan-barnacle",run_count="1",start_time="2025-07-01T04:37:00.427239Z",state_id="93e8a5d7-c9ff-4217-bcdb-9d70a242c63e",state_name="Completed",total_run_time="74.841889",work_queue_name="sideflows"} 0.0
prefect_info_flow_runs{created="2025-07-01T04:36:23.657403Z",deployment_id="25a4caac-2326-4c53-9037-4ce3fbd67427",deployment_name="nucleus_ingestion",end_time="2025-07-01T04:37:43.390337Z",flow_id="14f42903-d3e9-4ffb-bab0-0d14d5af08bb",flow_name="Nucleus Ingestion",flow_run_id="db75d2a3-0f47-4dde-a199-74b0fd4b1021",flow_run_name="blazing-turkey",run_count="1",start_time="2025-07-01T04:36:44.216575Z",state_id="9c695f64-a224-49ac-921b-33487fc56096",state_name="Completed",total_run_time="59.173762",work_queue_name="sideflows"} 0.0
prefect_info_flow_runs{created="2025-07-01T04:36:20.536407Z",deployment_id="25a4caac-2326-4c53-9037-4ce3fbd67427",deployment_name="nucleus_ingestion",end_time="2025-07-01T04:37:38.261706Z",flow_id="14f42903-d3e9-4ffb-bab0-0d14d5af08bb",flow_name="Nucleus Ingestion",flow_run_id="5fd40ae3-9af2-4f1c-9fea-7436639a5fa2",flow_run_name="xanthic-orca",run_count="1",start_time="2025-07-01T04:36:38.388683Z",state_id="4399abf6-5d2b-48c4-9f0f-874cf0500c49",state_name="Completed",total_run_time="59.873023",work_queue_name="sideflows"} 0.0
prefect_info_flow_runs{created="2025-07-01T04:36:22.390574Z",deployment_id="25a4caac-2326-4c53-9037-4ce3fbd67427",deployment_name="nucleus_ingestion",end_time="2025-07-01T04:37:41.802947Z",flow_id="14f42903-d3e9-4ffb-bab0-0d14d5af08bb",flow_name="Nucleus Ingestion",flow_run_id="343402e2-3321-4b55-8599-09d62e0592ff",flow_run_name="loose-clam",run_count="1",start_time="2025-07-01T04:36:42.445354Z",state_id="44e45d89-29e5-4269-9998-8db78d48f12d",state_name="Completed",total_run_time="59.357593",work_queue_name="sideflows"} 0.0
Interestingly, some of the flow_runs do have an end_time defined, but I don't see all of the end_times in the labels for my prefect_info_flow_runs.
Is it possible that my pod monitoring scraping every 15 seconds did not update the flow run? Should I reduce the interval for how often my PodMonitoring should scrape the metrics endpoint for my flow runs?
Marvin
07/01/2025, 4:40 AMMarvin
07/01/2025, 4:40 AM
Looking at your output, several of your flow runs do have their end_time values populated. For example:
flow_run_name="dashing-hummingbird",
start_time="2025-07-01T04:36:57.346805Z",
end_time="2025-07-01T04:38:08.870199Z",
state_name="Completed"
If you're not seeing end times for some flow runs, there could be a few reasons:
1. Scrape Interval Impact: Yes, the 15-second scrape interval could mean you might miss some state transitions. If a flow run starts and completes within that 15-second window, you might not capture all states. You could try reducing the interval, for example:
yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: prefect-flow-monitor
spec:
  podMetricsEndpoints:
    - interval: 5s # More frequent scraping
      port: metrics
      path: /metrics
2. Metric Retention: The prometheus-prefect-exporter might only keep metrics for a certain time window. Flow runs that completed outside this window might not appear in the metrics at all.
3. Batch Updates: The prometheus-prefect-exporter (v1.7.0) might batch its updates to the Prefect API, meaning there could be a slight delay between when a flow completes and when the metrics are updated.
To troubleshoot:
1. Try reducing the scrape interval first
2. Compare the timestamps of the flow runs you're not seeing with those that are appearing
3. Check if there's a pattern to which flow runs are missing end times (e.g., very short-running flows vs longer ones)
Would you be able to share an example of a flow run that you know is completed in the UI but isn't showing an end time in these metrics? That would help identify if there's a specific pattern to when this occurs.Daniel Chan
07/01/2025, 5:00 AMMarvin
07/01/2025, 5:00 AMMarvin
07/01/2025, 5:01 AM
2. Check the prometheus-prefect-exporter logs:
bash
# Get the exporter logs
kubectl logs <prometheus-prefect-exporter-pod> -n <namespace>
This might show if there are any errors or warnings during metric collection.
3. You can also check the specific flow runs in the Prefect API directly:
python
from prefect import get_client

async with get_client() as client:
    flow_run = await client.read_flow_run("<flow-run-id>")
    print(flow_run.end_time)
This would help verify if the end times are properly recorded in Prefect itself versus potentially being an issue with how the exporter is collecting or reporting the metrics.
If you can share:
1. The time range between when these flows completed
2. Whether the flows missing end times have any common characteristics
3. Whether this happens consistently for the same flows or if it's random
This would help narrow down the cause of the inconsistency.Daniel Chan
07/01/2025, 5:02 AMMarvin
07/01/2025, 5:02 AMMarvin
07/01/2025, 5:03 AM
yaml
# In your prometheus-prefect-exporter deployment
env:
  - name: PREFECT_API_POLL_INTERVAL
    value: "10" # Decrease the interval (in seconds) to catch more updates
2. Increase Prometheus Scrape Frequency
yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
spec:
  podMetricsEndpoints:
    - interval: 5s # More frequent scraping
      port: metrics
      path: /metrics
3. Check Exporter Logs for Timing
bash
kubectl logs -f <prometheus-prefect-exporter-pod> -n <namespace>
Look for patterns in when updates are being made and if there are any gaps.
This might be worth reporting as an issue to the prometheus-prefect-exporter repository, as identical flows from the same deployment should ideally show consistent behavior in metrics reporting. Could you try running a few more instances of the same deployment and monitor:
1. Are the missing end times random or do they follow a pattern?
2. Does increasing the scrape frequency help capture more end times?
3. Do the logs show any errors or warnings during the times when end times are missing?
This information would be valuable for troubleshooting and potentially improving the exporter's reliability.
Daniel Chan
07/01/2025, 5:06 AM
PREFECT_API_POLL_INTERVAL?
Marvin
07/01/2025, 5:06 AMMarvin
07/01/2025, 5:07 AM
You're right to question PREFECT_API_POLL_INTERVAL - I cannot find documentation that confirms this environment variable exists or works with the prometheus-prefect-exporter. I should not have suggested a configuration option without being able to verify it in the documentation.
Instead, I would recommend:
1. Check the prometheus-prefect-exporter GitHub repository for documented configuration options
2. If you can't find the information there, you might want to:
- Open an issue asking about configuration options for controlling the polling interval
- Ask about best practices for handling inconsistent end time reporting
I apologize for any confusion my previous response may have caused by suggesting an unverified configuration option.Daniel Chan
07/01/2025, 3:24 PMMarvin
07/01/2025, 3:24 PMMarvin
07/01/2025, 3:25 PMDaniel Chan
07/01/2025, 3:26 PMMarvin
07/01/2025, 3:26 PMMarvin
07/01/2025, 3:27 PMDaniel Chan
07/01/2025, 3:32 PMMarvin
07/01/2025, 3:32 PMMarvin
07/01/2025, 3:32 PMDaniel Chan
07/01/2025, 3:35 PM
# Calculate timestamps for before and after data
after_data = datetime.now(timezone.utc) - timedelta(minutes=offset_minutes)
self.after_data_fmt = after_data.strftime("%Y-%m-%dT%H:%M:%S.%fZ")
Marvin
07/01/2025, 3:35 PMMarvin
07/01/2025, 3:36 PM
1. after_data represents the earliest time from which you want to collect metrics
- It's calculated as: current time minus the offset_minutes
- This becomes your starting point for data collection
2. The code is formatting this timestamp into ISO 8601 format (%Y-%m-%dT%H:%M:%S.%fZ)
- Example: If current time is 2:00 PM and offset_minutes=30
- after_data would be 1:30 PM
- This tells Prefect "get me all data from 1:30 PM onwards"
Despite the comment mentioning "before and after data", the code snippet only shows the "after" calculation. This is because:
- "After" sets the starting point of your data collection window
- The implicit "before" point is typically the current time
- Together they form a time window: "get data between after_data and now"
So if you had:
python
from datetime import datetime, timedelta, timezone

offset_minutes = 60
# If the current time is 2:00 PM:
now = datetime.now(timezone.utc)
after_data = now - timedelta(minutes=offset_minutes)  # -> 1:00 PM
# Your query window would be: 1:00 PM (after_data) to 2:00 PM (now)
This is commonly used in monitoring systems to create rolling time windows for metric collection, ensuring you don't miss any data while also not collecting redundant information.
Daniel Chan
07/01/2025, 3:39 PMMarvin
07/01/2025, 3:39 PMMarvin
07/01/2025, 3:39 PM
With `offset_minutes = 0`:
python
# If current time is 2:00 PM
after_data = 2:00 PM # (now - 0 minutes)
# Query window: "Get all flow runs from 2:00 PM onwards"
With `offset_minutes = 120`:
python
# If current time is 2:00 PM
after_data = 12:00 PM # (now - 120 minutes)
# Query window: "Get all flow runs from 12:00 PM onwards"
Here's what's crucial to understand:
1. This is just a query window start point
- It doesn't affect when flows run
- It doesn't affect how flows are recorded
- It only affects which flows the exporter will query for metrics
2. The Prometheus scraper runs repeatedly
- If scrape interval is 30 seconds, it queries every 30 seconds
- Each query gets the latest state of all flows
- No data is permanently missed
Example timeline:
2:00 PM: Flow A starts
2:01 PM: Prometheus scrapes (offset=0)
→ Sees Flow A in "Running" state
2:02 PM: Flow A completes
2:02 PM: Prometheus scrapes (offset=0)
→ Sees Flow A in "Completed" state
The offset doesn't affect whether you see the flow - it affects how far back you look each time you query.
Why NOT to use a large offset:
1. Redundant Data Collection
- Each scrape re-queries the same historical data
- With offset=120, you're repeatedly pulling 2 hours of history
- Same flows get queried multiple times unnecessarily
2. Resource Usage
- Larger queries = more API load
- More memory usage processing larger result sets
- Potential performance impact on your Prefect server
3. Prometheus Best Practices
- Prometheus is designed for current state and recent history
- Long historical lookbacks are better handled by proper retention policies
Recommended Approach:
1. Start with a small offset (0-5 minutes)
2. Adjust your scrape interval to match your needs
3. Use Prometheus's built-in retention and recording rules for historical data
The offset is not a "safety net" - it's a query parameter. You won't miss data with offset=0 as long as your scrape interval is appropriate for your workflow frequency.Daniel Chan
07/01/2025, 3:54 PMMarvin
07/01/2025, 3:54 PMMarvin
07/01/2025, 3:55 PM
With SCRAPE_INTERVAL_SECONDS=120 and `OFFSET_MINUTES=0`:
2:00 PM: Scrape happens
Query window: "Get flows from 2:00 PM onwards"
2:01 PM: Flow A runs and completes
2:02 PM: Flow B runs and completes
2:02 PM: Scrape happens
Query window: "Get flows from 2:02 PM onwards"
→ MISSES both Flow A and Flow B!
This is why the OFFSET_MINUTES window should always cover more time than your SCRAPE_INTERVAL_SECONDS! Here's the correct setup:
With SCRAPE_INTERVAL_SECONDS=120 and `OFFSET_MINUTES=3`:
2:00 PM: Scrape happens
Query window: "Get flows from 1:57 PM onwards"
2:01 PM: Flow A runs and completes
2:02 PM: Flow B runs and completes
2:02 PM: Scrape happens
Query window: "Get flows from 1:59 PM onwards"
→ CAPTURES both Flow A and Flow B
Best Practice Rule:
OFFSET_MINUTES should be > (SCRAPE_INTERVAL_SECONDS / 60)
So for different scrape intervals:
- If `SCRAPE_INTERVAL_SECONDS=30`: OFFSET_MINUTES should be > 0.5
- If `SCRAPE_INTERVAL_SECONDS=60`: OFFSET_MINUTES should be > 1
- If `SCRAPE_INTERVAL_SECONDS=120`: OFFSET_MINUTES should be > 2
- If `SCRAPE_INTERVAL_SECONDS=300`: OFFSET_MINUTES should be > 5
To be safe, add a small buffer. For example, if your scrape interval is 120 seconds (2 minutes), set OFFSET_MINUTES to 3 or 4 to account for:
- Any slight delays in scraping
- API response time
- Network latency
- Clock skew between servers
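As a back-of-the-envelope helper (a sketch based on the rule of thumb above, not something the exporter provides), you could derive a safe offset from a given scrape interval like this:
python
import math

def suggested_offset_minutes(scrape_interval_seconds: int, buffer_minutes: int = 1) -> int:
    # Cover at least one full scrape interval, then add a buffer for scrape delays,
    # API response time, network latency, and clock skew.
    return math.ceil(scrape_interval_seconds / 60) + buffer_minutes

print(suggested_offset_minutes(30))   # -> 2
print(suggested_offset_minutes(120))  # -> 3
print(suggested_offset_minutes(300))  # -> 6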
Thank you for this question - it helped expose a critical relationship that I should have explained in my previous responses. This is exactly why OFFSET_MINUTES exists and why setting it to 0 can indeed cause you to miss metrics when used with longer scrape intervals.
Daniel Chan
07/01/2025, 3:57 PMMarvin
07/01/2025, 3:57 PMMarvin
07/01/2025, 3:58 PM
SCRAPE_INTERVAL_SECONDS = 5
OFFSET_MINUTES = 1
This configuration can work, but with some considerations:
Advantages:
- 1-minute lookback window is reasonable
- Reduces data volume compared to larger offsets
- Provides very fresh metrics
Potential Risks:
- 5-second interval might be aggressive depending on your scale
- Could put significant load on your Prefect API
- Might not give enough time for API to process between scrapes
Safer Alternative:
SCRAPE_INTERVAL_SECONDS = 15 # or 30
OFFSET_MINUTES = 1
Recommendation:
1. Start with a more conservative interval (15 or 30 seconds)
2. Monitor these metrics:
- Scrape duration
- API response times
- Server load
3. If everything looks good, gradually decrease the interval
4. Keep OFFSET_MINUTES=1 as it's a reasonable window
Remember:
- Shorter intervals = more real-time data but more API load
- The offset window of 1 minute is good - it gives enough overlap without excessive data retrieval
- Monitor scrape_duration_seconds
metric in Prometheus to see how long scrapes actually take
- If scrapes consistently take longer than your interval, increase the interval
The key is finding the balance between metric freshness and system stability. Your 1-minute offset is sensible, but consider starting with a more conservative scrape interval and adjusting based on observed performance.Daniel Chan
07/01/2025, 4:01 PM
# Run the loop to collect Prefect metrics
while True:
    time.sleep(scrape_interval_seconds)
Marvin
07/01/2025, 4:02 PMMarvin
07/01/2025, 4:02 PM
You've raised an important point - the while True loop with time.sleep() does NOT prevent concurrent scrapes. This is significantly different from how I described it.
Let's break down what actually happens:
python
while True:
    time.sleep(scrape_interval_seconds)
    # Scrape starts here
If `scrape_interval_seconds = 5`:
Time 0:00 - Scrape A starts
Time 0:05 - New scrape B starts (even if A is still running!)
Time 0:10 - New scrape C starts (even if A or B are still running!)
This means:
1. The exporter WILL start new scrapes regardless of whether previous scrapes finished
2. Multiple scrapes could run concurrently, potentially:
- Overwhelming your Prefect API
- Consuming excessive memory
- Creating race conditions
- Causing inconsistent metric collection
Your concern is completely valid. This implementation means we need to be even more careful with the configuration:
More Conservative Recommendation:
1. Start with larger values:
SCRAPE_INTERVAL_SECONDS = 30 # or higher
OFFSET_MINUTES = 2 # provides adequate overlap
2. Monitor your system for:
- CPU usage of the exporter
- Memory usage
- Prefect API response times
- Number of concurrent API requests
3. If you need to use shorter intervals:
- Consider implementing a proper concurrency control mechanism
- Monitor for overlapping scrapes
- Watch for memory leaks or resource exhaustion
This is a significant correction to my previous response, and it makes your original question even more important. The simple while True loop with time.sleep() is not a robust way to handle concurrent operations, and users should be aware of this limitation when configuring the exporter.
Thank you for bringing this to my attention - it's a crucial detail that completely changes the recommendations for safe configuration of the exporter.
Daniel Chan
07/01/2025, 4:03 PM
How does the while True loop allow Prefect to pause between scrapes? Is there some sort of background process that is firing frequently, and only paused when there is a time.sleep() occurring?
Marvin
07/01/2025, 4:04 PMMarvin
07/01/2025, 4:04 PM
The while True loop in the prometheus-prefect-exporter works in a very straightforward way:
1. Single Thread Execution:
python
while True:
    time.sleep(scrape_interval_seconds)  # Thread blocks here
    # Scrape code runs here
    # When done, loops back to sleep
There is no background process. Instead:
- The program runs in a single thread
- time.sleep()
literally blocks that thread for the specified duration
- Nothing else happens during the sleep
- When sleep ends, the scrape code runs
- After scrape completes, it loops back to sleep again
Here's a detailed timeline:
0:00:00 - Loop starts
0:00:00 - Hits time.sleep(5) → Thread blocks for 5 seconds
0:00:05 - Sleep ends
0:00:05 - Scrape starts
0:00:07 - Scrape finishes (took 2 seconds)
0:00:07 - Hits time.sleep(5) again → Blocks for 5 seconds
0:00:12 - Sleep ends
0:00:12 - Next scrape starts
This is actually problematic because:
1. If a scrape takes 2 seconds, you're actually getting intervals of 7 seconds (5s sleep + 2s scrape)
2. The interval isn't consistent - it depends on how long each scrape takes
3. There's no guarantee about timing precision
A more robust implementation would:
- Use a proper scheduler (like asyncio or apscheduler)
- Account for scrape duration in the interval timing
- Handle concurrent scrapes properly
- Use proper thread/process management
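For illustration only (a generic sketch of the "account for scrape duration" idea, not the exporter's actual code), a single-threaded loop that keeps a steady cadence could look like:
python
import time

def run_forever(collect, interval_seconds: float) -> None:
    """Call collect() on a fixed cadence, sleeping only for the time left in the interval."""
    while True:
        started = time.monotonic()
        collect()  # finishes before the next cycle starts, so cycles never overlap
        elapsed = time.monotonic() - started
        # Sleep whatever remains of the interval; don't sleep at all if the
        # collection itself took longer than the interval.
        time.sleep(max(0.0, interval_seconds - elapsed))

# Example usage:
# run_forever(lambda: print("collect metrics here"), interval_seconds=30)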
This simple loop implementation means that:
- The "pause" is just the thread being blocked
- There's no background process - it's just sequential execution
- The actual interval between scrape starts will always be longer than your configured scrape_interval_seconds
by however long the scrape takes
- If a scrape takes longer than your interval, you'll get concurrent scrapes because there's no mechanism to prevent them
This helps explain why setting very short intervals can be dangerous - you could easily end up with multiple overlapping scrapes if they take longer than your interval to complete.Daniel Chan
07/01/2025, 4:05 PM
I thought the while True loop allows the prometheus-prefect-exporter to collect metrics on a scheduled basis.
Can you help me understand the main.py code?
import os
import base64
import logging
import time
import uuid

from metrics.metrics import PrefectMetrics
from metrics.healthz import PrefectHealthz
from prometheus_client import start_http_server, REGISTRY


def metrics():
    """
    Main entry point for the PrefectMetrics exporter.
    """
    # Get environment variables or use default values
    loglevel = str(os.getenv("LOG_LEVEL", "INFO"))
    max_retries = int(os.getenv("MAX_RETRIES", "3"))
    metrics_addr = os.getenv("METRICS_ADDR", "0.0.0.0")
    metrics_port = int(os.getenv("METRICS_PORT", "8000"))
    offset_minutes = int(os.getenv("OFFSET_MINUTES", "3"))
    url = str(os.getenv("PREFECT_API_URL", "http://localhost:4200/api"))
    api_key = str(os.getenv("PREFECT_API_KEY", ""))
    api_auth_string = str(os.getenv("PREFECT_API_AUTH_STRING", ""))
    csrf_client_id = str(uuid.uuid4())
    scrape_interval_seconds = int(os.getenv("SCRAPE_INTERVAL_SECONDS", "30"))

    # Configure logging
    logging.basicConfig(
        level=loglevel, format="%(asctime)s - %(name)s - [%(levelname)s] %(message)s"
    )
    logger = logging.getLogger("prometheus-prefect-exporter")

    # Configure headers for HTTP requests
    headers = {"accept": "application/json", "Content-Type": "application/json"}
    if api_auth_string:
        api_auth_string_encoded = base64.b64encode(api_auth_string.encode("utf-8")).decode("utf-8")
        headers["Authorization"] = f"Basic {api_auth_string_encoded}"
        logger.info("Added Basic Authorization header for PREFECT_API_AUTH_STRING")
    if api_key:
        headers["Authorization"] = f"Bearer {api_key}"
        logger.info("Added Bearer Authorization header for PREFECT_API_KEY")

    # check endpoint
    PrefectHealthz(
        url=url, headers=headers, max_retries=max_retries, logger=logger
    ).get_health_check()

    ##
    # NOTIFY IF PAGINATION IS ENABLED
    #
    enable_pagination = str(os.getenv("PAGINATION_ENABLED", "True")) == "True"
    pagination_limit = int(os.getenv("PAGINATION_LIMIT", 200))
    if enable_pagination:
        logger.info("Pagination is enabled")
        logger.info(f"Pagination limit is {pagination_limit}")
    else:
        logger.info("Pagination is disabled")

    # Create an instance of the PrefectMetrics class
    metrics = PrefectMetrics(
        url=url,
        headers=headers,
        offset_minutes=offset_minutes,
        max_retries=max_retries,
        client_id=csrf_client_id,
        csrf_enabled=str(os.getenv("PREFECT_CSRF_ENABLED", "False")) == "True",
        logger=logger,
        # Enable pagination if not specified to avoid breaking existing deployments
        enable_pagination=enable_pagination,
        pagination_limit=pagination_limit,
    )

    # Register the metrics with Prometheus
    logger.info("Initializing metrics...")
    REGISTRY.register(metrics)

    # Start the HTTP server to expose Prometheus metrics
    start_http_server(metrics_port, metrics_addr)
    logger.info(f"Exporter listening on {metrics_addr}:{metrics_port}")

    # Run the loop to collect Prefect metrics
    while True:
        time.sleep(scrape_interval_seconds)


if __name__ == "__main__":
    metrics()
Marvin
07/01/2025, 4:05 PMMarvin
07/01/2025, 4:05 PM
This exporter is built on the prometheus_client library. Let me break down the important parts:
1. First, look at this critical line:
python
REGISTRY.register(metrics)
Here, metrics
is an instance of PrefectMetrics
which must implement Prometheus's collector interface.
2. Then there's:
python
start_http_server(metrics_port, metrics_addr)
This starts a web server that Prometheus will scrape.
The crucial part that isn't obvious from this code: The while True
loop isn't actually responsible for metric collection!
Here's what really happens:
1. Server Setup:
python
# This starts a web server in a separate thread
start_http_server(metrics_port, metrics_addr)
2. When Prometheus Scrapes:
- Prometheus makes an HTTP request to your exporter
- The prometheus_client
library automatically calls your collector's collect()
method
- This happens regardless of the while True
loop
3. The while True
loop:
python
while True:
    time.sleep(scrape_interval_seconds)
- This loop does nothing except keep the main program running
- It's essentially just preventing the program from exiting
- Could be replaced with something like signal.pause()
A more accurate sequence:
1. Program starts
2. Metrics collector is registered
3. HTTP server starts in background thread
4. Main thread enters infinite sleep loop
When Prometheus scrapes:
→ HTTP request comes to background server
→ Server triggers collect() method
→ Fresh metrics are gathered and returned
(All this happens independently of the sleep loop!)
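To see that on-demand model in isolation, here is a minimal, self-contained collector using prometheus_client (an illustrative sketch, not the exporter's PrefectMetrics class): collect() runs every time the /metrics endpoint is scraped, and the final loop only keeps the process alive.
python
import time

from prometheus_client import start_http_server
from prometheus_client.core import GaugeMetricFamily, REGISTRY

class DemoCollector:
    def collect(self):
        # Called by prometheus_client on every scrape of the /metrics endpoint.
        gauge = GaugeMetricFamily(
            "demo_scrape_unixtime",
            "Unix time at the moment this scrape was served",
        )
        gauge.add_metric([], time.time())
        yield gauge

REGISTRY.register(DemoCollector())
start_http_server(8000)

# Keep the process alive; metric values are computed per scrape, not in this loop.
while True:
    time.sleep(60)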
The scrape_interval_seconds
in this code isn't actually controlling when metrics are collected - that's controlled by:
1. Your Prometheus server's configuration
2. When Prometheus decides to scrape
3. The collect()
method in the PrefectMetrics
class
This means:
- The sleep loop is just keeping the program alive
- Actual metric collection happens on-demand when Prometheus scrapes
- The SCRAPE_INTERVAL_SECONDS
in this code isn't really doing anything useful
A more proper implementation might look like:
python
import signal

def metrics():
    # ... same setup code ...

    # Start the HTTP server
    start_http_server(metrics_port, metrics_addr)
    logger.info(f"Exporter listening on {metrics_addr}:{metrics_port}")

    # Wait for signals (more proper than while True/sleep)
    signal.pause()
This explains why the current implementation can lead to concurrent scrapes - the while True
loop isn't actually controlling when metrics are collected, it's just keeping the program running while the real work happens in response to HTTP requests from Prometheus.
The actual scrape interval should be configured in your Prometheus server's configuration, not in this exporter.Daniel Chan
07/01/2025, 5:21 PMSCRAPE_INTERVAL_SECONDS
is actually being used? I think there was some PR that removed something called polling_interval_seconds
from PrefectMetrics here: https://github.com/PrefectHQ/prometheus-prefect-exporter/commit/7a3d105235c9ba6020df8f111bc2cf88905a7ffa
And then a subsequent PR adds SCRAPE_INTERVAL_SECONDS here: https://github.com/PrefectHQ/prometheus-prefect-exporter/pull/57/files
Marvin
07/01/2025, 5:21 PMMarvin
07/01/2025, 5:22 PM
Here's a draft issue you could submit. Title: Clarify purpose of SCRAPE_INTERVAL_SECONDS and its relationship to Prometheus scraping
Body:
## Background
I noticed some historical changes regarding scrape intervals in the exporter:
1. Commit [7a3d105](https://github.com/PrefectHQ/prometheus-prefect-exporter/commit/7a3d105235c9ba6020df8f111bc2cf88905a7ffa) removed `polling_interval_seconds` from PrefectMetrics
2. PR [#57](https://github.com/PrefectHQ/prometheus-prefect-exporter/pull/57) later added `SCRAPE_INTERVAL_SECONDS`
## Current Implementation
The current code in main.py uses `SCRAPE_INTERVAL_SECONDS` in a while True loop:
```python
while True:
    time.sleep(scrape_interval_seconds)
```
## Questions/Concerns
1. Purpose Unclear: The current implementation seems to just keep the program alive with a sleep loop, but doesn't appear to actually control metric collection timing since:
- Metric collection should happen when Prometheus scrapes the HTTP endpoint
- The actual collection timing should be controlled by Prometheus server's configuration
- The sleep loop doesn't prevent concurrent scrapes
2. Historical Context:
- What was the rationale for removing polling_interval_seconds
?
- Why was SCRAPE_INTERVAL_SECONDS
added later?
- Are these two parameters related or serving different purposes?
3. Documentation Needed:
- What is the intended purpose of SCRAPE_INTERVAL_SECONDS
?
- How should users configure this in relation to their Prometheus scrape interval?
- Does this parameter actually affect metric collection timing?
## Impact
This lack of clarity could lead to:
- Misconfigurations where users think they're controlling scrape timing
- Potential resource issues if scrape intervals are set too low
- Confusion about how to properly configure the exporter with Prometheus
## Proposed Solutions
1. Either:
- Document the actual purpose of SCRAPE_INTERVAL_SECONDS
- Remove it if it's not serving a useful purpose
- Replace the while/sleep loop with a more appropriate mechanism to keep the program alive
2. Consider adding documentation about:
- How metric collection is actually triggered
- Best practices for configuring scrape intervals
- The relationship between this setting and Prometheus server configuration
Would appreciate any clarification on these points to help users better understand and configure the exporter correctly.
This issue:
1. Provides clear historical context
2. Identifies specific concerns
3. Suggests concrete improvements
4. Focuses on user impact and confusion
5. Proposes specific solutions
Would you like me to modify any part of this draft before you submit it?Daniel Chan
07/01/2025, 5:53 PMMarvin
07/01/2025, 5:53 PMMarvin
07/01/2025, 5:54 PM
There are a few reasons why prefect_info_flow_runs might not show end_time for completed flow runs:
1. Race Condition in Metric Collection:
- If the Prometheus scrape happens right as the flow is completing
- The state might be "Completed" but the end_time hasn't been fully processed yet
2. Flow Run State Transitions:
- End times are recorded when flows reach their final state
- If a flow terminates abnormally or the state transition isn't properly recorded, the end_time might not be set
3. Metric Collection Timing:
- If your OFFSET_MINUTES
is too small, you might catch flows in transition
- If your Prometheus scrape interval is too frequent, you might see inconsistent states
To troubleshoot this:
1. Check Flow Run Details:
python
from prefect import get_client

async with get_client() as client:
    flow_run = await client.read_flow_run(flow_run_id)
    print(f"State: {flow_run.state.type}")
    print(f"End Time: {flow_run.end_time}")
2. Adjust Collection Parameters:
- Increase OFFSET_MINUTES
slightly (e.g., to 2-3 minutes)
- Ensure your Prometheus scrape interval isn't too aggressive
3. Verify Flow Completion:
- Check the Prefect UI for the specific flow runs
- Confirm they show as properly completed with end times
- Compare this with what's showing in Prometheus metrics
Would you be able to:
1. Check if the end times are visible in the Prefect UI for these flows?
2. Share what values you're using for OFFSET_MINUTES
and Prometheus scrape interval?
3. Confirm if this happens for all completed flows or just some of them?
This would help narrow down whether it's a metric collection issue or a flow run state recording issue.Daniel Chan
07/01/2025, 5:55 PMMarvin
07/01/2025, 5:55 PMMarvin
07/01/2025, 5:57 PM
Given your configuration (SCRAPE_INTERVAL_SECONDS=5 and OFFSET_MINUTES=1) and the inconsistent end_time behavior, I can now explain what's likely happening:
1. Timing Window Issue:
[now - 1 minute] <-------- 1 minute window --------> [now]
^ ^ ^
Window start Flow completes Scrape time
- With a 1-minute window, flows that completed 61+ seconds ago drop out
- Flows that just completed might not have their end_time fully recorded yet
2. Race Condition Scenario:
Time 0s: Flow reaches "Completed" state
Time 0.1s: State recorded
Time 0.2s: Prometheus scrapes (sees Completed but no end_time yet)
Time 0.3s: end_time recorded
Time 5s: Next scrape (might miss it if flow is now outside window)
Recommendations:
1. Adjust Your Configuration:
OFFSET_MINUTES=2
SCRAPE_INTERVAL_SECONDS=15
This gives:
- Longer window to catch state transitions
- More time between scrapes for states to settle
- Better chance of capturing complete metadata
2. Alternative Configuration:
OFFSET_MINUTES=1
SCRAPE_INTERVAL_SECONDS=30
This maintains your shorter window but:
- Reduces load on your Prefect API
- Gives more time for state transitions to complete
- Still provides good metric resolution
The issue occurs because:
1. State transitions happen in multiple steps
2. Your 5-second scrape interval might catch flows mid-transition
3. The 1-minute window might be too tight for catching all metadata
Think of it like taking a photo of a moving object:
- Too fast a shutter speed (short scrape interval) = might catch partial information
- Too narrow a frame (small offset) = might miss the complete picture
Would you be willing to try either:
1. Increasing OFFSET_MINUTES to 2, or
2. Increasing SCRAPE_INTERVAL_SECONDS to 15 or 30
This should make the end_time reporting more consistent while still maintaining reasonable metric freshness.Daniel Chan
07/01/2025, 5:58 PMMarvin
07/01/2025, 5:58 PMMarvin
07/01/2025, 5:58 PM
I need to correct what I said about SCRAPE_INTERVAL_SECONDS in the prometheus-prefect-exporter.
Looking at the code you shared earlier:
python
# From main.py
while True:
    time.sleep(scrape_interval_seconds)
This SCRAPE_INTERVAL_SECONDS
parameter isn't actually related to when Prometheus scrapes your metrics. Instead, it's just keeping the main program alive with a sleep loop. It's not doing anything useful! The actual scraping is controlled by your Prometheus server's configuration.
Let me clarify how the components work together:
1. Prometheus Server (configured in prometheus.yml)
yaml
scrape_configs:
  - job_name: 'prefect'
    scrape_interval: 15s # THIS controls when Prometheus collects metrics
    static_configs:
      - targets: ['localhost:8000']
2. Prometheus Prefect Exporter
- `OFFSET_MINUTES=10`: Looks back 10 minutes for flow runs
- `SCRAPE_INTERVAL_SECONDS`: Doesn't affect metric collection at all!
When Prometheus scrapes the exporter:
1. Prometheus makes HTTP request to exporter
2. Exporter checks flow runs from [now - OFFSET_MINUTES]
to [now]
3. Returns current metrics
So you're right about OFFSET_MINUTES
! Setting it higher (like 10) would:
- Give a larger window to catch all flow states
- Ensure you don't miss any flow runs
- Allow more time for end_times to be recorded
But SCRAPE_INTERVAL_SECONDS
in the exporter is essentially a no-op - it's not controlling anything meaningful. The actual scrape timing is controlled by your Prometheus server configuration.
I apologize for my previous responses that implied SCRAPE_INTERVAL_SECONDS
was important - I was incorrect. This parameter could probably be removed from the exporter entirely as it's not serving any real purpose.
Would you like help drafting an issue about this? It seems like this parameter might be confusing users and should either be:
1. Removed entirely
2. Renamed to clarify its (lack of) purpose
3. Documented to explain it doesn't affect metric collectionDaniel Chan
07/01/2025, 6:09 PMMarvin
07/01/2025, 6:09 PMMarvin
07/01/2025, 6:10 PM
The main cost of a larger window is the volume of data the exporter has to pull on every collection. For example, with OFFSET_MINUTES=60:
- 1 flow per minute = ~60 flows per query
- Each flow has state, metadata, etc.
OFFSET_MINUTES=10:
- 1 flow per minute = ~10 flows per query
- 6x less data per query
However, the disadvantages are often not as severe as they might seem because:
1. Prefect's API is optimized for these queries
2. The data volume for flow runs is usually manageable
3. Database indexes make historical queries efficient
4. Memory usage is typically not significant unless you have thousands of flows per hour
Recommendation:
If OFFSET_MINUTES=10 is working better than smaller values:
1. Stick with 10 if:
- You're seeing complete data
- API performance is good
- Most flows complete within 10 minutes
2. Consider increasing to 15-30 if:
- You have longer-running flows
- You still occasionally miss end_times
- Server resources aren't constrained
3. Go to 60 only if:
- You have very long-running flows
- Server resources are abundant
- You need guaranteed capture of all flow states
The main consideration isn't really about what's detrimental, but rather what's necessary for your use case. If 10 minutes captures all your flow run data reliably, there's no need to go higher. If you do need to go higher, the performance impact is usually acceptable unless you're dealing with an extremely high volume of flows.
Would you like to:
1. Share how many flows you typically run per hour?
2. Indicate if you have any very long-running flows?
This would help fine-tune the OFFSET_MINUTES recommendation for your specific case.Daniel Chan
07/01/2025, 6:12 PMprometheus/prefect_info_flow_runs/gauge
? I saw some entries about 10 minutes back, and they have now disappeared.Marvin
07/01/2025, 6:12 PMMarvin
07/01/2025, 6:13 PM
What you're seeing in prometheus/prefect_info_flow_runs/gauge is likely due to how the prometheus-prefect-exporter works with Prometheus gauges. Here's what's happening:
1. Time Window Behavior:
[now - OFFSET_MINUTES] <-------- window --------> [now]
|
Only flows in this
window are reported
- The exporter only reports flows within the OFFSET_MINUTES window
- As time moves forward, flows "fall out" of this window
- When a flow falls out of the window, its gauge entry disappears
2. Example Timeline:
10:00 AM: Flow A completes
10:05 AM: You see Flow A in metrics (within window)
10:11 AM: Flow A disappears (outside window)
3. Why This Happens:
- Prometheus gauges show current state
- The exporter only reports flows within its time window
- Once a flow is outside the window, it's no longer reported
- Prometheus then shows the gauge as "missing" for that flow
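To reason about whether a particular flow run should still be visible, here's a small sketch of the window arithmetic (note: the exporter may filter on a different timestamp field than end_time; this just illustrates the lookback check):
python
from datetime import datetime, timedelta, timezone

OFFSET_MINUTES = 10  # example value; use whatever your exporter is configured with

def in_lookback_window(timestamp_iso: str, offset_minutes: int = OFFSET_MINUTES) -> bool:
    """Return True if an ISO timestamp falls inside [now - offset_minutes, now]."""
    ts = datetime.fromisoformat(timestamp_iso.replace("Z", "+00:00"))
    window_start = datetime.now(timezone.utc) - timedelta(minutes=offset_minutes)
    return ts >= window_start

# Example with an end_time copied from the metrics output earlier in the thread:
print(in_lookback_window("2025-07-01T04:38:08.870199Z"))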
To retain metrics longer, you can:
1. Increase OFFSET_MINUTES:
- Larger window = metrics visible longer
- But remember this affects all queries
2. Use Prometheus Recording Rules:
yaml
rules:
  - record: flow_runs_completed_total
    expr: sum(prefect_info_flow_runs)
- This preserves historical data
3. Configure GCP Metric Explorer:
- Adjust the time range
- Use aggregation rules
- Set appropriate alignment periods
The key thing to understand is that prefect_info_flow_runs
is designed to show the current state within your configured window. It's not meant to be a historical record - for that, you'd want to:
1. Use Prometheus recording rules
2. Configure proper metric storage in GCP
3. Consider using different metric types for historical data
Would you like help with:
1. Setting up recording rules to preserve historical data?
2. Configuring GCP Metrics Explorer for better retention?
3. Understanding what window size would work best for your use case?Daniel Chan
07/01/2025, 7:05 PM
Why do some entries show two rows for our flow named Echo Flow, where other entries of Echo Flow show only one entry? Similarly, our flow named Nucleus Ingestion has one row for some of the entries, but then other entries for completed flow runs don't show any end_time rows.
Marvin
07/01/2025, 7:05 PMMarvin
07/01/2025, 7:06 PM
1. Duplicate entries for "Echo Flow":
Flow Run ID: ABC123
Metrics show:
- Row 1: state=Running, start_time=present, end_time=null
- Row 2: state=Completed, start_time=present, end_time=present
You see duplicates because:
- The exporter creates separate gauge entries for different combinations of labels
- A flow run might have multiple states within your OFFSET_MINUTES window
- The exporter might catch the flow in transition between states
2. Missing end_time for "Nucleus Ingestion":
This inconsistency typically happens when:
Flow Run 1: Clean completion
- State transition: Running → Completed
- end_time properly recorded
Flow Run 2: Irregular completion
- State transition: Running → Completed (but interrupted)
- end_time might not be recorded properly
To troubleshoot this:
1. Check Flow Run States:
python
from prefect import get_client

async with get_client() as client:
    # Get flow runs for your flows
    flow_runs = await client.read_flow_runs(
        flow_name={"any_": ["Echo Flow", "Nucleus Ingestion"]}
    )
    for run in flow_runs:
        print(f"Flow: {run.name}")
        print(f"State: {run.state.type}")
        print(f"End Time: {run.end_time}")
        print("---")
2. Monitor State Transitions:
- Look for flows that might be terminating irregularly
- Check if there are any infrastructure issues causing incomplete state transitions
- Verify if any flows are being manually stopped or killed
3. Recommendations:
a. For duplicate entries:
python
# In your flow code
@flow(
name="Echo Flow",
retries=0, # Prevent automatic retries
persist_result=True # Ensure state transitions are recorded
)
b. For missing end_time:
- Ensure flows have proper error handling
- Add finalizers to guarantee state transitions
- Consider increasing OFFSET_MINUTES to catch delayed state transitions
Would you be able to:
1. Check if the duplicate entries correspond to retried or rerun flows?
2. Verify in the Prefect UI if the flows with missing end_time show as properly completed?
3. Share how these flows are being deployed and executed (e.g., Docker, Kubernetes)?
This information would help pinpoint whether this is a:
- Flow execution issue
- Metric collection issue
- State transition issue
- Or a combination of these factorsDaniel Chan
07/01/2025, 7:07 PMMarvin
07/01/2025, 7:07 PMMarvin
07/01/2025, 7:08 PM
Here's a quick script you could run to check the recent runs of a flow directly against the API:
python
from datetime import datetime, timezone, timedelta

from prefect import get_client

async def check_flow_run_times(flow_name):
    async with get_client() as client:
        # Get recent flow runs for the specified flow
        flow_runs = await client.read_flow_runs(
            flow_name={"any_": [flow_name]},
            sort="-start_time",
            limit=5
        )
        for run in flow_runs:
            print(f"\nFlow Run ID: {run.id}")
            print(f"State: {run.state.type}")
            print(f"Start Time: {run.start_time}")
            print(f"End Time: {run.end_time}")
            if run.end_time and run.start_time:
                duration = run.end_time - run.start_time
                print(f"Duration: {duration}")
This would help us compare:
1. What the UI shows
2. What the API returns
3. What Prometheus is capturing
Would you like me to provide a more detailed script to investigate specific flow runs that are showing inconsistent end_time reporting in your metrics? This could help us understand why the prometheus-prefect-exporter isn't consistently capturing the end_time even though it's present in the UI.Daniel Chan
07/01/2025, 7:09 PMMarvin
07/01/2025, 7:09 PMMarvin
07/01/2025, 7:09 PM
You're right - the UI shows a duration rather than the raw end_time, but the API still stores both timestamps:
python
from prefect import get_client

async def show_actual_times(flow_run_id):
    async with get_client() as client:
        run = await client.read_flow_run(flow_run_id)
        print(f"Start Time: {run.start_time}")
        print(f"End Time: {run.end_time}")  # This exists in the API even though UI shows duration
        print(f"Duration shown in UI: {run.end_time - run.start_time if run.end_time and run.start_time else 'N/A'}")
This means that when you see inconsistencies in the prometheus-prefect-exporter's end_time reporting, you can't directly verify against the UI. Instead, you'd need to:
1. Get the flow run ID from the UI
2. Use the API to check the actual end_time
3. Compare that with what's showing in your Prometheus metrics
I apologize for my earlier confusion about the UI display. Thank you for the correction - it helps maintain accuracy in our troubleshooting process.
Daniel Chan
07/01/2025, 7:15 PM
curl -s http://prefect.com/api/flow_runs/5db158c6-cc71-442b-8645-ba8e5b81a69c | jq '{id, name, start_time, end_time}'
{
"id": "5db158c6-cc71-442b-8645-ba8e5b81a69c",
"name": "tangerine-anaconda",
"start_time": "2025-07-01T19:01:46.665632Z",
"end_time": "2025-07-01T19:03:04.016106Z"
}
Marvin
07/01/2025, 7:15 PMMarvin
07/01/2025, 7:16 PM
Thanks - that API response shows both timestamps are recorded:
json
{
"start_time": "2025-07-01T19:01:46.665632Z",
"end_time": "2025-07-01T19:03:04.016106Z"
}
This flow run:
- Started at 19:01:46
- Ended at 19:03:04
- Total duration: ~1 minute 18 seconds
Let's check a few things:
1. Time Window Check:
Flow end time: 19:03:04
Current time: [your current time]
OFFSET_MINUTES window: [current time - OFFSET_MINUTES] to [current time]
- Is this flow run within your OFFSET_MINUTES window?
- Remember: if it's outside the window, it won't be reported
2. Metric Labels:
The exporter should create metrics with labels like:
prefect_info_flow_runs{
flow_name="tangerine-anaconda",
flow_run_id="5db158c6-cc71-442b-8645-ba8e5b81a69c",
state="completed"
}
Can you:
1. Share what OFFSET_MINUTES you're currently using?
2. Run this curl command to see what the exporter is returning:
bash
curl -s http://localhost:8000/metrics | grep prefect_info_flow_runs
(adjust the port if different)
This will help us determine if:
- The flow run is within your time window
- The exporter is seeing the flow run but not the end_time
- Or if the flow run isn't being captured at all
Also, since you have the flow run ID, you could check the full state information:
bash
curl -s http://prefect.com/api/flow_runs/5db158c6-cc71-442b-8645-ba8e5b81a69c/states | jq '.'
This would show us if there's anything unusual about the state transitions that might affect how the exporter processes this flow run.
The fact that the API shows a clear end_time but the exporter doesn't reflect it suggests either:
1. A time window issue (flow run outside OFFSET_MINUTES)
2. A state processing issue in the exporter
3. Or a metric label matching issue
Can you share this additional information so we can pinpoint the exact cause?
Daniel Chan
07/01/2025, 7:25 PM
curl -s http://prefect.com/api/flow_runs/5db158c6-cc71-442b-8645-ba8e5b81a69c | jq '.'
{
"id": "5db158c6-cc71-442b-8645-ba8e5b81a69c",
"created": "2025-07-01T19:01:21.432118Z",
"updated": "2025-07-01T19:03:04.022184Z",
"name": "tangerine-anaconda",
"flow_id": "14f42903-d3e9-4ffb-bab0-0d14d5af08bb",
"state_id": "d65d295c-ee16-43ca-bc05-a9c8859e0c68",
"deployment_id": "25a4caac-2326-4c53-9037-4ce3fbd67427",
"deployment_version": "d77d3af200a17e4a45e96437d10aeb13b08431a7",
"work_queue_id": "ad3510a3-48e9-4880-949a-a2f15e11cb27",
"work_queue_name": "sideflows",
"flow_version": "44982b98c486777348b5c531ef5be081",
"idempotency_key": "ee0354bc-104c-4b07-9a49-b244d7a962cc-ZIP_PENDING",
"context": {},
"empirical_policy": {
"max_retries": 0,
"retry_delay_seconds": 0.0,
"retries": 0,
"retry_delay": 0,
"pause_keys": [],
"resuming": false,
"retry_type": null
},
"parent_task_run_id": null,
"state_type": "COMPLETED",
"state_name": "Completed",
"run_count": 1,
"expected_start_time": "2025-07-01T19:01:21.336739Z",
"next_scheduled_start_time": null,
"start_time": "2025-07-01T19:01:46.665632Z",
"end_time": "2025-07-01T19:03:04.016106Z",
"total_run_time": 77.350474,
"estimated_run_time": 77.350474,
"estimated_start_time_delta": 25.328893,
"auto_scheduled": false,
"infrastructure_document_id": null,
"infrastructure_pid": "3fc64242-17a3-4a4f-8c66-5ff7839807c6:prefect-jobs:tangerine-anaconda-kq2mh",
"created_by": null,
"work_pool_id": "72f1d141-0944-49fa-b651-b2c95b4585fe",
"work_pool_name": "queue-based-k8s-pool",
"state": {
"id": "d65d295c-ee16-43ca-bc05-a9c8859e0c68",
"type": "COMPLETED",
"name": "Completed",
"timestamp": "2025-07-01T19:03:04.016106Z",
"message": null,
"data": null,
"state_details": {
"flow_run_id": "5db158c6-cc71-442b-8645-ba8e5b81a69c",
"task_run_id": null,
"child_flow_run_id": null,
"scheduled_time": null,
"cache_key": null,
"cache_expiration": null,
"deferred": null,
"untrackable_result": true,
"pause_timeout": null,
"pause_reschedule": false,
"pause_key": null,
"run_input_keyset": null,
"refresh_cache": null,
"retriable": null,
"transition_id": "88f3a170-4d10-43c3-987c-cabdbaa9d965",
"task_parameters_id": null,
"traceparent": null
}
}
}
Marvin
07/01/2025, 7:25 PMMarvin
07/01/2025, 7:25 PM
The key fields from that record:
json
{
"state_type": "COMPLETED",
"start_time": "2025-07-01T19:01:46.665632Z",
"end_time": "2025-07-01T19:03:04.016106Z",
"state": {
"type": "COMPLETED",
"timestamp": "2025-07-01T19:03:04.016106Z",
"state_details": {
"untrackable_result": true,
"transition_id": "88f3a170-4d10-43c3-987c-cabdbaa9d965"
}
}
}
A few things stand out:
1. The flow is properly completed with matching timestamps:
- end_time
matches the state timestamp
- State type is "COMPLETED"
- Has a valid transition_id
2. Interesting flag: "untrackable_result": true
- This might affect how the exporter processes the flow run
- Could be related to why the end_time isn't showing up in metrics
To help diagnose this:
1. Can you share:
bash
# Get the current metrics
curl -s http://localhost:8000/metrics | grep "prefect_info_flow_runs.*5db158c6-cc71-442b-8645-ba8e5b81a69c"
2. And your current configuration:
bash
# Show exporter environment variables
env | grep "OFFSET_MINUTES"
3. Also check if there are any other metrics for this flow run:
bash
curl -s http://localhost:8000/metrics | grep "tangerine-anaconda"
The fact that we see:
- A proper COMPLETED state
- Matching timestamps
- But untrackable_result: true
Suggests we might want to:
1. Check if the exporter is filtering out flows with untrackable results
2. Verify if there are any metrics at all for this flow run
3. Confirm if this is happening for other flows with untrackable_result: true
Can you share:
1. The output from those curl commands?
2. Whether other flow runs with missing end_times also have untrackable_result: true
?
This would help us determine if this is:
- A general issue with the exporter
- Specific to flows with untrackable results
- Or related to how the time window is being calculated
Daniel Chan
07/02/2025, 3:30 PMMarvin
07/02/2025, 3:30 PMMarvin
07/02/2025, 3:31 PM