<@ULVA73B9P> We're using an automation with the "F...
# ask-marvin
t
@Marvin We're using an automation with the "Flow run stays in a state" trigger (e.g., AwaitingRetry, Pending, Running, Late, Scheduled) to send Slack notifications when a flow appears stuck for 2+ hours. However, in many cases, the flow run metadata (`flow_run.name`, `flow.name`, `flow_run.state`, etc.) is missing or blank in the message template. Is this a known limitation of the "stays in state" trigger, or are there recommended ways to ensure reliable access to flow metadata in this context?
m
thought for 31.1 seconds
Based on the research and lack of specific GitHub issues, I can provide some insights and recommendations, though this seems to be a nuanced issue that might need further investigation:
1. Current Understanding:
   - The "Flow run stays in a state" trigger should theoretically have access to flow run metadata through Jinja templating
   - The behavior you're describing (missing metadata) isn't explicitly documented as a known limitation, but it's a significant concern
2. Potential Workarounds:
   a. Use Alternative Event Triggers:
      - Instead of "Flow run stays in a state", you might want to try using the "Flow run state changes" trigger
      - This trigger type often has more reliable access to flow run metadata as it's capturing an actual state change event
   b. Robust Template Variables:
      - Use defensive templating with default values:
jinja
      Flow: {{ flow.name | default('Unknown Flow') }}
      Run: {{ flow_run.name | default('Unknown Run') }}
      State: {{ flow_run.state.name | default('Unknown State') }}
   c. Custom Block Development:
      - If you need more reliable metadata access, consider developing a custom notification block that can query the Prefect API directly for flow run details
3. Recommended Next Steps:
   - Could you share an example of your current template configuration? This would help diagnose if there's a specific pattern causing the issue
   - Try logging the full context available in your automation (you can temporarily send all available variables to a debug notification to see what's accessible)
   - Consider opening a GitHub issue with specific examples, as this behavior should be documented if it's a limitation
4. Best Practices:
   - Always include error handling and default values in your templates
   - Consider implementing additional metadata in your notification block (like direct links to the Prefect UI for the flow run)
   - Monitor the automation's execution history to identify patterns in when metadata is available versus when it's missing
Would you be able to share:
1. An example of your current template configuration?
2. Whether this happens consistently or only in specific scenarios?
3. Whether you see the same behavior with other trigger types?
This would help us provide more targeted recommendations and potentially identify if this is a bug that should be reported to the Prefect team.
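A note on the defensive-templating suggestion: Jinja's built-in `default` filter only substitutes for undefined values unless its second argument is true (for example `default('Unknown', true)`). A field that is defined but `None` or empty would therefore pass through unchanged, which may be what the `Flow run URL: None` output later in this thread reflects. The sketch below shows the stricter fallback in plain Python. The helper name `or_default` and the payload keys are hypothetical, chosen only for illustration, and are not part of Prefect.

```python
from typing import Any


def or_default(value: Any, fallback: str) -> str:
    """Return fallback when value is None or blank, otherwise str(value)."""
    if value is None:
        return fallback
    text = str(value).strip()
    return text if text else fallback


# Hypothetical payload mirroring the blank fields described in this thread.
payload = {"flow_name": "", "flow_run_name": None, "flow_run_id": "abc123"}

message = "Flow run {}/{} (ID: {})".format(
    or_default(payload.get("flow_name"), "Unknown flow"),
    or_default(payload.get("flow_run_name"), "Unknown run"),
    or_default(payload.get("flow_run_id"), "Not available"),
)
print(message)
```

Treating empty strings and `None` the same as "missing" is a design choice; it keeps the alert readable whichever way the blank value arrives.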
t
@Marvin Current automation setup:
Trigger type: Flow run state
Flow run condition: Stays in statuses AwaitingRetry, Running, Pending, Late for over 2 hours
Email template:
Flow run {{ flow.name }}/{{ flow_run.name }} observed in state `{{ flow_run.state.name }}` at {{ flow_run.state.timestamp }}.
Flow ID: {{ flow_run.flow_id }}
Flow run ID: {{ flow_run.id }}
Flow run URL: {{ flow_run|ui_url }}
Flow tags: {{ flow_run.tags }}
State message: {{ flow_run.state.message }}
Most of the time we get the following when condition is met:
Flow run / observed in state `` at .
Flow ID:
Flow run ID:
Flow run URL: None
Flow tags:
State message:
m
thought for 64.9 seconds
Based on your template and the research, I can now provide a more specific analysis and recommendation:
1. Identified Issue:
   - The behavior you're seeing (all blank metadata) suggests that the context being passed to the template when the duration-based trigger fires isn't properly populated
   - This appears to be different from regular state change triggers, which typically have full access to flow run metadata
2. Recommended Alternative Approach: Instead of using the "stays in state" trigger, I recommend restructuring your automation to use a combination of:
   a. State Change Trigger with:
      - Flow run state changes to: AwaitingRetry, Running, Pending, Late
      - Additional Conditions using the `state_timestamp` field to check duration
   Example condition logic:
python
   # In the automation conditions
   from datetime import datetime, timezone
   
   # Check if the state duration exceeds 2 hours
   state_age = datetime.now(timezone.utc) - state_timestamp
   return state_age.total_seconds() > 7200  # 2 hours in seconds
   This approach should provide more reliable access to flow run metadata because it's triggered by actual state changes rather than the duration-based trigger.
3. Alternative Template Structure: If you need to continue using the current trigger type, make your template more resilient:
Flow run {{ flow.name | default('Unknown') }}/{{ flow_run.name | default('Unknown') }} 
   Current state: {{ flow_run.state.name | default('Unknown') }}
   State timestamp: {{ flow_run.state.timestamp | default('Unknown') }}
   
   Flow ID: {{ flow_run.flow_id | default('Not available') }}
   Flow run ID: {{ flow_run.id | default('Not available') }}
   Flow run URL: {{ flow_run|ui_url | default('Not available') }}
   Flow tags: {{ flow_run.tags | default([]) }}
   State message: {{ flow_run.state.message | default('No message available') }}
4. Debugging Steps: To help diagnose the issue:
   - Add a debug notification that dumps all available context:
Raw context: {{ context }}
   - This will help understand what variables are actually available when the trigger fires
5. Next Steps:
   1. Try implementing the state change trigger approach first
   2. If that's not feasible, implement the more resilient template with default values
   3. Consider opening a GitHub issue with Prefect, as this behavior with the "stays in state" trigger should either be documented as a limitation or fixed
Would you like me to provide more specific details about implementing any of these approaches? Also, it would be helpful to know if you've observed any patterns in terms of when the metadata is available versus when it's not (e.g., specific states or durations where it works better)?
t
@Marvin For recommendation 2, how exactly do I apply the `state_timestamp` condition? Should this logic be implemented inside our flow code (as custom state handling), or is there a place within Prefect Automations where I can configure this directly? Another thing I’m wondering: does Prefect guarantee that metadata like `flow_run.name`, `flow.name`, and `flow_run.state` are always available in this setup? I've had some runs (especially in `Late` or `AwaitingRetry`) where metadata was still missing even on state change triggers.
m
thought for 56.5 seconds
Based on the research and your specific questions, I can provide a clearer picture:
1. Regarding State Timestamp Conditions: The condition logic for state duration should be configured directly in the Prefect UI's automation settings. Here's how:
   a. In the Prefect UI, when creating/editing your automation:
      - Set the trigger type to "Flow run state changes"
      - Select the states you want to monitor (AwaitingRetry, Running, Pending, Late)
      - Under "Additional Conditions", you can add a Python expression that evaluates to True/False
   Example condition you can paste directly in the UI:
python
   from datetime import datetime, timezone
   state_age = (datetime.now(timezone.utc) - flow_run.state.timestamp).total_seconds()
   state_age > 7200  # 2 hours
   This should be added in the "Additional Conditions" field of the automation configuration, not in your flow code.
2. Regarding Metadata Availability: After further research, there are some important nuances about metadata availability:
   - Core Metadata Guarantees:
      - `flow_run.id` and `flow_run.state.type` should always be available
      - These are fundamental to Prefect's state tracking system
   - Potentially Unreliable Metadata:
      - `flow_run.name` and `flow.name` might indeed be missing in certain scenarios
      - This is particularly true for `Late` states, which are inferred states rather than explicit transitions
      - `AwaitingRetry` states might have incomplete metadata if the failure that triggered the retry affected metadata collection
Recommended Template Structure: Given these findings, here's a more robust template that handles potential missing metadata:
{% set run_name = flow_run.name if flow_run.name else flow_run.id %}
   {% set flow_name = flow.name if flow.name else 'Unknown Flow' %}
   {% set state_name = flow_run.state.name if flow_run.state else 'Unknown State' %}
   
   Alert: Flow Run Issue Detected
   
   Flow: {{ flow_name }}
   Run: {{ run_name }}
   State: {{ state_name }}
   Duration in current state: {{ ((now() - flow_run.state.timestamp).total_seconds() / 3600) | round(2) }} hours
   
   {% if flow_run.state.message %}
   State Message: {{ flow_run.state.message }}
   {% endif %}
   
   {% if flow_run|ui_url %}
   View in UI: {{ flow_run|ui_url }}
   {% endif %}
   
   Flow Run ID: {{ flow_run.id }}  {# This should always be available #}
3. Best Practices for Reliable Monitoring:
   - Always include the `flow_run.id` in your notifications as it's the most reliable identifier
   - Use defensive templating with default values for all metadata fields
   - Consider implementing multiple automation triggers with different conditions to catch edge cases
   - Log all available context variables during testing to understand what's consistently available in your specific setup
Would you like me to provide more specific examples of condition configurations or alternative template structures? Also, it would be helpful to know if you're seeing any patterns in when the metadata is missing - for example, is it more common with specific deployment types or execution environments?
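The template above computes the time in state without checking that a state timestamp exists. As a rough illustration of the same fallback chain with that guard added, here is a plain-Python sketch. It assumes the run is available as a dictionary; the keys `id`, `name`, `flow_name`, `state_name`, `state_timestamp`, and `state_message` are made up for this example and do not come from Prefect.

```python
from datetime import datetime, timezone
from typing import Optional


def hours_in_state(state_timestamp: Optional[datetime], now: datetime) -> Optional[float]:
    """Hours elapsed since the state began, or None if the timestamp is missing."""
    if state_timestamp is None:
        return None
    return round((now - state_timestamp).total_seconds() / 3600, 2)


def build_alert(run: dict, now: datetime) -> str:
    """Build alert text, skipping optional lines whose data is missing."""
    lines = [
        "Alert: Flow Run Issue Detected",
        f"Flow: {run.get('flow_name') or 'Unknown Flow'}",
        # Fall back to the run ID when the run name is blank.
        f"Run: {run.get('name') or run.get('id')}",
        f"State: {run.get('state_name') or 'Unknown State'}",
    ]
    hours = hours_in_state(run.get("state_timestamp"), now)
    if hours is not None:
        lines.append(f"Duration in current state: {hours} hours")
    if run.get("state_message"):
        lines.append(f"State Message: {run['state_message']}")
    lines.append(f"Flow Run ID: {run.get('id')}")
    return "\n".join(lines)


# Hypothetical run record with a blank flow name and a missing run name.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
run = {
    "id": "abc123",
    "name": None,
    "flow_name": "",
    "state_name": "Late",
    "state_timestamp": datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc),
}
alert = build_alert(run, now)
print(alert)
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the duration arithmetic easy to check by hand.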
t
@Marvin I don't see a place where I can set conditions after setting trigger type to Flow run state changes, no "Additional Conditions" option. Is that on a separate plan?
m
thought for 48.6 seconds
I apologize for my earlier confusion. I need to correct my previous response: I was incorrect about the availability of an "Additional Conditions" field in the UI. The current automation system in Prefect is more streamlined, and I apologize for suggesting a feature that isn't actually available. Given this limitation, here are the actual options available to you:
1. Current Available Options:
   - You can use the "Flow run stays in state" trigger (your current approach)
   - Or use the "Flow run state changes" trigger to catch state transitions
2. Alternative Approach: Since we can't add custom duration conditions directly in the automation UI, here are some alternative solutions:
   a. Multiple Automations:
      - One automation using "Flow run state changes" to catch immediate state transitions
      - Keep your existing "Flow run stays in state" automation for duration-based alerts
      This way, you at least get notifications with complete metadata for the state changes, while still maintaining your duration-based alerts.
   b. Custom Flow Run Monitoring:
      - Create a separate monitoring flow that periodically checks for long-running flows
      - This flow can use the Prefect API to query flow runs and send notifications
      Example monitoring flow:
python
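from datetime import datetime, timedelta, timezone

from prefect import flow
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateName,
)


async def send_alert(run):
    # Placeholder helper (not a Prefect API): swap in your Slack or email call
    print(f"Flow run {run.name} ({run.id}) stuck in {run.state.name}")


@flow
async def monitor_long_running_flows():
    async with get_client() as client:
        # Filter by state name: AwaitingRetry and Late are state names, not state types
        flow_runs = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    name=FlowRunFilterStateName(
                        any_=["Running", "Pending", "AwaitingRetry", "Late"]
                    )
                )
            )
        )

        # Check how long each run has been in its current state
        for run in flow_runs:
            if run.state and run.state.timestamp:
                duration = datetime.now(timezone.utc) - run.state.timestamp
                if duration > timedelta(hours=2):
                    # Send notification with complete metadata
                    await send_alert(run)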
3. Request for Enhancement:
   - This would be a valuable feature request for Prefect
   - Consider opening a GitHub issue requesting the ability to add custom conditions to automation triggers
   - You can reference this discussion and the use case of wanting to combine state conditions with duration checks
Would you like me to:
1. Provide more details about implementing the monitoring flow approach?
2. Help you draft a feature request for custom trigger conditions?
3. Explore other potential workarounds for your specific use case?
Also, regarding your second question about metadata availability - the inconsistent metadata availability you're experiencing (especially with `Late` and `AwaitingRetry` states) appears to be a legitimate issue that should probably be reported to Prefect. Would you like help documenting this behavior for a bug report?
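The selection step of the monitoring approach (a watched state held for longer than a threshold) can be separated from the API call and checked on its own. The following is a minimal sketch under assumptions: runs are represented as plain dictionaries with hypothetical keys `id`, `state_name`, and `state_timestamp`, and `find_stuck_runs` is an illustrative name, not a Prefect function.

```python
from datetime import datetime, timedelta, timezone

# State names taken from the automation described in this thread.
WATCHED_STATES = {"AwaitingRetry", "Running", "Pending", "Late"}


def find_stuck_runs(runs, now, threshold=timedelta(hours=2)):
    """Return runs in a watched state whose state timestamp is older than threshold."""
    stuck = []
    for run in runs:
        timestamp = run.get("state_timestamp")
        if run.get("state_name") not in WATCHED_STATES or timestamp is None:
            continue  # Unwatched state, or no timestamp to measure from.
        if now - timestamp > threshold:
            stuck.append(run)
    return stuck


# Hypothetical records standing in for API results.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    {"id": "a", "state_name": "Running", "state_timestamp": now - timedelta(hours=3)},
    {"id": "b", "state_name": "Running", "state_timestamp": now - timedelta(minutes=30)},
    {"id": "c", "state_name": "Completed", "state_timestamp": now - timedelta(hours=5)},
    {"id": "d", "state_name": "Late", "state_timestamp": None},
]
stuck_ids = [run["id"] for run in find_stuck_runs(runs, now)]
print(stuck_ids)
```

Keeping this logic in a pure function means the two-hour rule can be unit tested without a Prefect server, and the threshold can be changed per call (for example `timedelta(hours=3)`).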
t
@Marvin Observed automation configured to alert after a flow run stays in a status for over 3 hours - most alerts over the weekend came back without any flow metadata. Can you help investigate why these fields (like `flow.name`, `flow_run.name`, etc.) might be missing?
m
thought for 52.0 seconds
Based on the research, I can provide some insights about why you're seeing missing flow metadata in your alerts:
1. Data Retention Policy: Prefect 3.x implements data minimization practices, which means it only retains data as long as necessary. This could explain why metadata might not be available after a certain period, especially over weekends.
2. Runtime Availability: Flow metadata is primarily accessible through the runtime environment. If your alerts are trying to access this information after the flow run has completed or been archived, the metadata might no longer be available.
To help address this issue, here are some recommendations:
1. Capture Metadata Early: If you need to preserve flow metadata for alerts, consider capturing these values early in your automation and storing them as part of your alert payload. You can access this information using the `prefect.runtime` module when the flow is actually running.
2. Check Your Alert Logic: Ensure your alerts are capturing the metadata at the right time. Here's a pattern you might consider:
python
from prefect import flow
from prefect.runtime import flow_run

@flow
def example_flow():
    # Capture metadata early
    metadata = {
        "flow_name": flow_run.flow_name,
        "flow_run_name": flow_run.name,
        # Add other metadata you need
    }
    
    # Store this metadata with your alert logic
    # ... your alert configuration ...
3. Review Retention Settings: If you need to retain flow metadata for longer periods, you might want to implement your own storage solution for this information or adjust your monitoring strategy to capture and store the metadata you need when it's available.
Would you like me to:
1. Research more specific details about data retention periods?
2. Provide more detailed examples of how to properly capture and store flow metadata for alerts?
3. Look into alternative monitoring approaches that might be more resilient to metadata cleanup?
Please let me know which aspect you'd like me to elaborate on, or if you have any additional questions about the behavior you're observing.