# ask-marvin
a
@Marvin The syntax ended up being correct, but what's happening is that the automation is failing with
```
"reason": "Failed to set state: Run is already in terminal state COMPLETED.",
```
We're using the existing set of states for the zombie run cleanup. Do you know why this is happening?
m
thought for 48.9 seconds
This error typically occurs because there's a race condition in your cleanup automation. Here's what's likely happening:
1. Your automation detects a zombie run and attempts to update its state.
2. However, the run has already reached a COMPLETED state through some other means (perhaps the original process recovered and completed, or another cleanup process got there first).
3. Since COMPLETED is a terminal state, Prefect prevents any further state transitions to maintain state integrity.

To fix this, you should:
1. Add state checking before attempting to update the state:
```python
if flow_run.state.is_final():
    logger.info(f"Run {flow_run.id} is already in terminal state {flow_run.state.type}, skipping")
    continue
```
2. Handle the race condition by catching and tolerating the terminal-state error when updating states:
```python
try:
    await flow_run.set_state(crashed_state)
except Exception as e:
    if "already in terminal state" in str(e):
        logger.info(f"Run {flow_run.id} reached terminal state before cleanup")
        continue
    raise
```
This error is actually a good thing - it's Prefect's way of maintaining state consistency. The error indicates that your cleanup automation might be trying to process the same runs multiple times or attempting to modify runs that have already been properly completed. You might also want to adjust your query for zombie runs to explicitly exclude runs that are already in terminal states when initially fetching them for cleanup.
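If that cleanup is driven by a script or flow that uses the Python client (rather than a UI automation), the fetch step could filter at the source, roughly like the sketch below. This is only a sketch under that assumption: it uses the standard client filter schemas (`FlowRunFilter`, `FlowRunFilterState`, `FlowRunFilterStateType`, `StateType`), and the state list is just an example of what you might treat as zombie-eligible.
```python
from prefect.client.orchestration import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType

# Only non-terminal states can be "zombies"; completed/failed/cancelled/crashed
# runs never enter the cleanup loop, so the terminal-state error cannot occur here.
ZOMBIE_ELIGIBLE = [
    StateType.RUNNING,
    StateType.PENDING,
    StateType.PAUSED,
    StateType.CANCELLING,
]

async def fetch_zombie_candidates():
    async with get_client() as client:
        return await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=ZOMBIE_ELIGIBLE)
                )
            ),
            # you would additionally filter on start time / last activity here
            limit=200,
        )
```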
a
@Marvin I'm using a Prefect automation, so I don't think I really can or should be managing race conditions myself. Also, this happens more often than I'd expect if it were just a race condition.
m
thought for 70.7 seconds
Could you share your automation configuration? If you're seeing this more frequently than expected, it might be related to how the automation's triggers and actions are set up. A few things to check:
1. What trigger conditions are you using? (e.g., missing heartbeats, specific time intervals)
2. Are you using the default heartbeat settings or custom ones?
3. What state are you trying to transition these runs to?

The frequency of these errors suggests there might be a mismatch between your trigger conditions and the actual lifecycle of your flows. For instance, if the heartbeat interval is too short relative to your flow's normal processing time, it might incorrectly identify completed flows as zombies.
a
@Marvin I set it up in the UI, so I don't have access to the configuration
m
thought for 60.4 seconds
Since you're using the UI, you can check and potentially adjust a few things:
1. In the UI, go to your automation and check:
   - The trigger conditions (particularly the heartbeat timeout duration)
   - The action configuration (what state it's trying to set)
2. You might want to try increasing the heartbeat timeout duration if it's set too aggressively. The default should be at least 90 seconds (3 missed heartbeats at 30-second intervals).
3. If you're seeing this error frequently, it might also help to look at the specific flow runs where this is happening. Are they particularly short-running flows? Or do they have any common characteristics?

Would you be able to share:
1. How long your flows typically run for?
2. How frequently you're seeing this error?
3. Are there any patterns to when these errors occur (e.g., specific times of day, specific flows)?

This information would help narrow down whether it's an issue with the automation timing or something else in the environment.
a
@Marvin this is the trigger
```json
{
  "type": "event",
  "match": {
    "prefect.resource.id": "prefect.flow-run.*"
  },
  "match_related": {},
  "after": [
    "prefect.flow-run.Resuming",
    "prefect.flow-run.AwaitingConcurrencySlot",
    "prefect.flow-run.AwaitingRetry",
    "prefect.flow-run.Running",
    "prefect.flow-run.Pending",
    "prefect.flow-run.Cancelling",
    "prefect.flow-run.Retrying"
  ],
  "expect": [
    "prefect.flow-run.*"
  ],
  "for_each": [
    "prefect.resource.id"
  ],
  "posture": "Proactive",
  "threshold": 1,
  "within": 10800
}
```
m
thought for 10.8 seconds
Ah, now I see what's happening. This trigger configuration explains the frequency of the "already in terminal state" errors. Your automation is configured to:
1. Watch for any flow run that enters one of these states:
   - Resuming
   - AwaitingConcurrencySlot
   - AwaitingRetry
   - Running
   - Pending
   - Cancelling
   - Retrying
2. Then expect some flow run event (`prefect.flow-run.*`) within 10800 seconds (3 hours)
3. If no event occurs within that window, the automation triggers

The issue is that this configuration will trigger even for flows that complete successfully but do so after the automation checks. Here's the likely sequence:
1. Flow enters one of those states (e.g., "Running")
2. Flow completes successfully (reaches COMPLETED state)
3. But the automation is still watching that window and tries to update the state
4. Since the flow is already COMPLETED, you get the error

To fix this, you should:
1. Either reduce the `within` window (3 hours might be too long for your typical flows)
2. Or modify the `expect` list to specifically look for terminal states:
```json
"expect": [
    "prefect.flow-run.Completed",
    "prefect.flow-run.Failed",
    "prefect.flow-run.Crashed",
    "prefect.flow-run.Cancelled"
]
```
This way, the automation will only trigger if a flow doesn't reach any terminal state within the window, which is a more accurate definition of a zombie run. Would you like me to explain how to modify these settings in the UI, or would you prefer to make these changes directly in the configuration?
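Putting that together with the trigger you pasted, the full definition would be identical apart from the `expect` list:
```json
{
  "type": "event",
  "match": {
    "prefect.resource.id": "prefect.flow-run.*"
  },
  "match_related": {},
  "after": [
    "prefect.flow-run.Resuming",
    "prefect.flow-run.AwaitingConcurrencySlot",
    "prefect.flow-run.AwaitingRetry",
    "prefect.flow-run.Running",
    "prefect.flow-run.Pending",
    "prefect.flow-run.Cancelling",
    "prefect.flow-run.Retrying"
  ],
  "expect": [
    "prefect.flow-run.Completed",
    "prefect.flow-run.Failed",
    "prefect.flow-run.Crashed",
    "prefect.flow-run.Cancelled"
  ],
  "for_each": [
    "prefect.resource.id"
  ],
  "posture": "Proactive",
  "threshold": 1,
  "within": 10800
}
```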
a
@Marvin 3 hours is correct; I'm saying that flows that have been in the zombie states for 3 hours should be cancelled.
m
thought for 41.9 seconds
Given your clarification, the "already in terminal state" errors you're seeing might be happening because:
1. The automation is trying to cancel flows that appear stuck
2. But in some cases, between the time the automation triggers and when it tries to execute the cancellation, the flow has already reached a terminal state (either on its own or through another automation/process)

For this use case, you might want to:
1. Keep your current trigger configuration (the 3-hour window is fine if that's your requirement)
2. Modify the automation's action to use a more graceful cancellation approach that handles already-completed runs, as sketched below

Could you share what action you've configured for the automation? That might help identify if there's a better way to handle the cancellation that won't generate these errors when flows have already completed.

Also, are you using any other automations or processes that might be trying to manage these same flow runs? That could explain why some flows are reaching terminal states before your cancellation automation can act on them.
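If you ever move that cleanup out of the built-in action and into a small cleanup flow of your own, a terminal-state-aware cancellation could look roughly like the sketch below. This is an illustration, not the automation's actual action: it assumes the Prefect Python client (`get_client`, `read_flow_run`, `set_flow_run_state`) and the `StateType` enum, and the helper name is made up. The read-then-set is still technically racy, but a late "already in terminal state" rejection at that point is harmless.
```python
from uuid import UUID

from prefect.client.orchestration import get_client
from prefect.client.schemas.objects import StateType
from prefect.states import Cancelled

# States that count as "finished"; forcing a new state onto these is what
# produces the "already in terminal state" rejection.
TERMINAL_STATES = {
    StateType.COMPLETED,
    StateType.FAILED,
    StateType.CANCELLED,
    StateType.CRASHED,
}

async def cancel_if_still_stuck(flow_run_id: UUID) -> None:
    """Cancel a flow run only if it has not already reached a terminal state."""
    async with get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
        if flow_run.state and flow_run.state.type in TERMINAL_STATES:
            # The run finished on its own between trigger and action; nothing to do.
            return
        await client.set_flow_run_state(flow_run_id, state=Cancelled())
```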