# ask-marvin
j
@Marvin how can I re-run the flows that `Crashed` using automations? Keep in mind that this rerun should use the same parameters as the original flow run.
m
🔄 Researching your answer... this may take a while
🔧 Using tool: `research_prefect_topic`
📊 Tools used: `research_prefect_topic` (1x)
j
@Marvin do you have an answer for the previous question?
m
✅ thought for 9.6 seconds

I don't see your previous question in this thread; could you paste it here? If you can, include:
- Prefect version (are you on 3.x?)
- Where you're running (local, Docker, Vertex AI work pool)
- Error message/traceback or relevant logs
- Minimal code/deployment snippet

If this was about Vertex AI work pools or a Docker architecture mismatch, I can help right away; just share details like the worker image/tag, host/arch, work pool config, and any agent/worker logs.
j
@Marvin Prefect v3, how can I re-run the flows that `Crashed` using automations? Keep in mind that this rerun should use the same parameters as the original flow run. I'm using it with Cloud Run jobs.
m
✅ thought for 655.6 seconds

Yes: use an Automation that listens for crashed flow runs and triggers a "Run a deployment" action that re-launches the same deployment with the original parameters via templating. Here's the pattern (Prefect 3.x):

UI steps:
- Create an Automation
- Trigger: Flow run enters state → Crashed
- Optional filter: limit to specific deployment(s) or tags if you don't want this to be global
- Action: Run a deployment
- For Deployment, click "Use template" and set:
```
{{ flow_run.deployment_id }}
```
- For Parameters, click “Use template” and set:
```
{{ flow_run.parameters }}
```
That creates a brand-new flow run from the same deployment with the exact same parameters as the crashed run.

API payload snippet: if you prefer the API, the important part is templating the `deployment_id` and `parameters`. Prefect 3 automations use Jinja with a wrapper:
```json
{
  "type": "RunDeployment",
  "deployment_id": {
    "__prefect_kind": "jinja",
    "template": "{{ flow_run.deployment_id }}"
  },
  "parameters": {
    "__prefect_kind": "jinja",
    "template": "{{ flow_run.parameters }}"
  }
}
```
Notes and caveats:
- This creates a new flow run; it does not mutate the state of the original crashed run.
- To avoid infinite loops if the rerun also crashes, add a guard. Two common options:
  - Filter so the trigger ignores runs with a tag (e.g., only rerun if the crashed run does not have the tag `auto-rerun`), and add that tag to the new run via the action's tags field.
  - Scope the automation to specific deployments you want to auto-rerun.
- Cloud Run jobs: no special handling needed. The rerun will dispatch to the same work pool and Cloud Run job configuration as the original deployment. Just ensure your work pool has capacity/quotas and the job image/permissions are valid.

Docs:
- Using parameters in Automation templates
- Creating Automations and deployment triggers
- Workers overview (Cloud Run workers included)

Want me to draft a full automation JSON you can POST (with your deployment scope and an optional "only once" guard via tags)? If so, tell me if this should apply to all deployments or a specific set, and whether you want to add a one-time rerun guard.
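For context, a minimal Python sketch of what the "Run a deployment" action amounts to, done manually against a known crashed run. It assumes Prefect 3.x, that `run_deployment` accepts a deployment ID (otherwise pass the `"<flow name>/<deployment name>"` string), and the `crashed_run_id` and `auto-rerun` values are placeholders:

```python
# Illustrative sketch (not the automation itself): re-launch a crashed flow
# run's deployment with the original parameters. Assumes Prefect 3.x.
import asyncio

from prefect.client.orchestration import get_client
from prefect.deployments import run_deployment


async def rerun_crashed(crashed_run_id: str) -> None:
    # crashed_run_id is a placeholder: the UUID of the crashed flow run.
    async with get_client() as client:
        crashed_run = await client.read_flow_run(crashed_run_id)

    # Launch a brand-new run of the same deployment, copying the parameters.
    await run_deployment(
        name=str(crashed_run.deployment_id),
        parameters=crashed_run.parameters,
        tags=["auto-rerun"],  # guard tag so a rerun of a rerun can be skipped
        timeout=0,  # return immediately instead of waiting for the new run to finish
    )


if __name__ == "__main__":
    asyncio.run(rerun_crashed("<crashed-flow-run-uuid>"))
```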
j
@Marvin after the `Run a deployment` action type, there's no way to "use template"; I can just leave `Infer deployment` or pick a specific deployment, and there's no way I can use a template.
m
✅ thought for 186.7 seconds

You're right: the "Run a deployment" action doesn't let you template the deployment field in the UI. For your goal (same parameters as the original run), you have two workable patterns:

Option A (recommended): reschedule the same flow run
This reuses the original run and therefore its parameters automatically.
- Trigger: Flow run → enters state → Crashed
- Action: Change flow run state → Scheduled
- Optional: add a small delay (e.g., 1–5 minutes) to avoid immediate thrashing if infrastructure is still winding down.

This works well with Cloud Run jobs: your work pool will pick the same run up again and launch a new job with the same parameters.

Option B: create a new run of the same deployment with the original parameters
- In the "Run a deployment" action, select "Infer deployment" to target the same deployment as the crashed run.
- Parameters: set them to the original run's parameters via templating. If your UI does not show a template toggle for the Parameters field, you can create/edit the automation via the API with a templated parameters payload, e.g.:
```json
{
  "actions": [
    {
      "type": "RunDeployment",
      "deployment_id": {
        "__prefect_kind": "jinja",
        "template": "{{ flow_run.deployment_id }}"
      },
      "parameters": {
        "__prefect_kind": "jinja",
        "template": "{{ flow_run.parameters }}"
      }
    }
  ],
  "triggers": [
    {
      "type": "EventTrigger",
      "match": {
        "event": "prefect.flow-run.Crashed"
      }
    }
  ],
  "name": "Rerun crashed flows with same params"
}
```
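If the UI won't take the template, one way to create this automation programmatically is to POST a payload like the one above to the Prefect API. A rough sketch, assuming the standard `PREFECT_API_URL`/`PREFECT_API_KEY` environment variables and an `/automations/` endpoint; the exact field names should be verified against your server's automation schema:

```python
# Rough sketch: create the automation via the REST API instead of the UI.
# Assumes PREFECT_API_URL points at your server/workspace API and that the
# payload shape matches what your Prefect version expects.
import os

import httpx

payload = {
    "name": "Rerun crashed flows with same params",
    "triggers": [
        {"type": "EventTrigger", "match": {"event": "prefect.flow-run.Crashed"}}
    ],
    "actions": [
        {
            "type": "RunDeployment",
            "deployment_id": {
                "__prefect_kind": "jinja",
                "template": "{{ flow_run.deployment_id }}",
            },
            "parameters": {
                "__prefect_kind": "jinja",
                "template": "{{ flow_run.parameters }}",
            },
        }
    ],
}

api_url = os.environ["PREFECT_API_URL"].rstrip("/")
api_key = os.environ.get("PREFECT_API_KEY", "")  # only needed for Prefect Cloud
headers = {"Authorization": f"Bearer {api_key}"} if api_key else {}

response = httpx.post(f"{api_url}/automations/", json=payload, headers=headers)
response.raise_for_status()
print(response.json())
```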
Guards to avoid loops:
- If you reschedule the same run (Option A), consider a delay and/or a retry limit strategy (e.g., after N crashes, notify instead of rescheduling).

If you need help setting a guard (tags, a compound/sequence trigger, or a metric-based trigger), tell me your preference and I'll draft it.

Docs:
- Using parameters in Automation templates
- Creating Automations
- Workers overview

Quick check: do you prefer rescheduling the same run (simplest, keeps the same params automatically) or creating a brand-new run that copies params? I can provide a ready-to-POST automation JSON for either.
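For reference, rescheduling the same run manually with the Prefect 3 Python client looks roughly like this; it mirrors what the "Change flow run state → Scheduled" action does (the flow run ID is a placeholder):

```python
# Manual equivalent of the "Change flow run state -> Scheduled" automation action:
# put a crashed flow run back into Scheduled so a worker picks it up again with
# the same parameters. Sketch only; assumes Prefect 3.x.
import asyncio
from datetime import datetime, timedelta, timezone

from prefect.client.orchestration import get_client
from prefect.states import Scheduled


async def reschedule(crashed_run_id: str, delay_seconds: int = 120) -> None:
    async with get_client() as client:
        # Delay the next attempt slightly so Cloud Run infrastructure can wind down.
        scheduled_time = datetime.now(timezone.utc) + timedelta(seconds=delay_seconds)
        result = await client.set_flow_run_state(
            flow_run_id=crashed_run_id,
            state=Scheduled(scheduled_time=scheduled_time),
        )
        print(result.status)


if __name__ == "__main__":
    asyncio.run(reschedule("<crashed-flow-run-uuid>"))
```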
j
@Marvin I prefer rescheduling the same run. How can I avoid running this indefinitely if there's an error that needs manual intervention?
m
✅ thought for 333.0 seconds

Great choice. Here are two reliable ways to cap reschedules when you "reschedule the same run" with Change flow run state → Scheduled:

Option 1: gate by run attempt (simplest if available in your UI)
- Automation A (reschedule, up to N attempts)
  - Trigger: Flow run → enters state → Crashed
  - Condition: Event data filter → `run.run_count < N`
  - Action: Change flow run state → Scheduled (add a short delay, e.g., 120s)
- Automation B (stop and alert after N attempts)
  - Trigger: Flow run → enters state → Crashed
  - Condition: Event data filter → `run.run_count >= N`
  - Action: Send notification (and/or Declare incident)

Notes:
- The crashed event typically includes the run attempt in its event payload (shown in the UI as Attempt #). In the Automation builder, add an "Event data" filter; if you're unsure of the exact field path, click into a recent crashed event in the UI and use "Inspect" to see the JSON payload and confirm the field (commonly `run.run_count`).
- This approach is per-run and very predictable.

Option 2: gate by a metric (count previous crashes for the same run)
If your UI doesn't expose `run.run_count` filtering:
- Automation A (reschedule while below the cap)
  - Trigger: Flow run → Crashed
  - Additional condition: Metric → count of "prefect.flow-run.Crashed" events for this flow run within, say, 24h is < N
  - Action: Change flow run state → Scheduled (delay 120s)
- Automation B (on/after the Nth crash)
  - Trigger: Flow run → Crashed
  - Additional condition: Metric → count of "prefect.flow-run.Crashed" events for this flow run within 24h is >= N
  - Action: Send notification / Declare incident

Extra tips:
- Add a delay (1–5 minutes) on the reschedule to avoid thrashing if Cloud Run infra is still tearing down.
- Consider a notification on every reschedule too, but mark the "exceeded attempts" path distinctly so you know when it requires manual intervention.
- This works cleanly with Cloud Run jobs; the same flow run will be picked back up by your Vertex/Cloud Run work pool with the same parameters.

If you'd like, I can draft a ready-to-POST automation JSON for either approach; just tell me your preferred cap N and delay in seconds.

Useful refs:
- Creating Automations and triggers
- Detecting and handling zombie/crashed runs
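To confirm the attempt counter before wiring up the filters, here is a small sketch that reads it with the Prefect 3 client; `run_count` is the field on the flow run object, though the exact path inside the event payload may differ:

```python
# Sketch: check how many times a given flow run has been attempted, which is the
# counter you can gate the "reschedule" automation on. Assumes Prefect 3.x.
import asyncio

from prefect.client.orchestration import get_client

MAX_ATTEMPTS = 3  # hypothetical cap N


async def should_reschedule(flow_run_id: str) -> bool:
    async with get_client() as client:
        flow_run = await client.read_flow_run(flow_run_id)
    print(f"run_count={flow_run.run_count}, state={flow_run.state_name}")
    # Only reschedule while under the cap; otherwise leave it for a human.
    return flow_run.run_count < MAX_ATTEMPTS


if __name__ == "__main__":
    print(asyncio.run(should_reschedule("<flow-run-uuid>")))
```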
j
@Nate sorry for the tag, just whenever you have a sec, could you double-check the answers from Marvin? I feel like they're not 100% correct.
n
@José Agretti marvin seems correct in the sense that an automation to move the flow run into scheduled seems most direct, unless you want a separate flow run object for some reason
what about the answers feels incorrect?
j
Ah ok, yeah no that should work I think. No, I was kind of worried because the UI is not showing a lot of things, but maybe I should use the API or Terraform to define things. And do you know if I should be able to add like a tag to the retry to then ignore if it's a second retry or something?
(and thank you so much for answering)
n
yea the `RunDeployment` action is not what you want here it seems, which might have been the initial confusion. You should be able to use the run count Marvin mentioned to do logic specific to which retry it is.
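A sketch of the retry-specific logic Nate mentions, done inside the flow itself; it assumes `prefect.runtime.flow_run.run_count` is available in your Prefect 3 version:

```python
# Sketch: branch on which attempt this is from inside the flow, so a rescheduled
# run can behave differently (e.g., skip a warm-up step or alert after N tries).
from prefect import flow, get_run_logger
from prefect.runtime import flow_run


@flow
def my_flow(x: int = 1) -> int:
    logger = get_run_logger()
    # run_count is assumed to be exposed by prefect.runtime; it is 1 on the first attempt.
    attempt = flow_run.run_count or 1
    if attempt > 1:
        logger.info(f"Attempt #{attempt}: this run was rescheduled automatically")
    return x * 2


if __name__ == "__main__":
    my_flow(3)
```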