Sometimes, one of our flows will fail with a CRASH...
# ask-community
s
Sometimes, one of our flows will fail with a CRASHED state. We've learned that this is a Prefect infrastructure error, so retries will not happen. However, we need to figure out how to detect a crash and resend the flow. Any suggestions? For what it's worth, we're seeing `Flow run infrastructure exited with non-zero status code -9.` with the crash.
n
hmm, do you have dynamic flow run params for this flow? if not, it should be as simple as making an automation that:
• trigger: on flow run enter Crashed
• action: Run Deployment > Infer Deployment
🙌 1
I will raise this internally either way, since it would be nice for Infer Deployment to re-use flow run params in this case
z
What type of infrastructure are you using?
s
OK thanks for the response. I was not aware of those triggers and actions. We are using GCP and Kubernetes.
t
this seems to be the same issue mentioned here: https://prefect-community.slack.com/archives/CL09KU1K7/p1688664433269099 always firing the same deployment upon a crash could create an infinite loop if the crash is due to an inherent problem with the job (like it going `OOM`, for example). there needs to be some more robust solution….
also, this creates a problem for `run_deployment` of subflows that might crash (e.g. due to happenstance of an eviction happening in k8s): https://github.com/PrefectHQ/prefect/issues/10620
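One way to picture the "more robust solution" to the infinite-loop concern above is to cap how many times a crashed run gets resubmitted. The following is only a rough sketch under assumptions: `CrashedRun`, `plan_resubmit`, the `crash-resubmit:` tag convention, and the cap of 2 are all hypothetical names invented for illustration, not Prefect features. It does not talk to Prefect at all; it only decides whether a resubmit should happen and carries the original parameters forward.

```python
from dataclasses import dataclass

# Hypothetical values for illustration; neither is a Prefect setting.
MAX_CRASH_RESUBMITS = 2
RESUBMIT_TAG_PREFIX = "crash-resubmit:"


@dataclass(frozen=True)
class CrashedRun:
    """Minimal stand-in for a crashed flow run (not a Prefect class)."""
    deployment_name: str
    parameters: dict
    tags: tuple = ()


def resubmit_count(tags):
    """Return how many resubmits are already recorded in the run's tags."""
    for tag in tags:
        if tag.startswith(RESUBMIT_TAG_PREFIX):
            return int(tag[len(RESUBMIT_TAG_PREFIX):])
    return 0


def plan_resubmit(run, max_resubmits=MAX_CRASH_RESUBMITS):
    """Return kwargs for a new run, or None once the cap is reached."""
    count = resubmit_count(run.tags)
    if count >= max_resubmits:
        # Likely an inherent problem (e.g. OOM every time): stop looping.
        return None
    other_tags = [t for t in run.tags if not t.startswith(RESUBMIT_TAG_PREFIX)]
    return {
        "name": run.deployment_name,
        "parameters": dict(run.parameters),  # re-use the original params
        "tags": other_tags + [f"{RESUBMIT_TAG_PREFIX}{count + 1}"],
    }
```

If the plan is not `None`, it could in principle be handed to something like `run_deployment(**plan)`, assuming the installed Prefect version accepts `name`, `parameters`, and `tags` there. Whether tags are the right place to keep the counter is a design choice; the point is only that the resubmit decision needs some state that survives across runs.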