<@ULVA73B9P> If infrastructure get killed because ...
# marvin-in-the-wild
m
@Marvin If infrastructure gets killed because it failed or because it was a spot instance that got yanked, and retries are specified for the flow, why doesn't the flow retry? Why does it just fail?
m
So, why aren't retries working when infrastructure fails or a spot instance gets yanked? Let me check for relevant details on this issue. First, can you tell me the version of Prefect you're using?
m
@Marvin prefect 2.19.3
m
### Here's what I found:

GitHub Issues:

1. Add flow run submission retries for infrastructure creation failures: This issue discusses the exact problem you're encountering – flows failing due to infrastructure issues and not retrying. It suggests adding retries and retry delay seconds directly in the infrastructure block to handle external failures, ensuring the flow attempts to rerun.
2. Support retrying flows on crashed state: This issue addresses how retries for flows currently don't handle crashes and suggests implementing retry functionality for `crashed` states as well.
3. Flow with Retry Delay Seconds runs state hooks and automations multiple times: It addresses a related problem where retries and delays cause redundant execution of state hooks and automations, which might be causing additional issues in retries.

### Key Problems and Proposed Solutions

- Prefect currently does not handle retries for flows that fail due to infrastructure or external issues out of the box.
- Discussions in the community suggest the need for retries and delay management directly in the infrastructure configuration.

### Proposed Actions:

1. Retries Configuration in Infrastructure: Check if newer versions or updates to Prefect have added configurations for retries directly in the infrastructure blocks, like ECSTask or KubernetesJob. This might solve the problem directly.
2. Handle `Crashed` States: Implement custom logic or automation that can handle crashed states and respect retry limits.

If these solutions don't cover your needs, consider following or contributing to the related GitHub issues to track or push for updates. Lastly, these solutions are relevant for Prefect 2.x. If you're open to considering an upgrade, subsequent versions may have incorporated additional fixes. Hope this helps!
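As a rough illustration of the "custom logic that can handle crashed states and respect retry limits" idea, here is a minimal sketch of just the decision step in plain Python. It deliberately does not call any Prefect API: `RunRecord`, `should_resubmit`, `select_runs_to_resubmit`, and the `max_resubmits` limit are hypothetical names made up for this example, and how you fetch crashed runs and resubmit them (for instance from an automation or a scheduled script) is left as an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class RunRecord:
    """Hypothetical, simplified view of a flow run (not a Prefect class)."""
    run_id: str
    state_type: str      # e.g. "CRASHED", "FAILED", "COMPLETED"
    resubmit_count: int  # how many times this run was already resubmitted


def should_resubmit(run: RunRecord, max_resubmits: int) -> bool:
    """Resubmit only crashed runs that are still under the limit."""
    return run.state_type == "CRASHED" and run.resubmit_count < max_resubmits


def select_runs_to_resubmit(runs: Iterable[RunRecord], max_resubmits: int = 3) -> List[str]:
    """Return the IDs of the runs that qualify for another attempt."""
    return [run.run_id for run in runs if should_resubmit(run, max_resubmits)]


if __name__ == "__main__":
    observed = [
        RunRecord("run-a", "CRASHED", 0),  # spot instance yanked, never resubmitted
        RunRecord("run-b", "FAILED", 0),   # ordinary failure, left to flow retries
        RunRecord("run-c", "CRASHED", 3),  # already hit the resubmit limit
    ]
    print(select_runs_to_resubmit(observed, max_resubmits=3))
```

Keeping the limit check separate from the resubmission call means the same rule can be reused whichever mechanism ends up rescheduling the run, and it avoids endlessly resubmitting a run whose infrastructure keeps getting killed.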