    Maria

    11 months ago
    Hi! I'm trying to figure out the easiest way to restart a flow run that crashed before it started. For context, we are hitting a concurrency issue:
    An error occurred (ClientException) when calling the RegisterTaskDefinition operation: Too many concurrent attempts to create a new revision of the specified family.
    I'm aware there is an open GitHub issue (https://github.com/PrefectHQ/prefect/issues/4402). We adjusted the params as recommended, but it still happens. I want to restart the flow run, but I can't use the task retry feature because no tasks have started (the error above is the only log line I see, and the flow run has no start time).
    Kevin Kho

    11 months ago
    Hey @Maria, if this is ECS, there was a PR in the latest release. You can check the changelog here. To answer your question, though: maybe you can try setting this flow run's state to Scheduled.
    Maria

    11 months ago
    How do I set it to Scheduled? I mean, automatically? I'd like to capture that it failed somehow and restart it. And yes, it's ECS. Thanks, will check it out.
    Kevin Kho

    11 months ago
    Is this a one-time thing, or do you always want to resubmit it? If it's a one-time thing, you can go into the UI and set the state from Failed to Scheduled, and the run will be picked up. If you want it automatic, I think you could use a state handler to resubmit the run by setting it from Failed to Scheduled, or you could kick off another flow run, but I wouldn't recommend it, because there is a chance that you run into an infinite loop with that kind of setup.
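
One way to guard against the infinite loop mentioned above is to cap how many times a given run may be resubmitted. The following is only a minimal sketch in plain Python: `ResubmitBudget`, `max_resubmits`, and the run key are hypothetical names, not part of Prefect, and the counter lives in memory, so a real setup would need to persist it somewhere that survives process restarts.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ResubmitBudget:
    """Hypothetical helper: allows at most `max_resubmits` resubmissions per run key."""

    max_resubmits: int = 3
    _counts: Dict[str, int] = field(default_factory=dict)

    def allow(self, run_key: str) -> bool:
        """Return True and record the attempt if the run still has budget left."""
        used = self._counts.get(run_key, 0)
        if used >= self.max_resubmits:
            return False  # budget exhausted: leave the run in Failed
        self._counts[run_key] = used + 1
        return True
```

Whatever code moves a run from Failed back to Scheduled would call `allow(...)` first and skip the resubmission once it returns `False`.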
    Maria

    11 months ago
    The issue happens a few times a day, so maybe 5% of runs fail, and I'm looking for an automatic solution.
    Seems like I should upgrade to 0.15.7 and hope it's resolved; otherwise I will need to explore state handlers. My other idea was to have another flow that checks the first flow every ~15 min and restarts failed runs.
    Kevin Kho

    11 months ago
    If we can identify the ClientException from the state, then I guess it should be fine. That way, we restart only the runs that failed because of this error.
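
A minimal sketch of that check, assuming the failed state's message contains the same text as the log line quoted earlier in the thread. The function name is hypothetical and nothing here calls Prefect; it only matches on the message string, so it would need adjusting if the message is wrapped or truncated in practice.

```python
from typing import Optional


def is_concurrent_registration_error(message: Optional[str]) -> bool:
    """Return True only for the ECS 'too many concurrent attempts' ClientException."""
    if not message:
        return False
    return (
        "ClientException" in message
        and "RegisterTaskDefinition" in message
        and "Too many concurrent attempts" in message
    )
```

Matching on all three fragments keeps unrelated ClientException failures from being restarted by mistake.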