Prefect Community #ask-community: Upon a Retry, is it expected that subflows behave ...

scott

07/25/2023, 3:32 AM

Upon a Retry, is it expected that subflows behave the same way as their parent flows with respect to only re-running tasks that failed/crashed/etc? A colleague of mine said that on retrying a flow with subflows the tasks of the subflow that failed all re-run, vs. just starting from the failed one. I can’t reproduce this with a simple example - so I’m dropping in here to see what the expected behavior is. Thanks!

Deceivious

07/25/2023, 8:41 AM

Example: if a main flow triggers 3 subflows and one of them fails, 2 subflows will be COMPLETED, 1 subflow will be FAILED, and the main flow will be FAILED. Subflows cannot be "Retried". If you retry the failed main flow, all the subflows will be re-executed.

scott

07/25/2023, 4:12 PM

Thanks for confirming. Would be nice to hear from a Prefect employee about whether this is a feature or maybe unintended behavior that can be changed

scott

07/25/2023, 4:13 PM

There’s always the option to use `run_deployment` instead of a subflow, but that’s not as tidy of a solution

Deceivious

07/25/2023, 4:15 PM

I think even if run_deployment is used, it will spawn a new flow run unless the idempotency key is specified.

scott

07/25/2023, 4:15 PM

Yes, i know that.

scott

07/28/2023, 12:08 AM

@Will Raphaelson curious if you can confirm this is expected behavior for subflows? i.e., that they restart from the beginning if the parent flow is retried?

Deceivious

07/28/2023, 9:54 AM

I tested it: only the failed subflows are restarted

Will Raphaelson

07/28/2023, 3:15 PM

yeah but I think what scott is asking is: within a failed subflow, are all tasks rerun? based on my repro, it does seem like the expected behavior that on retry, a new subflow (ignorant of task states) is spawned and thus all tasks are rerun. So I think the current behavior is essentially that on parent retry, subflow runs are re-created from scratch, not re-tried. One thing i want to try real quick is if the manual toggling of result persistence will affect the behavior here. if not, i think we should probably hash out the merits on a github issue, as there may be a valid technical reason this is the way that it is.

Will Raphaelson

07/28/2023, 3:28 PM

yeah that didnt work either. let me circle up with some folks internally on this.

Will Raphaelson

07/28/2023, 3:39 PM

@scott I remain curious about if we could enable true subflow retries, but also wondering if you get where you need to go just by adding retries to the subflow itself? then the parent flow never fails at all and we limit the retry to the true flow run retry behavior on only the subflow

scott

07/28/2023, 4:10 PM

Thanks for your response @Will Raphaelson!

wondering if you get where you need to go just by adding retries to the subflow itself? then the parent flow never fails at all and we limit the retry to the true flow run retry behavior on only the subflow

That seems problematic if the parent flow can never fail, yeah? I assume by adding retries you mean the `retries` param of `@flow` (https://docs.prefect.io/2.11.0/api-ref/prefect/flows/#prefect.flows.Flow)? But that’s not linked to hitting the Retry button in the UI, right?

scott

07/28/2023, 4:11 PM

Thanks for confirming this is expected behavior - it’s nice to get that info

Will Raphaelson

07/28/2023, 4:18 PM

ahh okay i might need more detail. so for one - the retry button and the kwarg on the decorator do the same thing, which is to put the flow in a retrying state, which only reruns failed tasks. i did think you were referring to the flow run kwargs retry functionality, and so my suggestion was to add a kwarg to the subflow decorator. I also assumed the flow was being failed because of failed tasks in the subflow. when i said “the parent flow never fails” im assuming that a retry on the subflow succeeds, and thus the whole chain succeeds. this is slightly dizzying 😵‍💫

Will Raphaelson

07/28/2023, 4:19 PM

im still asking internally if we can get retries of a parent flow (regardless of their method of initiation) to Actually retry subflows instead of creating new ones. that seems like a correct behavior.

scott

07/28/2023, 4:20 PM

im still asking internally if we can get retries of a parent flow (regardless of their method of initiation) to Actually retry subflows instead of creating new ones. that seems like a correct behavior.

Yep, that’s what I’m looking for 🙏

Will Raphaelson

07/28/2023, 4:23 PM

yeah i think thats a good enhancement proposal. just from being around prefect for a while it smells like there may be a “good” reason we didnt take it on, but let me continue to poke. ill follow up with an issue if I think we can tackle it.

scott

07/28/2023, 4:29 PM

Okay, thanks very much! If it helps, here’s the two major routes we’re considering instead - no need to respond:

1. Use subflow functions like `my_subflow.fn()`, which i think makes its tasks run as if they weren’t in a subflow, but at least in our code there is still separation. we’re going with this route for now.
2. Replace all subflows with `run_deployment` with `timeout=None` so we wait for it to finish - this route didn’t work because any failures in these deployments appear to not cascade up to the parent flow, so it’s not a useful approach. Sure, we can retry these deployments nested within the parent flow separately, but that’s not a great user experience to have to manage retries separately for each “subflow” (that’s really a separate deployment)

Will Raphaelson

07/28/2023, 4:29 PM

thanks for that info

Deceivious

07/29/2023, 7:07 AM

What happens to the failed subflow run? I doubt it would be cleared from the database. It does get detached from the calling subflow right?

Tom Klein

08/30/2023, 12:57 AM

we are also interested in this issue as we have a huge, long-lasting parent flow with lots (hundreds) of subflows spawned and executed async via `run_deployment`. for whatever reason, the parent flow died (who knows, maybe got evicted by k8s - although - there’s no indication for it in the agent logs…) - but then we lost all the progress info on the subflows. The subflows themselves write their data to the DB, so - they don’t “return” anything and we don’t care about them other than to know if they succeeded or not. They are also idempotent so there’s no HARM in them running more than once, but - it’s obviously a waste of time & resources. Because of the issue described in this thread, once retried by Prefect, the parent flow retries to execute all subflows. Oddly, the waterfall remains the same so we have long orange (`Late`) strands of the old subflows but when we click into them, they seem to show a new subflow with no logs, no previous runs, etc.

We are trying to understand if we should be using an `idempotency key` - and if so - what strategy should we choose so that this doesn’t interfere with other flow runs with different params? e.g. - does `{parent-flow-run-id}-{subflow-index}` make sense? (assuming all the subflows are indexed from 0 to N) what would happen exactly when the parent flow is retried? it just skips them or it can “find” the old flow run and recognize that it was complete (for the sake of composing the final state - which is the union of all final states of all subflows) or - will the re-run just think to itself it has one less sub-flow?

Also:

1. are there any plans to fix this or change this behavior? @Will Raphaelson
2. @scott i’m not sure i understood why you suggested to add `timeout=None`? we currently wait for the subflows to finish just by virtue of them being (await) `gathered` with asyncio. am i missing something? Also, we had no problem to see errors propagating up from subflows.. maybe it’s related to a change that was made in a recent version?
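The `{parent-flow-run-id}-{subflow-index}` key scheme asked about above can be sketched in plain Python. This is only an illustration of how the keys would be built: the helper names are hypothetical, and the comment showing how a key might be handed to `run_deployment` assumes its `idempotency_key` parameter and a made-up deployment name. It does not answer what Prefect does with the key on a parent retry.

```python
from typing import List


def subflow_idempotency_key(parent_flow_run_id: str, subflow_index: int) -> str:
    """Build a key of the form {parent-flow-run-id}-{subflow-index}."""
    return f"{parent_flow_run_id}-{subflow_index}"


def keys_for_subflows(parent_flow_run_id: str, count: int) -> List[str]:
    """Return one key per subflow, indexed from 0 to count - 1."""
    return [subflow_idempotency_key(parent_flow_run_id, i) for i in range(count)]


# Assumed usage inside the parent flow (not executed here):
#   key = subflow_idempotency_key(parent_run_id, i)
#   run_deployment(name="my-flow/my-deployment", idempotency_key=key)
# where "my-flow/my-deployment" is a hypothetical deployment name and
# parent_run_id is the id of the current parent flow run.
```

Because the parent flow run id is part of the key, two different parent runs (for example, runs with different params) produce disjoint key sets, while a re-execution under the same parent run id produces the same keys again.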

Tom Klein

08/30/2023, 1:41 PM

I’ll just update that - I just ran this exact thing with `idempotency_key` - and it does seem like older subflows (that were generated with `run_deployment`) are “safe” from re-execution once their `idempotency_key` is set correctly. i mean that in the sense that not only are they not re-executed, but also it seems like their status is remembered. HOWEVER, the flow was not actually “retried” per se - but rather it seems like it keeps getting evicted and re-run by the K8s infra. Prefect does seem to consider it to be new runs (in the run count) so take that with a grain of salt
