Tom Klein

08/31/2023, 3:54 AM
Strange phenomenon for flows with subflows (that are run with run_deployment, but maybe also those that are not): the parent flow is marked as “failed” before the subflows have finished. The last ones remaining happened to have a task that failed and was being retried, but somehow the parent flow decided to move to a terminal state. Does anyone know anything about this? Has anyone experienced it before?

Saiful Khan

08/31/2023, 4:06 AM
Something similar is happening to us as well. We run subflows and wait (sync) for them to finish. The subflow finishes successfully, but the parent flow fails when attempting to fetch the results of the subflow. It says “Flow run could not be submitted to infrastructure”. But that is incorrect, because I can see the subflow was successful and its results were also stored on S3 where they should be.
Happens about once every 500 runs. We are using ECS infrastructure.

Tom Klein

08/31/2023, 4:21 AM
interesting, wonder if it’s related. We run on EKS though… 🤔

Saiful Khan

08/31/2023, 4:23 AM
I don’t think infrastructure is the reason though. We had capacity and availability issues, but we’ve handled those since. This seems like a Prefect agent/server issue, but I’m not sure.

Tom Klein

09/04/2023, 12:28 PM
ok, I think we understand a bit better what’s going on (in our case at least?): the subflow crashes (for example, because of an eviction) and is then retried by K8s, but the crash has already been “returned” to the parent as a terminal state, i.e. the parent doesn’t wait for all infra retries to be exhausted (I guess?)
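To make the timing problem described here concrete, the following is a minimal sketch in plain Python, not Prefect code. It assumes a parent that polls a subflow's state and treats a crash as provisional for a few polls, so that an infrastructure-level restart has a chance to bring the run back. The state names, `settled_state`, and `crash_grace_polls` are all hypothetical and chosen only to mirror the words used in this thread.

```python
from typing import Iterable

# Hypothetical state names for illustration; they mirror the wording in this
# thread and are not taken from any Prefect API.
FINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED"}


def settled_state(polled_states: Iterable[str], crash_grace_polls: int = 2) -> str:
    """Return the state a parent should act on.

    A CRASHED observation only counts once it has been seen on more than
    `crash_grace_polls` consecutive polls, which leaves room for the
    infrastructure (for example a Kubernetes Job) to restart the run.
    """
    consecutive_crashes = 0
    last_seen = "UNKNOWN"
    for state in polled_states:
        last_seen = state
        if state in FINAL_STATES:
            return state
        if state == "CRASHED":
            consecutive_crashes += 1
            if consecutive_crashes > crash_grace_polls:
                return "CRASHED"
        else:
            # The run came back (e.g. RUNNING again), so reset the counter.
            consecutive_crashes = 0
    # Polling ended without a settled answer; report what was seen last.
    return last_seen
```

With this logic, a sequence such as RUNNING, CRASHED, RUNNING, COMPLETED settles on COMPLETED rather than on the intermediate crash. Whether a real parent flow can be made to wait like this depends on how the subflow is launched and is not something this thread confirms.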

Saiful Khan

09/04/2023, 12:33 PM
I don’t think it fails because of something like this in our case. We do not even have retries enabled on our subflows. In the example above, all subflows succeeded on their first try because it is such a trivial task. And yet, they were reported as failed by the parent flow. As the traceback shows, it fails when fetching the subflow result data, for some mysterious reason.

Tom Klein

09/04/2023, 12:35 PM
the retries (in our case) are not at the Prefect level but at the infra level. I think Prefect doesn’t consider crashes as failures, so it always lets the infra retry as many times as it “wants” and only marks the flow as failed when the infra says it’s failed. BUT I think the exception is run_deployment, which does (incorrectly) interpret a crash as a failure. I don’t think our case is the same, unfortunately (or, if they are related, it doesn’t appear so)

Saiful Khan

09/04/2023, 12:37 PM
right, our problems are likely different. Two bugs expressing very similar behavior.