
Andreas Nigg

12/03/2022, 1:33 PM
Hey! I have a small problem with long-running parent flows. There is one parent flow and two subflows. The subflows each need between 2 and 30 hours (yes, yes, I know...). Quite often, the parent flow enters the state "crashed" with the state message "Flow run infrastructure exited with non-zero status code -1.". Interestingly, it always happens at a runtime of around 4 hours. Fortunately, the subflows continue to run and also finish, so it does not really impact the pipeline, which simply succeeds. Still, it's not nice that the parent flow is reported as crashed. Any ideas what could cause the parent to fail? In the infrastructure logs (see thread), everything seems fine. (One thing I have to admit: the flows generate approx. 3 log lines per second. Is this the problem?) (Second thing to admit: there are a lot of tasks, about 2,000.) Setup: Prefect 2.0 Cloud, v2.7.0 agents; the infrastructure is a Kubernetes job. EDIT: I changed the logs and tasks to have far fewer of both. In total there are now fewer than 100 tasks and about 10 log lines per minute. But still the same: the parent flow crashed after 4 hours, and the subflows happily continued.
(As you can see in the logs, at the time of writing the flow is displayed as "crashed" in the UI, but it still keeps running and logging. Don't mind the 1-hour time difference between the UTC logs and my local-time Windows machine.)
An update after weekend-long flow runs 😄 This issue seems reproducible with v2.7.0: have a subflow which runs for more than 4 hours, and the parent flow is marked as crashed while the subflows happily continue running.
2022-12-04T12:07:28.047048204Z 12:07:28.045 | ERROR   | prefect.infrastructure.kubernetes-job - Job 'omega6-sulafat-z-2fhpm': Job did not complete.
2022-12-04T12:07:28.265814357Z 12:07:28.264 | INFO    | prefect.agent - Reported flow run '7d098a5f-c32e-4fa4-9e34-de30c43e8572' as crashed: Flow run infrastructure exited with non-zero status code -1.
Is this related to this issue: https://github.com/PrefectHQ/prefect/issues/7743 ? Or is it worth a separate issue? (I don't want to spam Prefect engineers with duplicate issues.)
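For anyone trying to reproduce this, the setup described above (one parent flow that calls two long-running subflows) might look roughly like the sketch below. This is not the actual pipeline code: the names `parent_flow` and `long_subflow`, the `seconds` parameter, and the 5-hour default are assumptions for illustration, and the real work is replaced by a sleep. The `try/except` fallback is only there so the sketch also runs where Prefect is not installed.

```python
import time

try:
    from prefect import flow  # Prefect 2.x decorator
except ImportError:
    # Fallback (assumption, not part of Prefect): a no-op decorator so the
    # sketch can still be executed without Prefect installed.
    def flow(fn=None, **_kwargs):
        if fn is None:
            return lambda f: f
        return fn

# Hypothetical duration, chosen to exceed the ~4 hour mark from the thread.
FIVE_HOURS = 5 * 60 * 60


@flow
def long_subflow(name: str, seconds: float) -> str:
    # Stand-in for the real work; the thread reports 2 to 30 hours per subflow.
    time.sleep(seconds)
    return f"{name} done"


@flow
def parent_flow(seconds: float = FIVE_HOURS) -> list:
    # Calling a flow from inside another flow creates a subflow run.
    first = long_subflow("subflow-a", seconds)
    second = long_subflow("subflow-b", seconds)
    return [first, second]
```

To mirror the report, `parent_flow()` would be deployed with Kubernetes job infrastructure and left to run past the 4-hour mark. Whether this minimal version alone triggers the crash has not been verified here.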

Luca Schneider

12/05/2022, 2:29 PM
Hi, I'm having a similar issue where I run ML training for a few hours and then deploy the model afterwards as a subflow. See here: https://prefect-community.slack.com/archives/CL09KU1K7/p1670248970739099

Jean Luciano

12/06/2022, 4:18 PM
Yup, it looks like you both found the right issue to comment on.