
Dominick Olivito

08/14/2023, 3:17 PM
hi all, we're using k8s on GKE to run our flows. they're triggered from deployments in Prefect Cloud. we're running prefect 2.10.20 with an Agent on k8s. we occasionally see flow runs transition into a CRASHED state before RUNNING and then COMPLETED. the Run Count is 1 in this case. does anyone have suggestions of what we can check? here's an example set of transitions:
2023-08-12T20:00:48.907217+00:00 SCHEDULED Scheduled
2023-08-12T20:00:51.114967+00:00 PENDING Pending
2023-08-12T20:01:52.804776+00:00 CRASHED Crashed
2023-08-12T20:02:05.900502+00:00 RUNNING Running
2023-08-12T20:02:31.613829+00:00 COMPLETED Completed
i see that it's going into CRASHED state after about 60 seconds, which is the default value for pod_watch_timeout_seconds. i'm going to try increasing that
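A quick way to confirm the timing is to compute the gaps between transitions. This is only a minimal sketch using the Python standard library on the timestamps pasted above; the helper name seconds_between is made up for illustration and is not part of Prefect.

```python
from datetime import datetime

# State transitions copied from the flow run above.
transitions = [
    ("2023-08-12T20:00:48.907217+00:00", "SCHEDULED"),
    ("2023-08-12T20:00:51.114967+00:00", "PENDING"),
    ("2023-08-12T20:01:52.804776+00:00", "CRASHED"),
    ("2023-08-12T20:02:05.900502+00:00", "RUNNING"),
    ("2023-08-12T20:02:31.613829+00:00", "COMPLETED"),
]


def seconds_between(transitions, start_state, end_state):
    """Seconds from the first start_state to the first end_state.

    Hypothetical helper for illustration only, not a Prefect API.
    """
    first_seen = {}
    for stamp, state in transitions:
        # Keep only the first time each state appears.
        first_seen.setdefault(state, datetime.fromisoformat(stamp))
    return (first_seen[end_state] - first_seen[start_state]).total_seconds()


pending_to_crashed = seconds_between(transitions, "PENDING", "CRASHED")
crashed_to_running = seconds_between(transitions, "CRASHED", "RUNNING")
print(f"PENDING -> CRASHED: {pending_to_crashed:.1f}s")
print(f"CRASHED -> RUNNING: {crashed_to_running:.1f}s")
```

The PENDING to CRASHED gap comes out just over 60 seconds, which lines up with the default timeout, and the pod reports RUNNING about 13 seconds after that.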

Sunny Pachunuri

08/17/2023, 9:50 PM
Hey @Dominick Olivito: Have you figured out what is causing this? I am running my Agent in EKS and it is running fine. But when I run a flow it always goes into Crashed status, and then after a couple of minutes it goes into Completed. No idea why this is happening
In my case the crash is happening instantaneously

Dominick Olivito

08/18/2023, 12:53 AM
i haven't seen it again since i increased the value of pod_watch_timeout_seconds to 600. it looked like our pods were sometimes taking a few minutes to start up, especially if we started several flows at the same time. if it's going into CRASHED state immediately, i would just check that pod_watch_timeout_seconds is not set to 0. beyond that, i'm not sure of the other possible causes

Sunny Pachunuri

08/18/2023, 1:50 PM
Thanks a lot Dominick