hi all, we're using k8s on GKE to run our flows. they're triggered from deployments in Prefect Cloud. we're running prefect 2.10.20 with an Agent on k8s.
we occasionally see flow runs transition into a CRASHED state before RUNNING and then COMPLETED. the Run Count is 1 in this case. does anyone have suggestions of what we can check?
here's an example set of transitions:
i see that it's going into CRASHED state after 60 seconds, which is the default value for pod_watch_timeout_seconds. i'm going to try increasing that
Sunny Pachunuri
08/17/2023, 9:50 PM
Hey @Dominick Olivito, have you figured out what is causing this? I am running my Agent in EKS and it is running fine. But when I run a flow, it always goes into CRASHED status, and then after a couple of minutes it goes into COMPLETED. No idea why this is happening.
Sunny Pachunuri
08/17/2023, 10:04 PM
In my case the crash is happening instantaneously.
Dominick Olivito
08/18/2023, 12:53 AM
i haven't seen it again since i increased the value of pod_watch_timeout_seconds to 600. it looked like our pods were sometimes taking a few minutes to start up, especially if we started several flows at the same time.
if it's going into CRASHED state immediately, i would just check that pod_watch_timeout_seconds is not set to 0. beyond that, i'm not sure of the other possible causes
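For anyone landing here later, a rough sketch of the fix described above. It assumes the deployment uses the KubernetesJob infrastructure block from Prefect 2.x, which has a pod_watch_timeout_seconds field. The helper names (pick_pod_watch_timeout, build_job_block), the 2x headroom, and the block name in the usage note are made up for illustration and are not from this thread or from Prefect.

```python
from typing import Sequence

# Default mentioned in the thread: the flow run was marked CRASHED after 60 seconds.
DEFAULT_POD_WATCH_TIMEOUT = 60


def pick_pod_watch_timeout(
    observed_startup_seconds: Sequence[float],
    headroom: float = 2.0,
    floor: int = DEFAULT_POD_WATCH_TIMEOUT,
) -> int:
    """Hypothetical helper: choose a timeout from observed pod start-up times.

    Takes the slowest observed start-up, multiplies it by `headroom`,
    and never returns less than `floor`.
    """
    if not observed_startup_seconds:
        return floor
    return max(floor, int(max(observed_startup_seconds) * headroom))


def build_job_block(timeout_seconds: int):
    """Build a KubernetesJob block with a longer pod watch timeout.

    Assumes Prefect 2.x is installed; the import is done lazily so the
    helper above can be used without it.
    """
    if timeout_seconds <= 0:
        # The thread suggests checking the value is not 0 when runs crash immediately.
        raise ValueError("pod_watch_timeout_seconds should be greater than 0")

    from prefect.infrastructure import KubernetesJob

    return KubernetesJob(pod_watch_timeout_seconds=timeout_seconds)
```

With pods that took up to about five minutes to start, pick_pod_watch_timeout([45, 180, 300]) gives 600, the value that worked in this thread. The block could then be saved with something like build_job_block(600).save("k8s-slow-start", overwrite=True) and referenced from the deployment.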