# prefect-kubernetes
k
👋 We routinely have flows in the Prefect UI stay in a 'Running' state for multiple days while the infrastructure (the k8s job pods) has crashed without Prefect being notified. These jobs typically complete in <10 min. On the work pool (job/pod configuration) we don't set the following field:
Job Watch Timeout Seconds (Optional)
Number of seconds to wait for each event emitted by a job before timing out. If not set, the worker will wait for each event indefinitely.
I understand this to mean that if a job pod stops emitting events (e.g. the pod crashes without sending any notice), Prefect will wait indefinitely (7 days is the max, I think?) to receive an event. Does anyone else experience the same thing, or could anyone provide guidance on an appropriate timeoutSeconds? I imagine timeoutSeconds should be specific to the actual work being conducted by each job, so if I were to provide a default, I would want something relatively large, e.g. 5 min (see the sketch below). Appreciate any thoughts!
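For context, here is a minimal sketch of the kind of default I have in mind, assuming the work pool is a standard Kubernetes work pool that exposes a `job_watch_timeout_seconds` job variable; the flow, work pool, and image names below are placeholders:

```python
from prefect import flow

# Hypothetical flow used only for illustration.
@flow(log_prints=True)
def my_flow():
    print("hello from the job pod")

if __name__ == "__main__":
    # Assumes an existing Kubernetes work pool named "k8s-pool" and a
    # prebuilt image; both names are placeholders.
    my_flow.deploy(
        name="watch-timeout-example",
        work_pool_name="k8s-pool",
        image="my-registry/my-image:latest",
        build=False,
        push=False,
        job_variables={
            # Give up on the job watch after 5 minutes instead of waiting
            # indefinitely, so a pod that dies silently surfaces as a crashed
            # flow run rather than sitting in 'Running' for days.
            "job_watch_timeout_seconds": 300,
        },
    )
```

I believe the same default could also be set on the work pool's base job template in the UI, rather than per deployment.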
m
What version of the Prefect worker are you using? https://github.com/PrefectHQ/prefect/issues/12988
k
Thanks for linking @Max Eggers - good issue for learning more about k8s-prefect.
• For our flows, the Docker image uses prefect-client==3.0
• Worker Helm chart version 2024.5.30190018, which is the latest before the new release candidate
m
I think if you are using Prefect 3 you'd want your worker to be running Prefect 3.0.1 to have the various changes that address this; I'm not sure which Helm chart that'd come in.
k
Agreed. There's one newer chart - 2024.5.31205053, which uses 3.0.0rc1.