# prefect-kubernetes
k
👋 We routinely have flows in the Prefect UI stay in a 'Running' state for multiple days while the infrastructure (the k8s job pods) has crashed without Prefect being notified. These jobs typically complete in <10 min. On the work pool (job/pod configuration) we don't set the following field:
Job Watch Timeout Seconds (Optional)
Number of seconds to wait for each event emitted by a job before timing out. If not set, the worker will wait for each event indefinitely.
I understand this to mean that if a job pod stops emitting events (e.g. the pod crashes without sending any notice), Prefect will wait indefinitely (7 days is the max, I think?) to receive an event. Does anyone else experience the same thing, or could anyone provide guidance on an appropriate timeoutSeconds? I imagine timeoutSeconds should be specific to the actual work being conducted by each job, so if I were to provide a default, I would want something relatively large, e.g. 5 min (see the sketch below). Appreciate any thoughts!
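For context, here is a minimal sketch of the kind of default I have in mind, assuming the work pool is a standard Kubernetes work pool that exposes a `job_watch_timeout_seconds` job variable; the flow, work pool, and image names below are placeholders:

```python
from prefect import flow

# Hypothetical flow used only for illustration.
@flow(log_prints=True)
def my_flow():
    print("hello from the job pod")

if __name__ == "__main__":
    # Assumes an existing Kubernetes work pool named "k8s-pool" and a
    # prebuilt image; both names are placeholders.
    my_flow.deploy(
        name="watch-timeout-example",
        work_pool_name="k8s-pool",
        image="my-registry/my-image:latest",
        build=False,
        push=False,
        job_variables={
            # Give up on the job watch after 5 minutes instead of waiting
            # indefinitely, so a pod that dies silently surfaces as a crashed
            # flow run rather than sitting in 'Running' for days.
            "job_watch_timeout_seconds": 300,
        },
    )
```

I believe the same default could also be set on the work pool's base job template in the UI, rather than per deployment.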
m
What version of the Prefect worker are you using? https://github.com/PrefectHQ/prefect/issues/12988
k
Thanks for linking @Max Eggers - good issue for learning more about k8s-prefect.
• For our flows, the Docker image uses prefect-client==3.0
• Worker Helm chart version 2024.5.30190018, which is the latest before the new release candidate
m
I think if you are using Prefect 3 you'd want your worker to be running Prefect 3.0.1 to have the various changes that address this; I'm not sure which Helm chart that'd come in.
k
Agreed. There's one newer chart - 2024.5.31205053, which uses 3.0.0rc1.