# ask-community
**Ton**
I posted this a while ago about `timeout_seconds` not working properly. I expect that after the timeout elapses while the flow is still in a Running state, the flow should be marked as Failed. However, if the agent is no longer able to talk to the flow (maybe the pod got evicted), its state will never be updated. Is this expected behaviour, or can I log an issue for this?
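For reference, a minimal reproduction looks something like this (a sketch assuming Prefect 2's `@flow(timeout_seconds=...)`; deploy it via a Kubernetes agent and evict the pod mid-run):

```python
from time import sleep

from prefect import flow

@flow(timeout_seconds=60)
def long_running():
    # Sleep well past the timeout. If the pod is evicted during this
    # sleep, the run stays in Running indefinitely instead of being
    # marked Failed after 60 seconds.
    sleep(3600)

if __name__ == "__main__":
    long_running()
```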
**Christopher Boyd**
Hi Ton, do you mean the flow started running but then lost communication with the agent? Or the flow never started running?
The former we have an open issue for, which I can link shortly; the latter should be configurable with job and pod timeouts.
**Ton**
Well, basically any situation where the flow remains in a Running state for longer than the specified timeout period. Most of the time this is because of communication loss between the agent and the flow. Since I never want lingering flows in a Running state, I'd like to add `timeout_seconds` as a safety valve.
**Christopher Boyd**
Ok, so you're asking more for a "don't run longer than this" limit, so you're not racking up usage?
**Ton**
Isn't that exactly what `timeout_seconds` is supposed to do?
**Christopher Boyd**
I don't think that's quite the behavior you'd get at the moment. `timeout_seconds` will cause the flow to be marked as Failed, not necessarily terminate the pod.
But then, if the flow terminates early in k8s, it will be restarted.
I think some additional logic / handling might be necessary here to terminate at the infra level.
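For example, terminating at the Kubernetes level could look like this (a sketch assuming Prefect 2's `KubernetesJob` block and its JSON-patch `customizations` field; `activeDeadlineSeconds` is enforced by k8s itself, not the agent):

```python
from prefect.infrastructure import KubernetesJob

# activeDeadlineSeconds makes Kubernetes kill the Job after the
# deadline, even if the agent has lost track of the flow run.
infra = KubernetesJob(
    customizations=[
        {"op": "add", "path": "/spec/activeDeadlineSeconds", "value": 3600}
    ],
)
infra.save("k8s-with-deadline", overwrite=True)
```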
**Ton**
_"timeout_seconds will cause the flow to be marked as failed"_: that's exactly what is currently not happening. The flow remains in a Running state long after `timeout_seconds` has elapsed (likely because it lost communication with the pod).
**Christopher Boyd**
Gotcha, I’ll look into this and raise a bug
**Ton**
@Jean-Michel Provencher
**Jean-Michel Provencher**
Hello @Christopher Boyd, is there a link for that bug so we can track it on our end? It's basically the biggest blocker preventing us from moving our production workload to Prefect.
**Christopher Boyd**
I discussed this internally with the team: `timeout_seconds` is evaluated locally, meaning that when the flow is retried, the timeout is relative to that run. I think there is some long-term discussion about making this a cloud-side value, so it's persisted through retries. Is the core issue the restart / de-sync behavior, or the timeout? Timing out your flow here would address the symptom, not the cause; the cause is the linked issue (7116), the agent becoming de-synced from the flow run.
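To illustrate what "locally evaluated" means: the enforcement lives inside the flow-run process itself, conceptually something like this (a rough sketch, not Prefect's actual implementation; `run_flow_with_timeout` is a hypothetical name):

```python
import anyio

async def run_flow_with_timeout(flow_fn, timeout_seconds: float):
    # The timeout only exists inside this process. If the pod dies or
    # loses contact with the API, nothing is left running to raise the
    # timeout and report the run as Failed.
    with anyio.fail_after(timeout_seconds):
        return await flow_fn()
```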
**Jean-Michel Provencher**
Yes, but it would be nice to have a fail-safe that guarantees a flow will never stay in a Running state for more than x minutes, no matter which Prefect bug is causing it.
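In the meantime, we're considering a watchdog along these lines as a fail-safe: a minimal sketch against Prefect 2's client API (module paths and filter names may vary by version; `MAX_RUNTIME` is a hypothetical cutoff):

```python
import asyncio
from datetime import datetime, timedelta, timezone

from prefect import get_client
from prefect.client.schemas.filters import (
    FlowRunFilter,
    FlowRunFilterState,
    FlowRunFilterStateType,
)
from prefect.client.schemas.objects import StateType
from prefect.states import Failed

MAX_RUNTIME = timedelta(minutes=60)  # hypothetical ceiling for any run

async def fail_stuck_runs():
    async with get_client() as client:
        # Find every flow run currently in a Running state.
        running = await client.read_flow_runs(
            flow_run_filter=FlowRunFilter(
                state=FlowRunFilterState(
                    type=FlowRunFilterStateType(any_=[StateType.RUNNING])
                )
            )
        )
        now = datetime.now(timezone.utc)
        for run in running:
            if run.start_time and now - run.start_time > MAX_RUNTIME:
                # force=True so orchestration rules don't reject the change.
                await client.set_flow_run_state(
                    flow_run_id=run.id, state=Failed(), force=True
                )

if __name__ == "__main__":
    asyncio.run(fail_stuck_runs())
```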
**Christopher Boyd**
If you have an example / MRE available and would like to raise this issue, that would be helpful. I have not yet reproduced it or created the issue.