
Ton Steijvers

02/10/2023, 4:04 PM
I posted this a while ago about `timeout_seconds` not working properly. I expect that after the timeout elapses and the flow is still in a Running state, the flow should be marked as failed. However, if the agent is not able to talk to the flow any more (maybe the pod got evicted) then its state will not be updated any more. Is this expected behaviour or can I log an issue on this one?
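(For context: in Prefect 2, `timeout_seconds` is set on the flow, e.g. `@flow(timeout_seconds=3600)`. The failure mode described above comes down to the timeout being enforced by whatever process is watching the run. A rough stdlib sketch of that property — all names here are illustrative, not Prefect's implementation:)

```python
import concurrent.futures
import time

def slow_work():
    # Stand-in for flow code that outlives its timeout.
    time.sleep(0.5)
    return "done"

def monitor(fn, timeout_seconds):
    """Enforce a timeout from the monitoring process.

    The key property: the Failed state is only ever written by this
    monitor. If the monitor loses contact with the work (or dies),
    nothing else marks the run as failed -- it stays "Running",
    which is the behaviour reported above.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            result = future.result(timeout=timeout_seconds)
            return ("Completed", result)
        except concurrent.futures.TimeoutError:
            return ("Failed", None)

print(monitor(slow_work, 0.1))  # -> ('Failed', None)
```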

Christopher Boyd

02/10/2023, 4:10 PM
Hi Ton - do you mean if the flow started running, but then lost communication with the agent? Or if the flow never started running?
The former, we do have an open issue for that I can link shortly; the latter should be configurable with job and pod timeouts

Ton Steijvers

02/10/2023, 4:33 PM
Well basically any situation where the flow will remain in Running state for more than the specified timeout period. Most of the time this is because of communication loss between agent and flow. Since I never want lingering flows in a Running state I'd like to add timeout_seconds as a safety valve.

Christopher Boyd

02/10/2023, 5:05 PM
Ok so you’re asking more for like a “don’t run more than this time” so you’re not racking up usage?

Ton Steijvers

02/10/2023, 5:17 PM
Isn't that exactly what `timeout_seconds` is supposed to do?

Christopher Boyd

02/12/2023, 4:10 PM
I think this isn’t quite the behavior you’d anticipate here at the current moment. `timeout_seconds` will cause the flow to be marked as failed, not necessarily terminate the pod
But then if the flow terminates early in k8s it will be restarted
I think some additional logic / handling might be necessary here to terminate at the infra level
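(One way to get that infra-level termination, assuming the flow runs as a plain Kubernetes Job — this is standard Kubernetes, not something Prefect configures for you: `activeDeadlineSeconds` kills the pod after a wall-clock budget, and `backoffLimit: 0` with `restartPolicy: Never` prevents the restart behavior described above. A sketch with illustrative names:)

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-flow-run                 # illustrative name
spec:
  activeDeadlineSeconds: 3600       # kill the pod at the infra level after 1h
  backoffLimit: 0                   # do not retry the pod on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flow
          image: my-prefect-image:latest   # illustrative image
```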

Ton Steijvers

02/13/2023, 8:05 AM
_"Timeout_seconds will cause the flow to be marked as failed"_ that's what is currently not happening. Flow remains in Running state long after timeout_seconds has elapsed (likely because it lost communication with the pod).

Christopher Boyd

02/13/2023, 1:56 PM
Gotcha, I’ll look into this and raise a bug

Ton Steijvers

02/14/2023, 3:15 PM
@Jean-Michel Provencher

Jean-Michel Provencher

02/14/2023, 3:15 PM
Hello @Christopher Boyd, any link for that bug so that we can track it on our end? It's basically the biggest blocker keeping us from moving our production workload to Prefect.

Christopher Boyd

02/14/2023, 3:30 PM
I discussed internally with the team. The `timeout_seconds` is locally evaluated, meaning when the flow is retried, the `timeout_seconds` is relative to that run. I think there is some discussion long term to make this a cloud-side value, so it’s persisted through retries. Is the core issue the restart / de-sync behavior, or the timeout issue? Timing out your flow here would be addressing the symptom, not the cause; the cause is the linked issue (7116), which is the agent becoming de-synced from the flow run

Jean-Michel Provencher

02/14/2023, 3:31 PM
Yes, but it would be nice to have a fail-safe that guarantees a flow will not stay in a Running state for more than x minutes, no matter which Prefect bug is causing it.
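(The fail-safe asked for here could also be built as an external watchdog that periodically sweeps for stale Running runs and force-fails them. A minimal stdlib sketch of the sweep logic only — the actual state reads/writes would go through the Prefect API, and every name below is hypothetical:)

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class FlowRun:
    # Hypothetical stand-in for a Prefect flow-run record.
    id: str
    state: str
    start_time: datetime

def sweep_stale_runs(runs, max_running, now=None):
    """Return the ids of runs that should be force-failed.

    A run qualifies if it is still Running and started more than
    `max_running` ago -- regardless of which bug got it stuck.
    """
    now = now or datetime.now(timezone.utc)
    return [
        run.id
        for run in runs
        if run.state == "Running" and now - run.start_time > max_running
    ]

now = datetime.now(timezone.utc)
runs = [
    FlowRun("a", "Running", now - timedelta(minutes=90)),
    FlowRun("b", "Running", now - timedelta(minutes=5)),
    FlowRun("c", "Completed", now - timedelta(minutes=90)),
]
print(sweep_stale_runs(runs, timedelta(minutes=60)))  # -> ['a']
```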

Christopher Boyd

02/14/2023, 3:52 PM
If you have an available example / MRE and would like to raise this issue, that would be helpful. I have not yet reproduced / created the issue