
Ton Steijvers

02/10/2023, 4:04 PM
I posted this a while ago about `timeout_seconds` not working properly. I expect that after the timeout elapses and the flow is still in a Running state, the flow should be marked as failed. However, if the agent is not able to talk to the flow any more (maybe the pod got evicted) then its state will not be updated any more. Is this expected behaviour or can I log an issue on this one?
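(For context: in Prefect 2, `timeout_seconds` is set on the flow, e.g. `@flow(timeout_seconds=3600)`. The failure mode described above comes down to the timeout being enforced by whatever process is watching the run. A rough stdlib sketch of that property — all names here are illustrative, not Prefect's implementation:)

```python
import concurrent.futures
import time

def slow_work():
    # Stand-in for flow code that outlives its timeout.
    time.sleep(0.5)
    return "done"

def monitor(fn, timeout_seconds):
    """Enforce a timeout from the monitoring process.

    The key property: the Failed state is only ever written by this
    monitor. If the monitor loses contact with the work (or dies),
    nothing else marks the run as failed -- it stays "Running",
    which is the behaviour reported above.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            result = future.result(timeout=timeout_seconds)
            return ("Completed", result)
        except concurrent.futures.TimeoutError:
            return ("Failed", None)

print(monitor(slow_work, 0.1))  # -> ('Failed', None)
```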

Christopher Boyd

02/10/2023, 4:10 PM
Hi Ton - do you mean if the flow started running, but then lost communication with the agent? Or if the flow never started running?
The former, we do have an open issue for that I can link shortly; the latter should be configurable with job and pod timeouts

Ton Steijvers

02/10/2023, 4:33 PM
Well basically any situation where the flow will remain in Running state for more than the specified timeout period. Most of the time this is because of communication loss between agent and flow. Since I never want lingering flows in a Running state I'd like to add timeout_seconds as a safety valve.

Christopher Boyd

02/10/2023, 5:05 PM
Ok so you’re asking more for like a “don’t run more than this time” so you’re not racking up usage?

Ton Steijvers

02/10/2023, 5:17 PM
Isn't that exactly what `timeout_seconds` is supposed to do?

Christopher Boyd

02/12/2023, 4:10 PM
I think this isn’t quite the behavior you’d anticipate here at the current moment. `timeout_seconds` will cause the flow to be marked as failed, not necessarily terminate the pod
But then if the flow terminates early in k8s it will be restarted
I think some additional logic / handling might be necessary here to terminate at the infra level
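(One way to get that infra-level termination, assuming the flow runs as a plain Kubernetes Job — this is standard Kubernetes, not something Prefect configures for you: `activeDeadlineSeconds` kills the pod after a wall-clock budget, and `backoffLimit: 0` with `restartPolicy: Never` prevents the restart behavior described above. A sketch with illustrative names:)

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: my-flow-run                 # illustrative name
spec:
  activeDeadlineSeconds: 3600       # kill the pod at the infra level after 1h
  backoffLimit: 0                   # do not retry the pod on failure
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: flow
          image: my-prefect-image:latest   # illustrative image
```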

Ton Steijvers

02/13/2023, 8:05 AM
_"Timeout_seconds will cause the flow to be marked as failed"_ that's what is currently not happening. Flow remains in Running state long after timeout_seconds has elapsed (likely because it lost communication with the pod).

Christopher Boyd

02/13/2023, 1:56 PM
Gotcha, I’ll look into this and raise a bug

Ton Steijvers

02/14/2023, 3:15 PM
@Jean-Michel Provencher

Jean-Michel Provencher

02/14/2023, 3:15 PM
Hello @Christopher Boyd, any link for that bug so that we can track it on our end? It's basically the biggest blocker keeping us from moving our production workload to Prefect.

Christopher Boyd

02/14/2023, 3:30 PM
I discussed internally with the team. The `timeout_seconds` is locally evaluated, meaning when the flow is retried, the `timeout_seconds` is relative to that run. I think there is some discussion long term to make this a cloud-side value, so it’s persisted through retries. Is the core issue the restart / de-sync behavior, or the timeout issue? Timing out your flow here would be addressing the symptom, not the cause; the cause is the linked issue (7116), which is the agent becoming de-synced from the flow run

Jean-Michel Provencher

02/14/2023, 3:31 PM
Yes, but it would be nice to have a fail-safe that guarantees a flow will not stay in a Running state for more than x minutes, no matter which Prefect bug is causing it.
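(The fail-safe asked for here could also be built as an external watchdog that periodically sweeps for stale Running runs and force-fails them. A minimal stdlib sketch of the sweep logic only — the actual state reads/writes would go through the Prefect API, and every name below is hypothetical:)

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class FlowRun:
    # Hypothetical stand-in for a Prefect flow-run record.
    id: str
    state: str
    start_time: datetime

def sweep_stale_runs(runs, max_running, now=None):
    """Return the ids of runs that should be force-failed.

    A run qualifies if it is still Running and started more than
    `max_running` ago -- regardless of which bug got it stuck.
    """
    now = now or datetime.now(timezone.utc)
    return [
        run.id
        for run in runs
        if run.state == "Running" and now - run.start_time > max_running
    ]

now = datetime.now(timezone.utc)
runs = [
    FlowRun("a", "Running", now - timedelta(minutes=90)),
    FlowRun("b", "Running", now - timedelta(minutes=5)),
    FlowRun("c", "Completed", now - timedelta(minutes=90)),
]
print(sweep_stale_runs(runs, timedelta(minutes=60)))  # -> ['a']
```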

Christopher Boyd

02/14/2023, 3:52 PM
If you have an available example / MRE and would like to raise this issue, that would be helpful. I have not yet reproduced / created the issue