# prefect-server
Hi, for some reason, for the past week I have been getting task runs that stay stuck for several days. Even hitting the cancel button in the UI does not help, and I have to set the state to “cancelled” manually.
• Is there any way to set a timeout for tasks globally?
• Is there also a way to avoid retrying task runs that are late?
Basically, none of my flows are too critical, and I simply want to cancel them if they stay stuck too long and not retry. (The problem now is that after a few tasks get stuck, at some point I run out of memory and the server crashes.)
I think it might be worth trying to find out why the flow runs get repeatedly stuck.
1. Did you check the logs?
2. Was there something in the logs about the flow’s heartbeat being lost? Usually, when a flow run is stuck in a Running state, it might be a flow heartbeat issue. This thread explains the issue and shows some possible solutions you may try.
3. What infrastructure is it running on? Is your flow run doing some long-running job on Kubernetes/Databricks etc.?
4. Is your flow run memory intensive, and could it be running out of memory?
You can set a timeout on your task decorator:
timeout (Union[int, timedelta], optional): The amount of time (in seconds) to wait while
            running this task before a timeout occurs; note that sub-second
            resolution is not supported, even when passing in a timedelta.
To set the same globally, this syntax should work:
export PREFECT__TASKS__DEFAULTS__TIMEOUT=3600 # in seconds
Regarding late runs, perhaps you could try to implement a state handler similar to this.
Thanks a lot @Anna Geller.
1. It always gets stuck on the same sort of task (extraction), but for several days, so I am not sure why i.e.
wouldn’t time out and make the task fail
2. Thanks for the hint, I’ll check that
3. It’s on a simple VM (in the process of migrating) and I am not doing any long-running jobs
4. Not really
I actually removed retries from my tasks and will set the timeout globally now, thanks!
Are you sure this is available as a global setting though? The config only shows the two other options as default.
Good catch, I'm not sure about it at all. Timeouts are hard to do in general because it can be hard to stop a task run that, for example, gets executed on a remote Dask cluster in Kubernetes. So if possible, I would try to find out the root cause of the flow runs being stuck rather than trying to kill them through timeouts or another backend mechanism. Did you try the suggestions from the shared thread?
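To illustrate why a timeout does not necessarily stop the underlying work, here is a minimal local sketch using only Python's standard library. It is an analogy, not Prefect's implementation: the helper name `run_with_timeout` and the durations are made up for this example. The caller stops waiting after the timeout, but the worker thread keeps running until it finishes on its own.

```python
import concurrent.futures
import threading
import time


def run_with_timeout(fn, timeout_seconds):
    """Wait up to timeout_seconds for fn() and report whether the wait timed out.

    Hypothetical helper for illustration. Returns (timed_out, future) so the
    caller can check on the worker afterwards.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn)
    try:
        future.result(timeout=timeout_seconds)
        timed_out = False
    except concurrent.futures.TimeoutError:
        # Only the wait is abandoned here; the thread itself is not stopped.
        timed_out = True
    executor.shutdown(wait=False)
    return timed_out, future


def demo():
    finished = threading.Event()

    def slow_task():
        time.sleep(0.3)  # stands in for a long-running extraction
        finished.set()

    timed_out, future = run_with_timeout(slow_task, 0.05)
    print("timed out:", timed_out)
    print("still running after timeout:", not future.done())
    future.result(timeout=2)  # the work completes anyway
    print("eventually finished:", finished.is_set())


if __name__ == "__main__":
    demo()
```

Work running in another process or on a remote cluster has the same problem in a harder form, which is why finding the root cause is usually the safer fix.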
Ok got you. Did not try it yet because my heartbeats seem to work as far as I can tell. It’s more of an organizational/project thing but I think I’ll stick with strict timeouts for now until I migrate to a cleaner infra (k8s) - very soon
Thanks again!
@Anna Geller just wanted to confirm that setting the timeout globally actually doesn’t work. Only the two available settings under
Do you think it would be possible to include
as a global option? I could send a PR at some point.
Hard to say. Feel free to submit a PR or open an issue, and we can investigate together with you and other engineers.
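Since the global timeout setting did not take effect, one possible workaround is to pre-fill the timeout in a project-level wrapper around the task decorator. The sketch below is an assumption-laden illustration, not a Prefect feature: `with_default_timeout` and `record_options` are hypothetical names, and `record_options` is only a stand-in decorator so the example runs without Prefect installed. It assumes the real decorator accepts both `@task` and `@task(timeout=...)` call shapes, as the Prefect 1.x `task` decorator does.

```python
import functools


def with_default_timeout(task_decorator, timeout_seconds):
    """Return task_decorator with timeout= pre-filled; callers can still override it."""
    return functools.partial(task_decorator, timeout=timeout_seconds)


def record_options(fn=None, **options):
    """Stand-in for a task decorator (hypothetical, for illustration only).

    Supports both `@decorator` and `@decorator(**options)` and attaches the
    options to the decorated function so they can be inspected.
    """
    def wrap(func):
        func.options = options
        return func

    return wrap if fn is None else wrap(fn)


# In a Prefect 1.x project this would presumably be:
#   from prefect import task as prefect_task
#   task = with_default_timeout(prefect_task, 3600)
task = with_default_timeout(record_options, 3600)


@task
def extract():
    return "rows"


@task(timeout=60, max_retries=0)  # explicit values override the default
def transform():
    return "clean rows"


if __name__ == "__main__":
    print(extract.options)
    print(transform.options)
```

Importing this wrapped `task` everywhere instead of the library's decorator keeps the default in one place, at the cost of every flow depending on a small project module.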