Prefect Community #prefect-server: task runs that stay stuck for several days

Pierre Monico

01/17/2022, 10:09 AM

Hi! For about a week I have been getting task runs that stay stuck for several days. Hitting the cancel button in the UI does not help, and I have to set the state to “Cancelled” manually.
• Is there any way to set a timeout for tasks globally?
• Is there a way to also avoid retrying task runs that are late?

Basically none of my flows are critical, so I simply want to cancel them if they stay stuck too long, without retrying. (The problem now is that once a few tasks get stuck, I eventually run out of memory and the server crashes.)

Anna Geller

01/17/2022, 10:26 AM

I think it might be worth trying to find out why the flow runs repeatedly get stuck.
1. Did you check the logs?
2. Was there something in the logs about the flow’s heartbeat being lost? Usually, when a flow run is stuck in a Running state, it might be a heartbeat issue. This thread explains the issue and shows some possible solutions you may try.
3. What infrastructure is it running on? Is your flow run doing some long-running job on Kubernetes, Databricks, etc.?
4. Is your flow run memory intensive, so that it could run out of memory?

You can set a timeout on your task decorator:

timeout (Union[int, timedelta], optional): The amount of time (in seconds) to wait while
            running this task before a timeout occurs; note that sub-second
            resolution is not supported, even when passing in a timedelta.

To set the same globally, this syntax should work:

export PREFECT__TASKS__DEFAULTS__TIMEOUT=3600 # in seconds

Regarding late runs, perhaps you could try to implement a state handler similar to this.

Pierre Monico

01/17/2022, 10:42 AM

Thanks a lot @Anna Geller.
1. It always gets stuck on the same sort of task (extraction), but for several days, so I am not sure why, e.g., `requests` wouldn’t time out and make the task fail.
2. Thanks for the hint, I’ll check that.
3. It’s on a simple VM (in the process of migrating) and I am not doing any long-running jobs.
4. Not really.
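A note on point 1: the `requests` library does not apply a timeout unless one is passed explicitly (for example `requests.get(url, timeout=10)`), so a call to an endpoint that never answers can block indefinitely. The sketch below illustrates the same behaviour with only the standard library, so it runs without network access. The silent local server and the helper names `start_silent_server` and `fetch_with_timeout` are made up for this illustration and are not part of the original flows.

```python
import socket
import threading


def start_silent_server():
    """Listen on a free local port, accept one connection, and never reply."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    held = []  # keep the accepted connection open so the client never sees EOF

    def accept_once():
        conn, _ = server.accept()
        held.append(conn)

    threading.Thread(target=accept_once, daemon=True).start()
    return server, held


def fetch_with_timeout(port, timeout_s):
    """Return 'ok' if any data arrives in time, 'timed out' otherwise."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as client:
        client.sendall(b"ping")
        try:
            client.recv(1024)  # blocks until data arrives or the timeout expires
            return "ok"
        except socket.timeout:
            return "timed out"


server, held = start_silent_server()
port = server.getsockname()[1]
result = fetch_with_timeout(port, timeout_s=0.5)
print(result)

# Clean up the sockets used by the demonstration.
for conn in held:
    conn.close()
server.close()
```

Without the `timeout` argument, the `recv` call above would wait for as long as the server keeps the connection open, which is one plausible way an extraction task could stay stuck for days.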

Pierre Monico

01/17/2022, 10:42 AM

I actually removed retries from my tasks and will set the timeout globally now, thanks!

👍 1

Pierre Monico

01/17/2022, 10:45 AM

Are you sure this is available as a global setting though? The config only shows the two other options as default.

Anna Geller

01/17/2022, 10:51 AM

Good catch, I'm not sure about it at all. Timeouts are hard to do in general because it can be difficult to stop a task run that, e.g., gets executed on a remote Dask cluster in Kubernetes. So if possible, I would try to find the root cause of the flow runs being stuck rather than trying to kill them through timeouts or another backend mechanism. Did you try the suggestions from the shared thread?
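To illustrate why timeouts are hard to enforce, here is a minimal standard-library sketch. It is not how Prefect implements task timeouts; it only shows the general problem that giving up on a piece of work is not the same as stopping it. The function `slow_task` and the variable names are hypothetical.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

finished = threading.Event()


def slow_task():
    """Stand-in for a task that takes longer than the allowed time."""
    time.sleep(0.5)
    finished.set()
    return "done"


executor = ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_task)

try:
    future.result(timeout=0.1)  # the caller stops waiting after 0.1 seconds
    timed_out = False
except FutureTimeout:
    timed_out = True

# The caller has given up, but the worker thread is still busy.
still_running = not finished.is_set()
# A future that is already running cannot be cancelled.
cancelled = future.cancel()

# Waiting for shutdown shows that the work ran to completion regardless.
executor.shutdown(wait=True)
completed_anyway = finished.is_set()

print(timed_out, still_running, cancelled, completed_anyway)
```

The timeout only frees the caller. The work itself keeps consuming resources until it ends on its own, and the situation gets harder when the work runs in another process or on a remote cluster.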

Pierre Monico

01/17/2022, 11:02 AM

Ok got you. Did not try it yet because my heartbeats seem to work as far as I can tell. It’s more of an organizational/project thing but I think I’ll stick with strict timeouts for now until I migrate to a cleaner infra (k8s) - very soon

👍 1

Pierre Monico

01/17/2022, 11:02 AM

Thanks again!

Pierre Monico

01/31/2022, 10:23 AM

@Anna Geller just wanted to confirm that setting the timeout globally actually doesn’t work. Only the two settings available under `tasks.defaults` in `config.toml` work.
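Since the global setting has no effect, one possible workaround is to bake the defaults into a project-wide decorator with `functools.partial`. The sketch below uses a hypothetical stand-in named `fake_task` so that it runs without Prefect installed. It assumes the real decorator accepts keyword arguments such as `timeout` (shown in the docstring earlier in the thread) and `max_retries` (an assumption here), and that it can be used both bare and with arguments. Under those assumptions, `fake_task` could be swapped for `prefect.task`.

```python
import functools


def fake_task(fn=None, **options):
    """Hypothetical stand-in for a decorator usable as @d or @d(**kwargs)."""
    if fn is None:
        # Called with keyword arguments only: return the actual decorator.
        return lambda f: fake_task(f, **options)
    fn.options = options  # record the settings so they can be inspected
    return fn


# Project-wide defaults: one-hour timeout and no retries.
task_with_defaults = functools.partial(fake_task, timeout=3600, max_retries=0)


@task_with_defaults
def extract():
    return "rows"


@task_with_defaults(timeout=60)  # a per-task value overrides the default
def quick_check():
    return "ok"


print(extract.options)
print(quick_check.options)
```

Keyword arguments given at the call site take precedence over those stored in the partial, so individual tasks can still opt out of the default.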

Pierre Monico

01/31/2022, 10:23 AM

Do you think it would be possible to include `timeout` as a global option? I could send a PR at some point.

Anna Geller

01/31/2022, 10:32 AM

Hard to say, feel free to submit a PR or open an issue and we can investigate together with you and other engineers

👍 1
