# ask-community
s
Hi all, looking for some help debugging here. I have a task which finishes fine when run locally, and the logs indicate it finished fine when run through an agent, but the task itself is stuck in Running. It's a simple flow (it turns a tiny pandas dataframe into a JSON object), and I make a call to the logger directly before the return statement. I see my log statement that it all worked, but the task state doesn't change. Is there a good way of debugging what on earth might be going on? The odd thing is that the task that never finishes is a common task (I download 4 things and process them all in the same way). 2 of them finish, every time. 2 of them get stuck in Running, without fail.
EDIT: I seem to have found a potential bug in Prefect. These hang-ups occur when I have
@task(timeout=20)
, but they just completed successfully with a normal
@task
annotation. Will update prefect now and check to see if that helps
For context, here's the task:
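(The original snippet isn't reproduced here, so the following is only a rough sketch of a task in that shape; the function name and argument are assumptions. It converts a tiny pandas DataFrame to JSON, logs right before returning, and carries the 20-second timeout that seems to trigger the hang.)

```python
import pandas as pd
import prefect
from prefect import task


@task(timeout=20)
def to_json(df: pd.DataFrame) -> str:
    # Convert the small DataFrame into a JSON string.
    payload = df.to_json(orient="records")
    # This log line shows up in the agent logs even when the task
    # never leaves the Running state.
    prefect.context.get("logger").info("Converted %d rows to JSON", len(df))
    return payload
```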
And here's the task stuck in Running:
z
Hey @Samuel Hinton -- do you think you could share a minimal reproducible example? What executor are you using & what OS is your agent running on?
s
It's running on Ubuntu at the moment, with the local Dask executor, and currently on Prefect 0.14.5. Going through the changelog at https://docs.prefect.io/api/latest/changelog.html, I notice the TimeoutError is mentioned in 0.14.7.
I'll update now and try to put together a minimal example if it's still being annoying.
z
Hmm, I'm not sure that should affect you, but being on the latest version will be helpful! https://github.com/PrefectHQ/prefect/issues/4091
s
@Zanie here is a reproduction, now on Prefect 0.14.11. Some of these timed out correctly (though honestly I'm not sure why any of them should have timed out), but others (see image) didn't and run forever, despite the timeout. Local Dask executor, launched on a Docker agent, Ubuntu operating system; the flows run perfectly if I remove the timeout. It runs instantly when I run it locally via
flow = get_flow()
and
flow.run()
(even with the timeouts)
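(A hedged reconstruction of what that reproduction might look like, with the task and flow names assumed: a couple of `@task(timeout=20)` tasks on a LocalDaskExecutor, which completes instantly with `flow.run()` but can hang behind an agent.)

```python
import pandas as pd
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor


@task(timeout=20)
def make_df() -> pd.DataFrame:
    # A dummy DataFrame standing in for the downloaded data.
    return pd.DataFrame({"a": [1, 2, 3]})


@task(timeout=20)
def to_json(df: pd.DataFrame) -> str:
    return df.to_json(orient="records")


def get_flow() -> Flow:
    with Flow("timeout-repro") as flow:
        to_json(make_df())
    flow.executor = LocalDaskExecutor()  # thread scheduler by default
    return flow


if __name__ == "__main__":
    # Running locally finishes immediately, even with the timeouts;
    # the hangs only appear when the flow runs through the agent.
    get_flow().run()
```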
z
Thanks for the reproduction! Just wondering, how is this running on a docker agent without a flow.storage or flow.run_config?
s
Ah I left those details out - I have multiple flows and a
manager.py
goes through them all, collects the flows, and assigns the run config, scheduler, bucket, etc:
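(Roughly what such a manager.py might look like; the storage bucket, Docker image, project name, and the get_all_flows() helper are all assumptions for illustration.)

```python
# manager.py -- collect every flow and attach storage, run config,
# executor, and schedule before registering (sketch; names assumed).
from prefect.executors import LocalDaskExecutor
from prefect.run_configs import DockerRun
from prefect.schedules import CronSchedule
from prefect.storage import S3

from flows import get_all_flows  # hypothetical helper returning Flow objects

for flow in get_all_flows():
    flow.storage = S3(bucket="my-prefect-flows")              # assumed bucket
    flow.run_config = DockerRun(image="my-org/flows:latest")  # assumed image
    flow.executor = LocalDaskExecutor()
    flow.schedule = CronSchedule("*/30 * * * *")  # e.g. every half hour
    flow.register(project_name="my-project")      # assumed project name
```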
@Zanie - overnight I scheduled the test flow to run every half hour. I registered two versions, one with and one without the timeouts. I had a few tasks that were stuck in Running in the morning (6 hours on that process task to return the JSON dataframe), but most of them succeeded. However, even for the flows that succeeded, those with the timeout took consistently longer than those without. You can see two screenshots below (timeout vs no timeout, identical tasks). Tasks with the timeout consistently take many times longer than those without. The task that just generates a dummy dataframe and returns it takes ~0.25 seconds without the timeout and ~2 seconds with it. Do you know why this might be?
z
With a timeout, there needs to be a supervising process to enforce the timeout, and that overhead will generally increase task runtime a bit. The executor you are using, combined with the system you are on, determines what kind of supervision mechanism we have to use; some of them have faster startup times and are more reliable than others.
I'm not sure why some of your tasks are hanging, people use timeouts often without issue. I'll have to try to replicate your exact runtime environment (which is why I needed the run_config/executor details).
Could you try switching your executor to use "processes" instead of "threads"?
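(For concreteness, a minimal sketch of that switch: LocalDaskExecutor takes a scheduler argument, and "processes" uses a process pool instead of the default thread pool.)

```python
from prefect import Flow, task
from prefect.executors import LocalDaskExecutor


@task(timeout=20)
def say_hello() -> str:
    return "hello"


with Flow("processes-example") as flow:
    say_hello()

# Run tasks in separate processes rather than threads; the timeout is
# then enforced with a different supervision mechanism.
flow.executor = LocalDaskExecutor(scheduler="processes")
```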
s
I'll give it a shot when I'm back in the office for sure, will let you know if that changes things 🙂
z
Unfortunately I could not reproduce this in a test (https://github.com/PrefectHQ/prefect/pull/4217) although it uses a shorter timeout. The code that's being used to call
your_task.run()
is https://github.com/PrefectHQ/prefect/blob/master/src/prefect/utilities/executors.py#L184 -- it may be useful to try testing your function in isolation to see what's going on
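(One simple way to test the function in isolation, outside the timeout wrapper linked above, is to call the task's underlying `.run()` function directly; the task name here matches the earlier sketch and is an assumption.)

```python
import pandas as pd

# .run() invokes the decorated function directly, with no executor,
# no state handling, and no timeout supervision involved.
df = pd.DataFrame({"a": [1, 2, 3]})
print(to_json.run(df))
```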
s
Just reporting back that since swapping to processes I've seen much better behaviour and execution times, so thanks a ton for the tip.
z
Glad that worked! Timeouts are tricky.
🙏 1