
Tim Galvin

11/24/2022, 12:59 PM
Hi all -- has anyone seen an error like this before?
Encountered exception during execution:
Traceback (most recent call last):
  File "/software/projects/askaprt/tgalvin/setonix/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/engine.py", line 612, in orchestrate_flow_run
    waited_for_task_runs = await wait_for_task_runs_and_report_crashes(
  File "/software/projects/askaprt/tgalvin/setonix/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/engine.py", line 1317, in wait_for_task_runs_and_report_crashes
    if not state.type == StateType.CRASHED:
AttributeError: 'coroutine' object has no attribute 'type'
I am running a known version of my workflow on a known dataset, which has worked perfectly fine dozens of times before. It seems to be saying that the `state` above is not an Orion model -- rather a coroutine. All my tasks are using the normal `task` decorator around normal non-async Python functions.
There might be some intermittent errors with our Lustre filesystem as it is under heavy load, but I am totally perplexed that it could express itself like this. In this setup I have just set the `PREFECT_API_URL` variable to point to my self-hosted Orion server. Should I also be setting anything related to URL or DB timeouts, do you think?

Ryan Peden

11/24/2022, 7:01 PM
I've seen it a couple of times recently from users running tasks with Dask. I'm trying to reproduce and investigate more today. Are you using the Dask task runner?

Tim Galvin

11/25/2022, 7:39 AM
Yes, I am running this with a Dask task runner, which internally uses `dask_jobqueue.SLURMCluster` to fire up the dask-workers on the Slurm resource. I think I made a connection -- it seems like one of the Dask workers in my set is being killed with a SIGBUS error. I am running this on a fairly new cluster, so it might just be some teething issues coming from that.

Ryan Peden

11/26/2022, 1:15 AM
Hi Tim, I did a bit of testing and noticed that I was getting this error when an exception on a remote Dask worker prevented a task from running. For me, the task couldn't load because the Dask worker node was missing dependencies the task needed to run (in my case, Prefect itself was the missing dependency). Once I made sure Prefect was installed on the worker nodes, the problem went away. This only seems to apply to exceptions that happen before the task runs. Exceptions that occur inside a running task should still work the way they normally do; Prefect catches them and reports the failed task and flow runs.
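The missing-dependency scenario described above is straightforward to check for. A plain-stdlib helper like the one below can be shipped to every worker with distributed's `Client.run`, which executes a function on each worker and returns the results keyed by worker address; the `Client` address and the usage shown in the comments are assumptions about your cluster, not from the thread.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` is importable in the current interpreter."""
    return importlib.util.find_spec(name) is not None


# On the scheduler side, you could then run something like:
#   from distributed import Client
#   client = Client("tcp://scheduler-address:8786")  # your cluster here
#   print(client.run(has_module, "prefect"))
# which maps each worker address to True/False, making a worker with a
# missing Prefect install easy to spot.
```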

Tim Galvin

11/26/2022, 3:14 AM
That is a nice write-up. I am reasonably certain that Prefect is in the runtime environment of the Dask workers. I went reading through some Slurm logs and think I found my root cause. One of two things seems to happen:
• The set of Dask workers within a single Slurm job consumes more memory than I expect, so one among the set will pause itself. When this happens, Prefect tasks assigned to it get put into a strange state and give rise to this error (or a similar one).
• The real killer -- some of my jobs are running into a SIGBUS error outside of the Python dask-worker, killing the Python code outright. I am trying to run this code on a new HPC cluster that is still experiencing setup issues, I think.
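As an aside, the worker-pausing behaviour in the first bullet is governed by distributed's worker memory thresholds. A sketch of setting them explicitly via `dask.config` follows; the keys are real distributed configuration keys, and the fractions shown are simply the library's documented defaults, so treat this as a configuration fragment to tune rather than a fix.

```python
# Sketch: the fractions of a worker's memory limit at which dask
# spills, pauses, or terminates the worker. Assumes dask/distributed
# are installed; values shown are the documented defaults.
import dask

dask.config.set({
    "distributed.worker.memory.target": 0.60,     # start spilling to disk
    "distributed.worker.memory.spill": 0.70,      # spill more aggressively
    "distributed.worker.memory.pause": 0.80,      # pause task execution
    "distributed.worker.memory.terminate": 0.95,  # nanny kills the worker
})
```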

Ryan Peden

11/26/2022, 4:01 AM
That makes sense; that sounds like the kind of problem I'd expect to cause the exception you saw. I've found where the issue is in Prefect's Dask task runner, at least, and we are working on a fix. Once that is done, Prefect should handle the error more gracefully and provide a more informative error message than `'coroutine' object has no attribute 'type'` when something goes wrong in a Dask worker.

Tim Galvin

11/26/2022, 6:15 AM
Thanks a lot for that @Ryan Peden -- once again I am super pleased with the community involvement of the Prefect devs! It took me a while to make the connection between my Slurm woes and this error. Once there is a patch, would you be kind enough to give me a heads up? 🙂

Ryan Peden

12/02/2022, 12:59 AM
Hi Tim! Just following up to say you're welcome, and to let you know that the PR fixing this issue was merged, and we just released version 0.2.2 of prefect-dask, which includes the fix.