Hi all -- a fun one. I have noticed some of my flows failing with this error:
Task run '511abff8-faa3-4efc-94e6-f4be435db16e' received abort during orchestration: This run cannot transition to the RUNNING state from the RUNNING state. Task run is in RUNNING state
As far as I can tell it is inconsistent (the workflow sometimes works, sometimes does not).
I am running a DaskTaskRunner backed by a SLURMCluster. The stage that is crashing is trying to read in a set of large-ish files, and I believe the GIL is not being released while the data is being accessed. At the moment there is a high load on the disk I/O, and my best guess is that the dask nanny is somehow failing a health check, and Prefect in turn is surfacing a roundabout error like this.
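One rough way to check the GIL part of that guess outside of Prefect and Dask is to run the suspect read while a background thread "heartbeats", then look at the largest gap between beats. This is only a sketch using the standard library: `max_heartbeat_gap` is a hypothetical helper name, not anything from Prefect or Dask, and the 10 ms interval is an arbitrary assumption. A gap much larger than the interval would suggest the call is holding the GIL; it would not by itself prove that a nanny health check is what fails.

```python
import threading
import time


def max_heartbeat_gap(blocking_call, interval=0.01):
    """Run blocking_call in this thread while a background thread records
    timestamps every `interval` seconds; return the largest gap seen.

    Hypothetical diagnostic helper, not part of Prefect or Dask.
    """
    beats = []
    stop = threading.Event()

    def beat():
        # Record a timestamp, then wait; the loop can only proceed when
        # this thread is able to acquire the GIL.
        while not stop.is_set():
            beats.append(time.monotonic())
            stop.wait(interval)

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    time.sleep(interval * 3)  # let a few beats land before the call starts
    try:
        blocking_call()
    finally:
        stop.set()
        worker.join()
    beats.append(time.monotonic())  # closes any trailing gap

    if len(beats) < 2:
        return 0.0
    return max(later - earlier for earlier, later in zip(beats, beats[1:]))


if __name__ == "__main__":
    # time.sleep releases the GIL, so the heartbeat should keep its rhythm.
    print("sleep:", max_heartbeat_gap(lambda: time.sleep(0.5)))
    # One long call into C code that does not release the GIL.
    print("sum:  ", max_heartbeat_gap(lambda: sum(range(30_000_000))))
```

Swapping the lambda for the actual file-reading call (on a worker node, under the same disk load) would show whether that read starves other threads in the same process.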
Any ideas that make more sense?
Yaron Levi
02/16/2023, 11:59 AM
Please look at this question, and the reply I’ve made: