Tim Galvin
02/16/2023, 6:30 AM
Task run '511abff8-faa3-4efc-94e6-f4be435db16e' received abort during orchestration: This run cannot transition to the RUNNING state from the RUNNING state. Task run is in RUNNING state
As far as I can tell the failure is intermittent (the workflow sometimes works, sometimes does not).
I am running a DaskTaskRunner backed by a SLURMCluster. The stage that is crashing reads in a set of large-ish files, and I believe the GIL is not being released while the data is being accessed. At the moment there is a high load on the disk I/O, and my best guess is that the dask nanny is somehow failing a health check, and Prefect in turn is raising some roundabout error like this.
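One way to test the GIL theory outside of Prefect and Dask might be a small ticker thread that wakes every few milliseconds and records the longest gap between wake-ups while the read runs. This is only a rough sketch using the standard library: `measure_max_gap` is a hypothetical helper, not something from the workflow or from either library, and the `work` callable stands in for whatever function reads the files.

```python
import threading
import time


def measure_max_gap(work, interval=0.01):
    """Run work() while a ticker thread records the largest gap between its wake-ups.

    Hypothetical diagnostic helper: a gap far larger than `interval` suggests
    that something in work() kept other threads from running, for example by
    holding the GIL.
    """
    gaps = []
    stop = threading.Event()

    def ticker():
        last = time.perf_counter()
        while not stop.is_set():
            time.sleep(interval)  # sleeping releases the GIL
            now = time.perf_counter()
            gaps.append(now - last)  # time actually elapsed since the last wake-up
            last = now

    thread = threading.Thread(target=ticker, daemon=True)
    thread.start()
    try:
        result = work()
    finally:
        stop.set()
        thread.join()
    return result, max(gaps, default=0.0)
```

Calling `measure_max_gap(lambda: read_my_files(paths))`, where `read_my_files` is a placeholder for the real loader, returns the loader's result together with the worst gap in seconds. A gap close to `interval` would point away from the GIL; a gap close to the full read time would be consistent with background threads (such as ones sending heartbeats) being starved.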
Any ideas that make more sense?
Yaron Levi
02/16/2023, 11:59 AM