Has anyone got spurious CancelledError:s when runn...
# ask-community
h
Has anyone got spurious CancelledErrors when running with Dask? Flow ID:
e2631f0e-2987-405f-bf3b-3fbb46acf90c
Might be from missing heartbeats from one of the nodes?
a
I think you’re right that this might happen with long-running flow runs that have lost the flow’s heartbeat. The concurrent.futures._base.CancelledError can result from a long-running step in a computation that produces no output (logging or otherwise) to the Client. In these cases, due to the lack of interaction with the client, the scheduler regards itself as “idle” and times out after the configured cloudprovider.ecs.scheduler_timeout period, which defaults to 5 minutes. The CancelledError message is misleading, but if you look in the logs for the scheduler task itself it will record the idle timeout. The solution is to set scheduler_timeout to a higher value, either via config or by passing it directly to your cluster class constructor. The answer is stolen from here.
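For what it's worth, here is a rough sketch of both options, assuming the cluster is a dask-cloudprovider ECSCluster. The 30-minute value and the helper names (cluster_kwargs, build_cluster_via_constructor, build_cluster_via_config) are made up for illustration, so adjust them to your setup:

```python
# Hypothetical value: pick something longer than your longest "silent" step.
SCHEDULER_TIMEOUT = "30 minutes"


def cluster_kwargs(timeout=SCHEDULER_TIMEOUT):
    """Keyword arguments to hand to the cluster class constructor."""
    return {"scheduler_timeout": timeout}


def build_cluster_via_constructor(timeout=SCHEDULER_TIMEOUT):
    """Option 1: pass scheduler_timeout directly to the constructor."""
    # Imported lazily: needs dask-cloudprovider installed and AWS credentials.
    from dask_cloudprovider.aws import ECSCluster

    return ECSCluster(**cluster_kwargs(timeout))


def build_cluster_via_config(timeout=SCHEDULER_TIMEOUT):
    """Option 2: set the config key before the cluster is created."""
    import dask
    from dask_cloudprovider.aws import ECSCluster

    # The key is the one mentioned above; it must be set before construction.
    dask.config.set({"cloudprovider.ecs.scheduler_timeout": timeout})
    return ECSCluster()
```

Either way, the scheduler task logs should still show whether an idle timeout actually fired, so it is worth checking them after raising the value.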
h
No, I don't think this is the case.
This happened after 1 minute and 30 seconds when, just like in https://github.com/dask/distributed/issues/2628, the scheduler, workers, and Prefect agent were all newly deployed.
a
This issue is quite old so it would be surprising to me if this is related, but even then, the conclusion was “ultimately the error arises from sharing futures between clients and is therefore not a bug” - is this what you are doing?
h
I don't know how Prefect does it, maybe it does, maybe not. I'm not touching the client in my own code, it's just plain tasks.
"but if you look in the logs for the scheduler task itself it will record the idle timeout."
There's no timeout in the scheduler logs for me.
This time I got it in the "finish" task that collects all the results and makes them available, after 1h 57m.
I promise I have not pressed cancel 😅
a
@haf have you been able to solve your problem? Or at least reproduce it? I have the same problem and I have no clue what the cause is.
h
@Ali Abdelmotalib No, I had to re-implement the service in pure Python to get around these issues, and the cancelled error might be different from https://prefect-community.slack.com/archives/CL09KU1K7/p1638014722400800