I think you’re right that this might happen with long-running flow runs that have lost the flow’s heartbeat.
The `concurrent.futures._base.CancelledError` can result from a long-running step in a computation that produces no output (logging or otherwise) to the `Client`. In these cases, due to the lack of interaction with the client, the scheduler regards itself as “idle” and times out after the configured `cloudprovider.ecs.scheduler_timeout` period, which defaults to 5 minutes. The `CancelledError` message is misleading, but if you look in the logs for the scheduler task itself, they will record the idle timeout.
The solution is to set `scheduler_timeout` to a higher value, either via config or by passing it directly to your cluster class constructor.
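As a rough sketch of both options, something like the following should work. It assumes `dask_cloudprovider` is installed and that you are using `FargateCluster` (an `ECSCluster` would take the same argument). The `"30 minutes"` value and the `make_cluster` helper are made up for illustration, and the environment variable name relies on Dask's usual convention of mapping `DASK_A__B__C` to the config key `a.b.c`, so double-check it against your Dask version.

```python
import os

# Hypothetical value: pick something longer than your longest "silent" step.
SCHEDULER_TIMEOUT = "30 minutes"

# Option 1: via config. Dask reads DASK_* environment variables when it is
# first imported, so this has to be set before importing dask.
# (Assumed mapping: DASK_CLOUDPROVIDER__ECS__SCHEDULER_TIMEOUT
#  -> cloudprovider.ecs.scheduler_timeout)
os.environ["DASK_CLOUDPROVIDER__ECS__SCHEDULER_TIMEOUT"] = SCHEDULER_TIMEOUT


def make_cluster(**cluster_kwargs):
    """Option 2: pass scheduler_timeout directly to the cluster constructor.

    Illustrative helper only. Calling it needs AWS credentials and will
    start real ECS/Fargate tasks.
    """
    # Imported lazily so the module can be loaded without dask_cloudprovider.
    # Older releases expose this as `from dask_cloudprovider import FargateCluster`.
    from dask_cloudprovider.aws import FargateCluster

    return FargateCluster(scheduler_timeout=SCHEDULER_TIMEOUT, **cluster_kwargs)
```

Either option on its own should be enough. The constructor argument is the more explicit of the two, since it does not depend on when the environment is read.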
The answer is stolen from here.