haf

12/10/2021, 11:35 AM

Has anyone got spurious `CancelledError`s when running with Dask? Flow id: `e2631f0e-2987-405f-bf3b-3fbb46acf90c`

haf

12/10/2021, 11:37 AM

Might be from missing heartbeats from one of the nodes?

Anna Geller

12/10/2021, 11:41 AM

I think you’re right that this might happen with long-running flow runs that have lost the flow’s heartbeat. The `concurrent.futures._base.CancelledError` can result from a long-running step in a computation that produces no output (logging or otherwise) to the `Client`. In these cases, due to the lack of interaction with the client, the scheduler regards itself as “idle” and times out after the configured `cloudprovider.ecs.scheduler_timeout` period, which defaults to 5 minutes. The `CancelledError` message is misleading, but if you look in the logs for the scheduler task itself, it will record the idle timeout. The solution is to set `scheduler_timeout` to a higher value, either via config or by passing it directly to your cluster class constructor. The answer is stolen from here.
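A minimal sketch of the two options mentioned above, assuming the cluster is created through dask-cloudprovider on ECS/Fargate (the thread does not confirm this). The 30-minute value and the `make_cluster` helper are hypothetical, chosen only for illustration.

```python
import dask

# Hypothetical value: pick something longer than the longest step
# that produces no output to the client.
TIMEOUT = "30 minutes"

# Option 1: raise the idle timeout via Dask config, before the cluster is created.
dask.config.set({"cloudprovider.ecs.scheduler_timeout": TIMEOUT})


def make_cluster():
    """Option 2: pass the timeout straight to the cluster constructor.

    Not called here: it needs dask-cloudprovider installed and AWS credentials.
    The import path assumes a recent dask-cloudprovider release.
    """
    from dask_cloudprovider.aws import FargateCluster

    return FargateCluster(scheduler_timeout=TIMEOUT)


# Confirm that the config value was picked up.
print(dask.config.get("cloudprovider.ecs.scheduler_timeout"))
```

Either option should be enough on its own; the config route is handy when the cluster is constructed somewhere you do not control.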

haf

12/10/2021, 11:46 AM

No I don't think this is the case.

haf

12/10/2021, 11:46 AM

This happened after 1 minute and 30 seconds when, just like in https://github.com/dask/distributed/issues/2628, the scheduler, workers, and Prefect agent were all newly deployed

Anna Geller

12/10/2021, 11:54 AM

This issue is quite old so it would be surprising to me if this is related, but even then, the conclusion was “ultimately the error arises from sharing futures between clients and is therefore not a bug” - is this what you are doing?
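To illustrate what "sharing futures" can look like, here is a rough standard-library analogy (not Dask code, and not necessarily what Prefect does): when two parties hold the same future and one cancels it, the other sees `CancelledError` when it asks for the result. The names `holder_a`, `holder_b`, and `wait_for` are made up for this sketch.

```python
from concurrent.futures import CancelledError, Future


def wait_for(future: Future) -> str:
    """Return a short status string instead of letting the exception propagate."""
    try:
        return f"result: {future.result(timeout=1)}"
    except CancelledError:
        return "cancelled by another holder"


shared = Future()   # a pending future referenced by two parties
holder_a = shared
holder_b = shared

holder_b.cancel()             # one holder cancels the future...
status = wait_for(holder_a)   # ...and the other gets CancelledError on result()
print(status)
```

The party that receives the error never asked for a cancellation, which is why the message can look spurious from its side.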

haf

12/10/2021, 4:20 PM

I don't know how Prefect does it, maybe it does, maybe not. I'm not touching the client in my own code, it's just plain tasks.

haf

12/10/2021, 4:22 PM

> but if you look in the logs for the scheduler task itself it will record the idle timeout.

There's no timeout in the scheduler logs for me.

haf

12/10/2021, 5:30 PM

This time I got it in the "finish" task that collects all the results and makes them available, after 1h 57m

haf

12/10/2021, 5:31 PM

I promise I have not pressed cancel 😅

Ali Abdelmotalib

01/14/2022, 1:40 PM

@haf have you been able to solve your problem, or at least reproduce it? I have the same problem and I have no clue what the cause is.

haf

01/17/2022, 7:50 AM

@Ali Abdelmotalib No, I had to re-implement the service in pure Python to get around these issues, and the cancelled error might be different from https://prefect-community.slack.com/archives/CL09KU1K7/p1638014722400800
