I'm getting Dask timeouts with my temporary clusters (Prefect Community, #ask-community)

Andrew Lawlor

05/16/2022, 5:15 PM

I'm getting Dask timeouts with my temporary clusters. I'm running Dask on GKE with Prefect Cloud, and I'm seeing a flow hang on:

```
Creating a new Dask cluster with `__main__.get_executor.<locals>.<lambda>`...
```

In my GKE logs, I'm seeing:

```
raise TimeoutError(
asyncio.exceptions.TimeoutError: Nanny failed to start in 60 seconds
```
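The error text fits a common asyncio pattern: a server's startup coroutine is wrapped in `asyncio.wait_for`, and a timeout is re-raised with a "failed to start in N seconds" message. The sketch below reproduces that pattern with a hypothetical `start_worker` coroutine whose delay stands in for the time needed to reach the scheduler; it is an illustration, not distributed's actual source. Where the 60 seconds comes from in this setup is an assumption worth verifying: it may be the worker's `--death-timeout` argument in the generated pod spec.

```python
import asyncio


async def start_worker(connect_delay: float) -> str:
    # Hypothetical stand-in for a Nanny/worker startup that must reach the
    # scheduler first; connect_delay simulates how long that takes.
    await asyncio.sleep(connect_delay)
    return "running"


async def start_with_timeout(connect_delay: float, timeout: float) -> str:
    # Give startup at most `timeout` seconds, then fail with a descriptive error
    # shaped like the one in the GKE logs.
    try:
        return await asyncio.wait_for(start_worker(connect_delay), timeout=timeout)
    except asyncio.TimeoutError as exc:
        raise asyncio.TimeoutError(f"Nanny failed to start in {timeout} seconds") from exc
```

When `connect_delay` exceeds `timeout` (the analogue of a worker that never reaches the scheduler), the call raises the same kind of `TimeoutError` seen in the logs.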

Kevin Kho

05/16/2022, 5:40 PM

This seems like a Dask-specific issue, but I'm not seeing much beyond this error. Can the workers connect to the scheduler?
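One quick way to answer that question is a raw TCP check against the scheduler address from inside a worker pod (for example via `kubectl exec`). The helper below is a hypothetical, standard-library-only sketch; it assumes the address has the `tcp://host:port` form Dask uses and that you know it (the worker pod's `DASK_SCHEDULER_ADDRESS` environment variable is a likely place to look, but check your pod spec). A successful connection only shows the port is reachable, not that the Dask handshake works.

```python
import socket
from urllib.parse import urlparse


def can_reach_scheduler(address: str, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to a tcp://host:port address succeeds."""
    parsed = urlparse(address)
    if parsed.hostname is None or parsed.port is None:
        raise ValueError(f"expected an address like tcp://host:port, got {address!r}")
    try:
        # Only opens and closes a socket; it does not speak the Dask protocol.
        with socket.create_connection((parsed.hostname, parsed.port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or unresolvable host: all count as unreachable.
        return False
```

Inside a worker pod this might be called as `can_reach_scheduler(os.environ["DASK_SCHEDULER_ADDRESS"])`, assuming that variable is set there.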

Andrew Lawlor

05/16/2022, 5:50 PM

I'm sure it is Dask-specific, but I don't know much about Dask. If I'm reading the logs correctly, I think the issue is that the workers can't connect to the scheduler. I just created the cluster like so:

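```
import prefect
from dask_kubernetes import KubeCluster, make_pod_spec
from prefect.executors import DaskExecutor
# `env` is a dict of worker environment variables, defined elsewhere
executor = DaskExecutor(
    cluster_class=lambda: KubeCluster(make_pod_spec(image=prefect.context.image), env=env),
    adapt_kwargs={"minimum": 2, "maximum": 3},
)
```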

So a new temporary cluster gets spun up for each flow run, which seems to be the preferred way to run a flow with a Dask executor. Is there a recommended way to set up the temporary cluster?
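As far as I understand Prefect 1's `DaskExecutor`, a temporary cluster follows roughly this lifecycle: at the start of the flow run it logs "Creating a new Dask cluster with ...", instantiates `cluster_class`, calls `cluster.adapt(**adapt_kwargs)`, and closes the cluster when the run ends. The sketch below imitates that lifecycle with a hypothetical `FakeCluster` and `temporary_cluster` helper so it runs without Kubernetes; it is not Prefect's source, and the adaptive scaling is deliberately simplified.

```python
from contextlib import contextmanager
from typing import Callable, Iterator, Optional, Tuple


class FakeCluster:
    """Hypothetical stand-in for a Dask cluster class such as KubeCluster."""

    def __init__(self) -> None:
        self.workers = 0
        self.closed = False
        self.bounds: Optional[Tuple[int, int]] = None

    def adapt(self, minimum: int, maximum: int) -> None:
        # Simplification: real adaptive scaling moves between the bounds as
        # load changes; here we just record them and start at the minimum.
        self.bounds = (minimum, maximum)
        self.workers = minimum

    def close(self) -> None:
        self.workers = 0
        self.closed = True

    def __enter__(self) -> "FakeCluster":
        return self

    def __exit__(self, *exc_info) -> None:
        # Returning None lets any exception from the flow run propagate.
        self.close()


@contextmanager
def temporary_cluster(
    cluster_class: Callable[[], FakeCluster], adapt_kwargs: Optional[dict] = None
) -> Iterator[FakeCluster]:
    # One cluster per flow run: created on entry, closed on exit even if the run fails.
    with cluster_class() as cluster:
        if adapt_kwargs:
            cluster.adapt(**adapt_kwargs)
        yield cluster
```

In this picture, a flow that hangs right after the "Creating a new Dask cluster" log line is most likely stuck somewhere in cluster startup or in the client connecting to it, which would be consistent with workers that cannot reach the scheduler.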

Kevin Kho

05/16/2022, 5:55 PM

I’ve done the same on GKE Autopilot before and it’s worked for me. I was following this. Do you have quotas that prevent you from spinning up more workers?

Andrew Lawlor

05/16/2022, 5:59 PM

I also followed that, but I'm not using Autopilot because I was seeing different issues: https://prefect-community.slack.com/archives/CL09KU1K7/p1650566183898799 I don't think I have quotas set, but which ones should I be checking for?

Kevin Kho

05/16/2022, 6:02 PM

Ah, I see. If quotas were the problem, I think you would see pods failing to start?
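To follow up on the quota idea: Kubernetes reports each pod's phase (`Pending`, `Running`, `Succeeded`, `Failed`, `Unknown`), and worker pods that stay `Pending` or end up `Failed` are the symptom to look for. The function below is a hypothetical helper that filters a pod-name-to-phase mapping; you could fill the mapping from `kubectl get pods` or the Kubernetes API. For any pod it flags, `kubectl describe pod <name>` lists events that usually explain why it could not be scheduled or started.

```python
# Phases that mean a pod started normally; everything else deserves a look.
HEALTHY_PHASES = {"Running", "Succeeded"}


def stuck_pods(phases: dict[str, str]) -> list[str]:
    """Return the sorted names of pods whose phase suggests they never started properly."""
    return sorted(name for name, phase in phases.items() if phase not in HEALTHY_PHASES)
```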
