ciaran
04/12/2021, 2:25 PM
Zanie
ciaran
04/12/2021, 3:24 PM
executor = DaskExecutor(
    cluster_class="dask_cloudprovider.aws.FargateCluster",
    cluster_kwargs={
        "image": worker_image,
        "vpc": outputs["vpc_output"],
        "cluster_arn": outputs["cluster_arn_output"],
        "task_role_arn": outputs["task_role_arn_output"],
        "execution_role_arn": outputs["task_execution_role_arn_output"],
        "security_groups": [outputs["dask_security_group_output"]],
        "n_workers": 1,
        "scheduler_cpu": 256,
        "scheduler_mem": 512,
        "worker_cpu": 1024,
        "worker_mem": 2048,
        "scheduler_timeout": "15 minutes",
        "tags": tags["tag_dict"],
    },
)
The log from Prefect I'm seeing is:
Unexpected error: OSError('Timed out trying to connect to tcp://34.219.0.113:8786 after 10 s')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 317, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 73, in _start
    comm = await self.scheduler_comm.live_comm()
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 747, in live_comm
    comm = await connect(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://34.219.0.113:8786 after 10 s

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 286, in connect
    comm = await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 501, in wait_for
    raise exceptions.TimeoutError()
asyncio.exceptions.TimeoutError

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
    new_state = method(self, state, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/prefect/engine/flow_runner.py", line 421, in get_flow_run_state
    with self.check_for_cancellation(), executor.start():
  File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
    return next(self.gen)
  File "/usr/local/lib/python3.8/site-packages/prefect/executors/dask.py", line 213, in start
    with self.cluster_class(**self.cluster_kwargs) as cluster:  # type: ignore
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 1367, in __init__
    super().__init__(fargate_scheduler=True, fargate_workers=True, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 733, in __init__
    super().__init__(**kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 282, in __init__
    self.sync(self._start)
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/cluster.py", line 189, in sync
    return sync(self.loop, func, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 353, in sync
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 336, in f
    result[0] = yield future
  File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
    value = future.result()
  File "/usr/local/lib/python3.8/site-packages/dask_cloudprovider/aws/ecs.py", line 930, in _start
    await super()._start()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 320, in _start
    await self._close()
  File "/usr/local/lib/python3.8/site-packages/distributed/deploy/spec.py", line 419, in _close
    await self.scheduler_comm.close(close_workers=True)
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 789, in send_recv_from_rpc
    comm = await self.live_comm()
  File "/usr/local/lib/python3.8/site-packages/distributed/core.py", line 747, in live_comm
    comm = await connect(
  File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
    raise IOError(
OSError: Timed out trying to connect to tcp://34.219.0.113:8786 after 10 s
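One way to narrow down a timeout like this is to check whether the scheduler port accepts TCP connections at all from the machine that runs the flow. The log does not say why the connection timed out, so treat this as a diagnostic sketch only. The helper name `can_connect` is hypothetical and uses nothing beyond the Python standard library:

```python
import socket


def can_connect(host: str, port: int, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable hosts.
        return False
```

Calling `can_connect("34.219.0.113", 8786)` with the address from the error message returns False if the port is filtered, closed, or the task has already stopped. A False result only shows that the port was unreachable at that moment, not the underlying cause.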
ciaran
04/12/2021, 3:25 PM
Zanie
Zanie
ciaran
04/12/2021, 3:29 PM
ciaran
04/12/2021, 3:31 PM
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve ecr registry auth: service call has been retried 1 time(s): RequestError: send request failed caused by: Post <https://api.ecr>....
ciaran
04/12/2021, 3:38 PM
Zanie
ciaran
04/12/2021, 4:41 PM
Carter Kwon
04/28/2021, 12:45 AM
ciaran
04/28/2021, 8:49 AM