m

    Matic Lubej

    1 year ago
    Hi again! I'm trying to run a process over a Fargate cluster using
    dask_cloudprovider
    API. running tutorials and sample code from
    dask
    this works great, but for prefect I have created a dedicated docker image which I provide to the cluster initializer. The cluster gets created, but as soon as the flow starts, after 10 s I get the following time-out error:
    Traceback (most recent call last):
      File "/usr/local/lib/python3.8/site-packages/prefect/engine/runner.py", line 48, in inner
        new_state = method(self, state, *args, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/prefect/engine/flow_runner.py", line 418, in get_flow_run_state
        with self.check_for_cancellation(), executor.start():
      File "/usr/local/lib/python3.8/contextlib.py", line 113, in __enter__
        return next(self.gen)
      File "/usr/local/lib/python3.8/site-packages/prefect/executors/dask.py", line 203, in start
        with Client(self.address, **self.client_kwargs) as client:
      File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 748, in __init__
        self.start(timeout=timeout)
      File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 953, in start
        sync(self.loop, self._start, **kwargs)
      File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 340, in sync
        raise exc.with_traceback(tb)
      File "/usr/local/lib/python3.8/site-packages/distributed/utils.py", line 324, in f
        result[0] = yield future
      File "/usr/local/lib/python3.8/site-packages/tornado/gen.py", line 762, in run
        value = future.result()
      File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 1043, in _start
        await self._ensure_connected(timeout=timeout)
      File "/usr/local/lib/python3.8/site-packages/distributed/client.py", line 1100, in _ensure_connected
        comm = await connect(
      File "/usr/local/lib/python3.8/site-packages/distributed/comm/core.py", line 308, in connect
        raise IOError(
    OSError: Timed out trying to connect to <tcp://172.31.40.184:8786> after 10 s
    [2021-01-19 20:14:19+0000] ERROR - prefect.Execute process | Unexpected error occured in FlowRunner: OSError('Timed out trying to connect to <tcp://172.31.40.184:8786> after 10 s')
    Traceback (most recent call last):
      File "s3_process_l2a_2019.py", line 114, in <module>
        assert status.is_successful()
    AssertionError
    Any ideas what is going on? Is the dask scheduler having issues connecting to the workers? Or might it be something else? Thanks!
    j

    josh

    1 year ago
    Hi @Matic Lubej I can’t say for certain based on the information you provided but it looks like wherever your flow is running it cannot communicate with your Dask scheduler
    m

    Matic Lubej

    1 year ago
    It turned out hat I had to add the security information to the dask executor, info provided in https://docs.prefect.io/core/advanced_tutorials/dask-cluster.html#an-example-flow Thanks!