Brian McFeeley

07/31/2019, 9:30 PM
I've made some good progress getting a more "real" Dask cluster set up, but I'm still running into issues with tasks/workers being lost. Anecdotally, the problem shows up more often the more workers I create. Things hum along smoothly, then right as the flow should be finishing, the connectivity between the workers and the scheduler seems to get disrupted:
2019-07-31T21:26:08.029Z [dask-cluster-worker b47e4b0c1a70]: distributed.worker - INFO - Stopping worker at <tcp://elb-trialspark-19870.aptible.in:45604>
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]: tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7fb4fcd2d950>, <Future finished exception=TypeError("'NoneType' object is not subscriptable")>)
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]: Traceback (most recent call last):
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:   File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 758, in _run_callback
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:     ret = callback()
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:   File "/usr/local/lib/python3.7/site-packages/tornado/stack_context.py", line 300, in null_wrapper
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:     return fn(*args, **kwargs)
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:   File "/usr/local/lib/python3.7/site-packages/tornado/ioloop.py", line 779, in _discard_future_result
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:     future.result()
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:   File "/usr/local/lib/python3.7/site-packages/tornado/gen.py", line 1147, in run
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:     yielded = self.gen.send(value)
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:   File "/usr/local/lib/python3.7/site-packages/distributed/worker.py", line 796, in heartbeat
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]:     if response["status"] == "missing":
2019-07-31T21:26:08.031Z [dask-cluster-worker b47e4b0c1a70]: TypeError: 'NoneType' object is not subscriptable
2019-07-31T21:26:08.036Z [dask-cluster-worker b47e4b0c1a70]: distributed.nanny - INFO - Closing Nanny at '<tcp://172.17.0.67:43533>'
2019-07-31T21:26:08.037Z [dask-cluster-worker b47e4b0c1a70]: distributed.worker - INFO - Connection to scheduler broken.  Reconnecting...
When this happens, we end up rerunning a large portion of previously completed tasks whose results were not persisted. I still strongly suspect the issue lies in our deployment environment: Aptible, our PaaS, routinely kills and restarts containers that hit or exceed their memory limits, for example. I've reached out to them for logs to confirm whether these containers are being restarted, but if this problem looks at all familiar, let me know if you have a workaround.
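In the meantime, one workaround I'm considering (untested, and the scheduler address and sizes below are just placeholders) is giving each dask-worker an explicit memory limit comfortably below the Aptible container cap, so distributed spills to disk or pauses the worker itself before the PaaS hard-kills the whole container:
# Placeholder address and sizes, just to show the flags I mean.
# Keep --memory-limit well under the container's memory cap so distributed
# manages memory (spill/pause/restart) instead of the PaaS OOM-killing the container.
dask-worker tcp://my-scheduler:8786 \
    --nprocs 1 \
    --nthreads 1 \
    --memory-limit 3GB \
    --death-timeout 120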
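If it turns out the scheduler is only briefly unreachable rather than the containers actually dying, I may also try loosening distributed's comm timeouts and failure tolerance via its config. A rough sketch using environment variables on the scheduler and worker containers (values are guesses, not recommendations):
# Dask maps DASK_DISTRIBUTED__* environment variables onto the distributed config tree.
export DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=60s      # default 10s
export DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=120s         # default 30s
export DASK_DISTRIBUTED__SCHEDULER__ALLOWED_FAILURES=10   # default 3: worker deaths a task can survive before it errors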