# ask-community
b
Been struggling to use Coiled + Prefect to bring up GPU workers with PyTorch + CUDA. Has anyone successfully done this? I bought a Pro Coiled account, set gpu_workers=1, installed cudatoolkit=10.2 as part of my conda package, and used the base container recommended in the docs.
My software config:
coiled.create_software_environment(
    name="gpu-env4",
    container="gpuci/miniconda-cuda:10.2-runtime-ubuntu18.04",
    conda={
        "channels": ["conda-forge", "defaults", "fastchan"],
        "dependencies": [
            "python==3.8",
            "pytorch",
            "torchvision",
            "cudatoolkit=10.2",
            "prefect", 
            "fastai",
            "scikit-image",
            "numpy",
            "dask",
            "bokeh>=0.13.0",
        ]
    })
My cluster config:
executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={
        "software": "gpu-env4",
        "shutdown_on_close": False,
        "name": "prefect-executor",
        "worker_memory": "15 GiB",
        "worker_gpu": 1,
        "account": "(my account id)"
    },
)
If I do this I get:
Task 'run_model': Exception encountered during task execution!
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/prefect/utilities/executors.py", line 328, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/tmp/ipykernel_29387/1929835160.py", line 61, in run_model
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/cuda/__init__.py", line 164, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
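That assertion normally means the conda solver installed a CPU-only PyTorch build, regardless of whether the machine has a GPU. As a hedged sketch (the `client` object, a connected `distributed.Client`, and a running cluster are all assumptions, not from this thread), a small diagnostic could be shipped to every worker to confirm which build they actually got:

```python
# Hypothetical diagnostic: report each worker's PyTorch build details.
# Assumes a running Dask cluster and a connected distributed.Client
# named `client`; neither is shown in this thread.
def cuda_report():
    import torch
    return {
        "torch": torch.__version__,                  # a "+cpu" suffix means a CPU-only wheel
        "compiled_cuda": torch.version.cuda,         # None on CPU-only conda builds
        "cuda_available": torch.cuda.is_available(), # False if no GPU/driver is visible
    }

# client.run(cuda_report)  # -> {worker_address: report_dict, ...}
```

If `compiled_cuda` comes back as None, the software environment needs a different PyTorch build; if it is set but `cuda_available` is False, the worker VM itself has no visible GPU or driver.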
k
Hey @Brett Jurman, are you using DockerRun by chance?
b
Not that I know of.
I'm not familiar with which component that would be.
k
Have you tried installing dask-cuda in the software environment also?
b
I have not, I didn't think of doing that.
k
I am just comparing against the list of packages they have here
b
Yeah, that's what I'm basing it on.
I don't think I need that necessarily, but I'll try it now.
I'll let you know if that works, but I'd be slightly surprised.
here is my updated config:
coiled.create_software_environment(
    name="gpu-env-cuda-dask",
    container="gpuci/miniconda-cuda:10.2-runtime-ubuntu18.04",
    conda={
        "channels": ["conda-forge", "defaults", "fastchan"],
        "dependencies": [
            "python==3.8",
            "pytorch",
            "torchvision",
            "cudatoolkit=10.2",
            "prefect", 
            "fastai",
            "scikit-image",
            "numpy",
            "dask",
            "bokeh>=0.13.0",
            "dask-cuda"
        ]
    })
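One possible cause worth ruling out (an assumption on my part, not confirmed in this thread): with these channels, the solver can happily pick the CPU-only pytorch build even when cudatoolkit is listed. A sketch of the same environment with a conda build-string glob to refuse CPU-only builds follows; the "gpu-env-cuda-pinned" name and the exact glob are assumptions, and the build-string pattern should be checked against the channel's actual package builds (conda-forge CUDA builds of pytorch carry build strings like "cuda102py38..."):

```python
import coiled

# Sketch only: pin pytorch to a CUDA build via a conda build-string glob.
coiled.create_software_environment(
    name="gpu-env-cuda-pinned",
    container="gpuci/miniconda-cuda:10.2-runtime-ubuntu18.04",
    conda={
        "channels": ["conda-forge", "defaults", "fastchan"],
        "dependencies": [
            "python==3.8",
            "pytorch=*=cuda*",   # build-string glob: any version, CUDA builds only
            "torchvision",
            "cudatoolkit=10.2",
            "prefect",
            "fastai",
            "scikit-image",
            "numpy",
            "dask",
            "dask-cuda",
            "bokeh>=0.13.0",
        ]
    })
```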
k
I don't know either if it'll work, but I suspect it's needed to bring CUDA to the other workers (4 workers on a cluster by default).
b
I'm giving it a try! Thanks for the advice.
k
If this still fails, I think this is more of a Coiled issue; they can probably help you better on their Slack channel.
b
Thanks, I'll try them after.
I'm about to test it.
k
Saw your post in Coiled. Actually, I realized that this error might be caused by the agent not having the packages. Are you using a local agent? If your import statements are at the top, you would need the libraries installed on the Local Agent.
Are you running a registered flow or using flow.run()? Is the agent running on the same machine that you develop on?
b
Hey, sorry, I think we figured it out via the Coiled thread. I am running it on the Coiled cluster, as you surmised. I think the missing piece is that you need to use the VM backend, and you should disable spot instances because you will probably not get workers otherwise.
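For anyone finding this later, a sketch of what that fix might look like on the executor config. This is not from the thread: the `backend_options` keyword and its "spot" key are assumptions based on the Coiled API of the time, so check your Coiled version's documentation for the exact names before copying this.

```python
import coiled
from prefect.executors import DaskExecutor

# Sketch only: same executor as above, but asking Coiled for on-demand
# (non-spot) instances, since spot GPU capacity is often unavailable.
executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={
        "software": "gpu-env-cuda-dask",
        "name": "prefect-executor",
        "worker_memory": "15 GiB",
        "worker_gpu": 1,
        "backend_options": {"spot": False},  # assumed key: request on-demand VMs
    },
)
```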
k
Yeah, I ran into the same problems as you until I followed Fabio's suggestion on spot instances. I am honestly not 100% sure though that the GPU is detected, and there still might be issues getting Dask to detect it. Glad we got something up though; it was my first time trying also.