# ask-community
b
Been struggling to use Coiled + Prefect to bring up GPU workers with PyTorch + CUDA. Has anyone successfully done this? I bought a Pro Coiled account, set gpu_workers=1, installed cudatoolkit=10.2 as part of my conda package, and used the base container recommended in the docs.
My software config:
coiled.create_software_environment(
    name="gpu-env4",
    container="gpuci/miniconda-cuda:10.2-runtime-ubuntu18.04",
    conda={
        "channels": ["conda-forge", "defaults", "fastchan"],
        "dependencies": [
            "python==3.8",
            "pytorch",
            "torchvision",
            "cudatoolkit=10.2",
            "prefect", 
            "fastai",
            "scikit-image",
            "numpy",
            "dask",
            "bokeh>=0.13.0",
        ]
    })
My cluster config:
executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={
        "software": "gpu-env4",
        "shutdown_on_close": False,
        "name": "prefect-executor",
        "worker_memory": "15 GiB",
        "worker_gpu": 1,
        "account": "(my account id)"
    },
)
If I do this I get:
Task 'run_model': Exception encountered during task execution!
Traceback (most recent call last):
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/prefect/engine/task_runner.py", line 861, in get_task_run_state
    value = prefect.utilities.executors.run_task_with_timeout(
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/prefect/utilities/executors.py", line 328, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "/tmp/ipykernel_29387/1929835160.py", line 61, in run_model
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 673, in to
    return self._apply(convert)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 387, in _apply
    module._apply(fn)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 409, in _apply
    param_applied = fn(param)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/nn/modules/module.py", line 671, in convert
    return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
  File "/opt/conda/envs/coiled/lib/python3.8/site-packages/torch/cuda/__init__.py", line 164, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
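That assertion normally means the conda solver installed a CPU-only PyTorch build, regardless of whether the machine has a GPU. As a hedged sketch (the `client` object, a connected `distributed.Client`, and a running cluster are all assumptions, not from this thread), a small diagnostic could be shipped to every worker to confirm which build they actually got:

```python
# Hypothetical diagnostic: report each worker's PyTorch build details.
# Assumes a running Dask cluster and a connected distributed.Client
# named `client`; neither is shown in this thread.
def cuda_report():
    import torch
    return {
        "torch": torch.__version__,                  # a "+cpu" suffix means a CPU-only wheel
        "compiled_cuda": torch.version.cuda,         # None on CPU-only conda builds
        "cuda_available": torch.cuda.is_available(), # False if no GPU/driver is visible
    }

# client.run(cuda_report)  # -> {worker_address: report_dict, ...}
```

If `compiled_cuda` comes back as None, the software environment needs a different PyTorch build; if it is set but `cuda_available` is False, the worker VM itself has no visible GPU or driver.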
k
Hey @Brett Jurman, are you using DockerRun by chance?
b
Not that I know of.
I'm not familiar with which component that would be.
k
Have you tried installing dask-cuda in the software environment also?
b
I have not, I didn't think of doing that.
k
I am just comparing against the list of packages they have here
b
Yeah, that's what I'm basing it on.
I don't think I need that necessarily, but I'll try it now.
I'll let you know if that works, but I'd be slightly surprised.
here is my updated config:
coiled.create_software_environment(
    name="gpu-env-cuda-dask",
    container="gpuci/miniconda-cuda:10.2-runtime-ubuntu18.04",
    conda={
        "channels": ["conda-forge", "defaults", "fastchan"],
        "dependencies": [
            "python==3.8",
            "pytorch",
            "torchvision",
            "cudatoolkit=10.2",
            "prefect", 
            "fastai",
            "scikit-image",
            "numpy",
            "dask",
            "bokeh>=0.13.0",
            "dask-cuda"
        ]
    })
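One possible cause worth ruling out (an assumption on my part, not confirmed in this thread): with these channels, the solver can happily pick the CPU-only pytorch build even when cudatoolkit is listed. A sketch of the same environment with a conda build-string glob to refuse CPU-only builds follows; the "gpu-env-cuda-pinned" name and the exact glob are assumptions, and the build-string pattern should be checked against the channel's actual package builds (conda-forge CUDA builds of pytorch carry build strings like "cuda102py38..."):

```python
import coiled

# Sketch only: pin pytorch to a CUDA build via a conda build-string glob.
coiled.create_software_environment(
    name="gpu-env-cuda-pinned",
    container="gpuci/miniconda-cuda:10.2-runtime-ubuntu18.04",
    conda={
        "channels": ["conda-forge", "defaults", "fastchan"],
        "dependencies": [
            "python==3.8",
            "pytorch=*=cuda*",   # build-string glob: any version, CUDA builds only
            "torchvision",
            "cudatoolkit=10.2",
            "prefect",
            "fastai",
            "scikit-image",
            "numpy",
            "dask",
            "dask-cuda",
            "bokeh>=0.13.0",
        ]
    })
```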
k
I don't know either if it'll work, but I suspect it's needed to bring CUDA to the other workers (4 workers on a cluster by default).
b
I'm giving it a try! Thanks for the advice.
k
If this still fails, I think this is more of a Coiled issue; they can probably help you better on their Slack channel.
b
Thanks, I'll try them after.
I'm about to test it.
k
Saw your post in Coiled. Actually, I realized that this error might be caused by the agent not having the packages. Are you using a local agent? If your import statements are at the top, you would need the libraries installed on the Local Agent.
Are you running a registered flow or using flow.run()? Is the agent running on the same machine that you develop on?
b
Hey, sorry, I think we figured it out via the Coiled thread. I am running it on the Coiled cluster, as you surmised. I think the missing piece is that you need to use the VM backend, and you should disable spot instances because you will probably not get workers otherwise.
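For anyone finding this later, a sketch of what that fix might look like on the executor config. This is not from the thread: the `backend_options` keyword and its "spot" key are assumptions based on the Coiled API of the time, so check your Coiled version's documentation for the exact names before copying this.

```python
import coiled
from prefect.executors import DaskExecutor

# Sketch only: same executor as above, but asking Coiled for on-demand
# (non-spot) instances, since spot GPU capacity is often unavailable.
executor = DaskExecutor(
    cluster_class=coiled.Cluster,
    cluster_kwargs={
        "software": "gpu-env-cuda-dask",
        "name": "prefect-executor",
        "worker_memory": "15 GiB",
        "worker_gpu": 1,
        "backend_options": {"spot": False},  # assumed key: request on-demand VMs
    },
)
```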
k
Yeah, I ran into the same problems as you until I followed Fabio's suggestion on spot instances. I am honestly not 100% sure though that the GPU is detected, and there still might be issues getting Dask to detect it. Glad we got something up though; it was my first time trying also.