# prefect-community
s
Hey, good morning everyone. Has anyone tried to use nvidia-docker as a Prefect agent to run flows? I have the setup ready and running, but when the task that needs the GPU starts executing, it runs on the CPU instead of the GPU. I can see the logs below:

https://pasteboard.co/kpsNsqpN3aXP.png

How do you make this work?
This is the command I use to run the Prefect agent on the local machine:
docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia  nvidia_docker_agent:latest
When I run the command below on the local machine, I get the following nvidia-smi output, which shows that the Docker container has access to the GPU:
docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia  nvidia_docker_agent:latest nvidia-smi

https://pasteboard.co/j4OV6UW4YoY5.png

a
Interesting! So the CUDA drivers seem to be installed fine. It's quite honestly a bit hard to debug because it's a non-Prefect-related issue and I have no CUDA-compatible machine to reproduce and test this atm. Could you start with the problem that you try to solve? What's your infrastructure? Do you already have some on-prem instances with GPUs that you want to utilize for your flows?
s
yes. I do have a machine with RTX 3090 GPU which I wanted to use to host some agents for this specific task
a
in general, running a docker agent in a docker container is problematic, but you can just start the docker agent on this machine as a local process using:
prefect agent docker start --label GPU --key CLOUD_API_KEY
and then check if the flow run containers that get spun up can utilize the GPU resources
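A rough way to do that check from inside a task is to ask `nvidia-smi` which GPUs the process can see. This is a minimal sketch, assuming `nvidia-smi` is on the PATH wherever the flow run executes; the helper name `list_visible_gpus` is made up for illustration. Note that it only shows the driver is visible; it does not prove that a framework such as TensorFlow will actually place work on the GPU.

```python
import shutil
import subprocess
from typing import List


def list_visible_gpus() -> List[str]:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none are visible."""
    # nvidia-smi is absent when the NVIDIA tools are not exposed to this process
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        result = subprocess.run(
            ["nvidia-smi", "-L"],  # -L lists the GPUs, one per line
            capture_output=True,
            text=True,
            timeout=30,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return []
    if result.returncode != 0:
        return []
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for gpu in gpus:
        print(gpu)
```

Calling this at the start of the GPU task and logging the result makes it easier to tell a container that cannot see the device apart from a library that falls back to the CPU.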
s
So my problem is that I need to run some logic that calculates a score called BLEURT, given a URL to a CSV. I started off by creating a Prefect flow to do this, and then I created a Docker agent with the necessary libraries. The base image of the Docker agent is
nvidia/cuda:11.4.0-runtime-ubuntu20.04
👍 1
ok let me try your way..
a
"My way" sounds scary 😄 it's just one of many possible options. As an alternative to a Docker agent spun up as a local process (rather than a docker container), you can always have a local agent running on this machine, and this should work fine because Prefect will then just spin up local subprocesses for the flow run and those processes will definitely be able to utilize the GPU resources on the machine. And to manage code dependencies such as those required by e.g. the Bleurt-calculating flow you can create a virtual environment and spin up an agent in this virtual environment. I could help you configure that with Conda virtual environment since I did that in the past
upvote 1
s
I already have a conda environment for this, but I always get confused on how to start this agent locally:
# way - 1
prefect agent docker start --label GPU 

# way - 2
prefect agent local start --label GPU
What is the difference, if both run locally as local subprocesses?
a
Have you signed up for Prefect Cloud already or are you still on Server? the commands above are correct and you can always add --help to get more info and check the syntax e.g.
prefect agent local start --help
s
yes.. I have signed up on the cloud yesterday
🎉 1
💯 1
🚀 1
a
Good question! The difference is:
• docker agent subprocess spins up new flow runs as Docker containers
• local agent subprocess spins up new flow runs as other local subprocesses
s
so for accessing the GPU's in my flows, I can use both ways, is that right?
a
Yes, I believe you can 🙂
• With local agents, I'm sure they can directly utilize the GPU resources on your machine.
• With Docker agents, I'm not 100% sure since we rely on docker-py, so you may need to configure some extra `host_config` args on the `DockerRun` to make that work. E.g. here I can see a config called device_requests which you may have to configure, and there may be some extra work needed on top of that. I don't have enough experience with GPU-based workloads to say for sure.
s
So if I want to run this flow locally, can I change the run_config to UniversalRun() and the storage to Local()?
a
@Sen yes, absolutely! Just pass the label of the local agent and you should be good to go 👍
s
@Anna Geller yeah.. the local agent way actually works and I can see the GPU being used.. Do you know how to run a local agent in the background on a machine? Should I use the nohup way, or is there a better way to have a local agent always running on my machine?
a
Great question! You certainly can use `nohup`, even though we usually recommend using `supervisor`. For supervisor, we have docs on how to implement that here. Basically, run this command:
prefect agent local install --key API_KEY --label YOUR_AGENT_LABEL > supervisord.conf
This writes a supervisord.conf file, which you can then use to start the process:
supervisord -c ./supervisord.conf
And if you want to start the agent using e.g. a virtual environment with Conda, you can create a file `supervisord.conf` with this content:
[unix_http_server]
file=/tmp/supervisor.sock   ; the path to the socket file

[supervisord]
loglevel=debug               ; log level; default info; others: debug,warn,trace

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL  for a unix socket

[program:prefect-agent]
command=/Users/yourName/opt/anaconda3/envs/yourVenvName/bin/prefect agent local start --key API_KEY --label YOUR_AGENT_LABEL
and then again, start it using:
supervisord -c ./supervisord.conf
Note that you may need to adjust the permissions because the user who starts the supervisor process needs to be able to run this and write to `supervisord.log`. So you could e.g. start the supervisor process as the root user using `-u root`, and you can also specify the location of your process logs with `-l`.
s
thank you so much.. this is really useful.. I will try and see if I can come around right.. 🙏
a
I also assume you run it on some VM right?
You could e.g. make the process even more reliable by adding a `crontab` entry that starts this command any time you start or restart your VM. This may be useful if you ever shut down the VM (e.g. if you stop it for the night to save costs on your cloud bill). Here is how I recently did it on an Azure VM:
echo "@reboot root supervisord -c /home/azureuser/supervisord.conf -l /home/azureuser/supervisord.log -u root" >> /etc/crontab
☝️ assumes you run a Linux VM
s
nope this is not a vm but otherwise running ubuntu 20.04
👍 1
a
Oh you're right, if that's running on your laptop, then it's probably even more useful to automatically restart the agent upon reboot 😄 Feel free to post your GPU-powered flows to a public GitHub repo and share - I'm sure many in the community may find it very useful! I really like that you shared your experience with Prefect here https://github.com/Navaneethsen/prefect_docker_experiment
s
Sure I will do that..
k
We use docker-py under the hood and attaching a GPU to a Docker container is unsupported right now.
👍 1
You can this
upvote 1
s
Does this mean it is a Prefect issue that Docker can't connect to the GPU?
k
Uhh I’d prefer to say docker-py issue rather than Prefect issue cuz it’s not something we can do anything about until they expose it, but I guess that’s just semantics. Someone did type up a hack for me here.
👏 1
It’s an HTML page of my own notes so you need to view it in a browser
s
I wanted to try the local way as @Anna Geller suggested. This actually works, but I am having issues with HTTP_PROXY as my machine is behind a proxy. Do you know how I can supply the HTTP_PROXY settings to the flow? Does it need to be a dictionary, or can I give KEY=VAL?
@Kevin Kho but I am going to try your way once I get past this step, as I will then have something to demo to the team..
k
How would you do it in native Python?
s
for docker I would pass them in a .env file
but when I try to use the same file it is complaining about the dictionary structure
I never tried this in a local run yet
right now I have constructed a file with the dict structure {k0:v0, k1:v1, etc.} and I am planning to pass it with
prefect agent local start -e env_dict_file
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/bin/prefect", line 8, in <module>
    sys.exit(cli())
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 182, in start
    start_agent(LocalAgent, import_paths=list(import_paths), **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 130, in start_agent
    env_vars = dict(e.split("=", 1) for e in env)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
this is the error I get
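The last frame of the traceback shows why a file name fails here: each `-e` value is split on the first `=`, so it has to look like `KEY=VALUE`. The sketch below copies that one line from the traceback into a hypothetical helper `parse_env_options` to reproduce the behaviour; the proxy URL is a placeholder, not a value from this thread.

```python
from typing import Dict, Iterable, Optional


def parse_env_options(env: Iterable[str]) -> Dict[str, str]:
    """Same expression as in the traceback: every item must contain '='."""
    return dict(e.split("=", 1) for e in env)


# Works: each item splits into exactly one (key, value) pair.
ok = parse_env_options(
    ["HTTP_PROXY=http://proxy.example:8080", "NO_PROXY=localhost"]
)

# Fails: a bare file name has no '=', so the split yields a single element
# and dict() cannot build a pair from it.
error_message: Optional[str] = None
try:
    parse_env_options(["env_dict_file"])
except ValueError as exc:
    error_message = str(exc)

print(ok)
print(error_message)
```

So, going by the traceback, the option expects the variables themselves (for example `-e HTTP_PROXY=...`) rather than a path to a file containing them.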
a
@Sen my understanding now is: Kevin’s notes demonstrate that the Docker option is possible, but they also show that it is way more challenging to set up. So I would use the local agent option for your initial PoC and see how it goes. Once you hit some limits with the local agent and a single VM, chances are that you'll need a cluster of machines anyway, and then you may need to dockerize the process to distribute it across several instances. When it comes to HTTP_PROXY, you only need outbound traffic from your VM to Prefect Cloud, so I'm not sure setting this proxy is needed. Can you explain it a bit more?
s
so in one of my tasks I am trying to contact Azure Key Vault to get some keys.. the connection to the Key Vault fails with the error below:
Task 'task_authenticate_and_get_keys': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    logger=self.logger,
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs) # type: ignore
  File "on_prem_translationevaluator_flow/flow_local.py", line 171, in task_authenticate_and_get_keys
    secrets["COSMOS_WRITE_URL"] = project_odin_key_vault_client.get_secret("COSMOS-WRITE-URL").value
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_client.py", line 72, in get_secret
    **kwargs
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/_operations_mixin.py", line 1475, in get_secret
    return mixin_instance.get_secret(vault_base_url, secret_name, secret_version, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/v7_1/operations/_key_vault_client_operations.py", line 276, in get_secret
    pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 211, in run
    return first_node.send(pipeline_request) # type: ignore
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 2 more times]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_redirect.py", line 158, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 457, in send
    raise err
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 435, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_shared/challenge_auth_policy.py", line 104, in send
    challenger = self.next.send(challenge_request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 1 more time]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 103, in send
    self._sender.send(request.http_request, **request.context.options),
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 285, in send
    raise error
azure.core.exceptions.ServiceRequestError: ('Cannot connect to proxy.', OSError('Tunnel connection failed: 502 Proxy Error ( Host was not found )'))
a
regarding your env variable issue, I answered exactly the same question today to another user 🙂 it’s here: https://prefect-community.slack.com/archives/CL09KU1K7/p1646388377852089?thread_ts=1646355042.553639&cid=CL09KU1K7
ok, I read more and you’re right - setting env variables should indeed fix your issue. @Sen you can do that as explained in the post above by attaching the config read with the python-dotenv package to your `UniversalRun` run config
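To illustrate the idea of turning a `.env` file into a dictionary of environment variables, here is a rough standard-library stand-in. In the thread the real python-dotenv package is used instead; the helper `read_env_file`, the file contents, and the proxy URL below are assumptions made up for this sketch, and the parser handles only plain `KEY=VALUE` lines.

```python
import tempfile
from pathlib import Path
from typing import Dict


def read_env_file(path: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    env: Dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Drop surrounding whitespace and optional quotes around the value
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


# Demo with a temporary .env file holding placeholder proxy settings
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text(
        "# proxy settings (placeholder values)\n"
        "HTTP_PROXY=http://proxy.example:8080\n"
        'HTTPS_PROXY="http://proxy.example:8080"\n'
    )
    env = read_env_file(str(env_path))

print(env)

# With Prefect 1.x installed, a dict like this could then be attached to the
# flow, for example: flow.run_config = UniversalRun(env=env, labels=["GPU"])
```

The resulting dictionary has the shape that the `env` argument of `UniversalRun` takes, which is why a dict works where the raw `.env` file path did not.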
s
yeah.. tried the dotenv and it works now.. shoo.. 👏👏well done @Anna Geller @Kevin Kho.. I don't think I get this kind of support from anywhere else.. 🤩
🙌 2
How do I control the number of parallel runs on this local agent? Because it is using the GPU, I am getting the RESOURCE_EXHAUSTED error when two runs happen in parallel. Can this be controlled in any way, or do I need to control this on my side?
Task 'calculate_bleurt_score': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    logger=self.logger,
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs) # type: ignore
  File "on_prem_translationevaluator_flow/flow_local.py", line 355, in calculate_bleurt_score
    scorer = score.BleurtScorer(checkpoint)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 173, in __init__
    self._predictor.initialize()
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 63, in initialize
    imported = tf.saved_model.load(self.checkpoint)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 936, in load
    result = load_internal(export_dir, tags, options)["root"]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 994, in load_internal
    root = load_v1_in_v2.load(export_dir, tags)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 282, in load
    result = loader.load(tags=tags)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 230, in load
    self.restore_variables(wrapped, restore_from_saver)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 114, in restore_variables
    constant_op.constant(self._variables_path))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1601, in __call__
    return self._call_impl(args, kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 244, in _call_impl
    args, kwargs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1619, in _call_impl
    return self._call_with_flat_signature(args, kwargs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1668, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1854, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 504, in call
    ctx=ctx)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
    [[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
    [[save/RestoreV2/_403]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
  (1) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
    [[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored. [Op:__inference_pruned_13295]
No worries.. I read it in the documentation https://docs.prefect.io/orchestration/flow-runs/concurrency-limits.html#flow-run-limits It looks like this is a premium feature.
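One way to "control this on my side", as asked above, is to serialize the GPU-heavy section across flow-run processes with a file lock. The following is only a sketch under assumptions: Linux, all runs on the same machine (as with the local agent), and a hypothetical lock file path and helper name `gpu_lock`. It is not a Prefect feature and not a substitute for the flow run concurrency limits in the linked docs.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager
from typing import Iterator

# Hypothetical lock file; any path shared by all flow-run processes would do
GPU_LOCK_PATH = os.path.join(tempfile.gettempdir(), "gpu0.lock")


@contextmanager
def gpu_lock(path: str = GPU_LOCK_PATH) -> Iterator[None]:
    """Hold an exclusive lock on the lock file for the duration of the block."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o666)
    try:
        # Blocks while another process (or descriptor) holds the lock
        fcntl.flock(fd, fcntl.LOCK_EX)
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def run_gpu_section() -> str:
    """Placeholder for the GPU-heavy part of a task, run one at a time."""
    with gpu_lock():
        # e.g. load the BLEURT checkpoint and score here
        return "done"


if __name__ == "__main__":
    print(run_gpu_section())
```

A second flow run that reaches `gpu_lock()` while the first still holds it simply waits, so the two runs no longer allocate GPU memory at the same time. The lock only coordinates code that opts in to it, and waiting runs still count against any task timeouts.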
👍 2