Hey Good Morning everyone Have anyone tried to use an NVIDIA Prefect Community #ask-community

Sen

03/04/2022, 5:30 AM

Hey, good morning everyone. Has anyone tried to use NVIDIA-Docker as a Prefect agent to run some flows? I have the setup ready and running, but when the task that needs the GPU starts executing, it doesn't use the GPU and runs on the CPU instead. I can see the logs below:

https://pasteboard.co/kpsNsqpN3aXP.png

Sen

03/04/2022, 5:31 AM

How do you make this work?

Sen

03/04/2022, 5:52 AM

This is the command I use to run the prefect agent in the local machine:

docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia  nvidia_docker_agent:latest

Sen

03/04/2022, 5:55 AM

When I run the command below on the local machine, nvidia-smi produces the following output, which shows that the Docker container has access to the GPU as well.

docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia  nvidia_docker_agent:latest nvidia-smi

https://pasteboard.co/j4OV6UW4YoY5.png

Anna Geller

03/04/2022, 8:52 AM

Interesting! So the CUDA drivers seem to be installed fine. It's quite honestly a bit hard to debug because it's a non-Prefect-related issue and I have no CUDA-compatible machine to reproduce and test this atm. Could you start with the problem that you try to solve? What's your infrastructure? Do you already have some on-prem instances with GPUs that you want to utilize for your flows?

Sen

03/04/2022, 8:55 AM

yes. I do have a machine with RTX 3090 GPU which I wanted to use to host some agents for this specific task

Anna Geller

03/04/2022, 8:57 AM

in general, running a docker agent in a docker container is problematic, but you can just start the docker agent on this machine as a local process using:

prefect agent docker start --label GPU --key CLOUD_API_KEY

and then check if the flow run containers that get spun up can utilize the GPU resources

Sen

03/04/2022, 8:58 AM

So my problem is that I need to run some logic to calculate a score called BLEURT given a URL to a CSV. So I started off by creating a Prefect flow to do this, and then I created a Docker agent with the necessary libraries. The base image of the Docker agent is

nvidia/cuda:11.4.0-runtime-ubuntu20.04

Sen

03/04/2022, 8:59 AM

ok let me try your way..

Anna Geller

03/04/2022, 9:02 AM

"My way" sounds scary 😄 it's just one of many possible options. As an alternative to a Docker agent spun up as a local process (rather than a docker container), you can always have a local agent running on this machine, and this should work fine because Prefect will then just spin up local subprocesses for the flow run and those processes will definitely be able to utilize the GPU resources on the machine. And to manage code dependencies such as those required by e.g. the Bleurt-calculating flow you can create a virtual environment and spin up an agent in this virtual environment. I could help you configure that with Conda virtual environment since I did that in the past

Sen

03/04/2022, 9:05 AM

I already have a conda environment for this, but I always get confused on how to start this agent locally:

# way - 1
prefect agent docker start --label GPU 

# way - 2
prefect agent local start --label GPU

Sen

03/04/2022, 9:07 AM

what is the difference, if both run locally as local subprocesses?

Anna Geller

03/04/2022, 9:07 AM

Have you signed up for Prefect Cloud already or are you still on Server? the commands above are correct and you can always add --help to get more info and check the syntax e.g.

prefect agent local start --help

Sen

03/04/2022, 9:08 AM

yes.. I have signed up on the cloud yesterday

Anna Geller

03/04/2022, 9:08 AM

Good question! The difference is:
• docker agent subprocess spins up new flow runs as Docker containers
• local agent subprocess spins up new flow runs as other local subprocesses

Sen

03/04/2022, 9:09 AM

so for accessing the GPU's in my flows, I can use both ways, is that right?

Anna Geller

03/04/2022, 9:13 AM

Yes, I believe you can 🙂
• With local agents, I'm sure they can directly utilize the GPU resources on your machine.
• With the Docker agents, I'm not 100% sure since we rely on docker-py, so you may need to configure some extra host_config args on the DockerRun to make that work. E.g. here I can see a config called device_requests which you may have to configure, and there may be some extra work to make that work. I don't have enough experience with GPU-based workloads to say for sure.
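
For the Docker route, here is a rough sketch of what such a device_requests entry might look like, written as a plain dict so it runs without docker-py installed. The field names follow the DeviceRequest object of the Docker Engine API; the helper name gpu_device_request is hypothetical, and whether Prefect's DockerRun forwards this value unchanged to docker-py is an assumption that this thread does not confirm.

```python
def gpu_device_request(count=-1):
    """Build a Docker Engine API style DeviceRequest for NVIDIA GPUs.

    count=-1 requests all available GPUs, similar to `docker run --gpus all`.
    """
    return {
        "Driver": "nvidia",
        "Count": count,
        "Capabilities": [["gpu"]],
    }


# Candidate value for the host_config argument mentioned above.
host_config = {"device_requests": [gpu_device_request()]}
print(host_config)

# Untested assumption: with Prefect 1.x this might be passed as
#   DockerRun(image="<your-image>", host_config=host_config)
```

Treat this as a starting point for experiments rather than a known-good configuration.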

Sen

03/04/2022, 9:27 AM

So if I want to run this flow locally, can I change the run_config to UniversalRun().. And the storage to Local()..

Anna Geller

03/04/2022, 11:27 AM

@Sen yes, absolutely! Just pass the label of the local agent and you should be good to go 👍

Sen

03/04/2022, 11:52 AM

@Anna Geller yeah.. the local agent way actually works and I can see the GPU being used.. Do you know how to run a local agent in the background in a machine? Should I use the nohup way or is there a better way to have a local agent always running in my machine..

Anna Geller

03/04/2022, 12:08 PM

Great question! You certainly can use nohup, even though we usually recommend using supervisor. For supervisor, we have docs on how to implement that here. Basically, run this command:

prefect agent local install --key API_KEY --label YOUR_AGENT_LABEL > supervisord.conf

and it generates a supervisord.conf file that you can use to start the process:

supervisord -c ./supervisord.conf

Anna Geller

03/04/2022, 12:10 PM

And if you want to start the agent using e.g. a virtual environment with Conda, you can create a file supervisord.conf with this content:

[unix_http_server]
file=/tmp/supervisor.sock   ; the path to the socket file

[supervisord]
loglevel=debug               ; log level; default info; others: debug,warn,trace

[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface

[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL  for a unix socket

[program:prefect-agent]
command=/Users/yourName/opt/anaconda3/envs/yourVenvName/bin/prefect agent local start --key API_KEY --label YOUR_AGENT_LABEL

and then again, start it using:

supervisord -c ./supervisord.conf

Note that you may need to adjust the permissions because the user who starts the supervisor process needs to be able to run this and write to supervisord.log. So you could e.g. start the supervisor process as the root user using -u root, and you can also specify the location of your process logs with -l
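
The [program:prefect-agent] section in the config above hard-codes the path to the prefect executable inside the Conda environment. As a small sketch, that section could be rendered from the environment prefix instead. The function name render_program_section is hypothetical, the prefix below is the one visible in the tracebacks later in this thread, and API_KEY stays a placeholder.

```python
import posixpath


def render_program_section(env_prefix, api_key, label, name="prefect-agent"):
    """Render a supervisord [program:...] section that starts a local agent
    using the prefect executable from the given environment prefix."""
    prefect_bin = posixpath.join(env_prefix, "bin", "prefect")
    command = f"{prefect_bin} agent local start --key {api_key} --label {label}"
    return f"[program:{name}]\ncommand={command}\n"


section = render_program_section(
    env_prefix="/home/sen/anaconda3/envs/prefect_py37",  # example Conda env prefix
    api_key="API_KEY",                                   # placeholder, not a real key
    label="GPU",
)
print(section)
```

The other sections of the file ([unix_http_server], [supervisord], and so on) would stay as shown above.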

Sen

03/04/2022, 12:14 PM

thank you so much.. this is really useful.. I will try and see if I can come around right.. 🙏

Anna Geller

03/04/2022, 12:14 PM

I also assume you run it on some VM right?

Anna Geller

03/04/2022, 12:18 PM

You could e.g. make the process even more reliable by adding a crontab entry that runs this command any time you start or restart your VM. This may be useful if you ever shut down the VM (e.g. if you stop it for the night to save costs on your cloud bill). Here is how I recently did it on an Azure VM:

echo "@reboot root supervisord -c /home/azureuser/supervisord.conf -l /home/azureuser/supervisord.log -u root" >> /etc/crontab

Anna Geller

03/04/2022, 12:19 PM

☝️ assumes you run a Linux VM

Sen

03/04/2022, 12:33 PM

nope this is not a vm but otherwise running ubuntu 20.04

Anna Geller

03/04/2022, 12:52 PM

Oh you're right, if that's running on your laptop, then it's probably even more useful to automatically restart the agent upon reboot 😄 Feel free to post your GPU-powered flows to a public GitHub repo and share - I'm sure many from the community may find it very useful! I really like that you shared your experience with Prefect here https://github.com/Navaneethsen/prefect_docker_experiment

Sen

03/04/2022, 12:53 PM

Sure I will do that..

Kevin Kho

03/04/2022, 2:39 PM

We use dockerpy under the hood and attaching a GPU to a docker image is unsupported right now.

Kevin Kho

03/04/2022, 2:40 PM

You can this

Sen

03/04/2022, 3:23 PM

Does this mean it is prefect issue where the docker can't connect to GPU?

Kevin Kho

03/04/2022, 3:30 PM

Uhh I’d prefer to say docker-py issue rather than Prefect issue cuz it’s not something we can do anything about until they expose it but I guess that’s just semantics. Someone did type up a hack for me here.

index.html

Kevin Kho

03/04/2022, 3:31 PM

It’s an HTML page of my own notes so you need to view it in a browser

Sen

03/04/2022, 3:37 PM

I wanted to try the local way as @Anna Geller suggested. This actually works, but I am having issues with HTTP_PROXY as my machine is inside a proxy network. Do you know how I can supply HTTP_PROXY to the flow? Does it need to be a dictionary or can I give KEY=VAL?

Sen

03/04/2022, 3:38 PM

@Kevin Kho but I am going to try your way once I get past this step, as I will then have something to demo to the team..

Kevin Kho

03/04/2022, 3:44 PM

How would you do it in native Python?

Sen

03/04/2022, 3:45 PM

for docker I would pass them in a .env file

Sen

03/04/2022, 3:45 PM

but when I try to use the same file it is complaining about the dictionary structure

Sen

03/04/2022, 3:45 PM

I never tried this in a local run yet

Sen

03/04/2022, 3:47 PM

right now I constructed a file with the dict structure {k0:v0, k1:v1, etc.} and planning to pass it with

prefect agent local start -e env_dict_file

Sen

03/04/2022, 3:47 PM

Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/bin/prefect", line 8, in <module>
    sys.exit(cli())
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 182, in start
    start_agent(LocalAgent, import_paths=list(import_paths), **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 130, in start_agent
    env_vars = dict(e.split("=", 1) for e in env)
ValueError: dictionary update sequence element #0 has length 1; 2 is required

Sen

03/04/2022, 3:47 PM

this is the error I get
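
The last frame of the traceback shows how the agent parses each -e value: dict(e.split("=", 1) for e in env). A minimal sketch that reproduces this one line in isolation suggests why a file name fails while KEY=VAL strings work. The function name parse_env_args and the proxy URL are made up for illustration.

```python
def parse_env_args(env):
    """Mimic the parsing line from the traceback (prefect/cli/agent.py, line 130)."""
    return dict(e.split("=", 1) for e in env)


# Each value is a KEY=VAL string, so every split yields a (key, value) pair.
ok = parse_env_args([
    "HTTP_PROXY=http://proxy.example.internal:8080",   # hypothetical proxy URL
    "HTTPS_PROXY=http://proxy.example.internal:8080",
])
print(ok)

# A bare file name contains no "=", so split() returns a single element
# and dict() raises the same kind of ValueError as in the traceback.
try:
    parse_env_args(["env_dict_file"])
except ValueError as exc:
    print(f"ValueError: {exc}")
```

Under that reading, each variable would be passed as its own -e KEY=VAL argument rather than as a path to a file containing a dict.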

Anna Geller

03/04/2022, 3:47 PM

@Sen my understanding now is: Kevin’s notes demonstrate that the Docker option is possible but they also prove that this is way more challenging to set up. So I would use the local agent option at first for your initial PoC and see how it goes. Once you hit some limits with the local agent and a single VM, the chances are that you need a cluster of machines anyway, and then you may need to dockerize the process to distribute it across several instances. when it comes to this HTTP_PROXY, you only need outbound traffic from your VM to Prefect Cloud so not sure if setting this proxy is needed - can you explain it a bit more?

Sen

03/04/2022, 3:49 PM

so in one of my tasks I am trying to contact the Azure key Vault to get some keys.. the connection to the keyvault fails throwing the below error:

Task 'task_authenticate_and_get_keys': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    logger=self.logger,
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs) # type: ignore
  File "on_prem_translationevaluator_flow/flow_local.py", line 171, in task_authenticate_and_get_keys
    secrets["COSMOS_WRITE_URL"] = project_odin_key_vault_client.get_secret("COSMOS-WRITE-URL").value
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_client.py", line 72, in get_secret
    **kwargs
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/_operations_mixin.py", line 1475, in get_secret
    return mixin_instance.get_secret(vault_base_url, secret_name, secret_version, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/v7_1/operations/_key_vault_client_operations.py", line 276, in get_secret
    pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 211, in run
    return first_node.send(pipeline_request) # type: ignore
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 2 more times]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_redirect.py", line 158, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 457, in send
    raise err
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 435, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_shared/challenge_auth_policy.py", line 104, in send
    challenger = self.next.send(challenge_request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 1 more time]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 103, in send
    self._sender.send(request.http_request, **request.context.options),
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 285, in send
    raise error
azure.core.exceptions.ServiceRequestError: ('Cannot connect to proxy.', OSError('Tunnel connection failed: 502 Proxy Error ( Host was not found )'))

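The traceback ends in azure-core's requests-based transport (_requests_basic.py). Python HTTP clients commonly pick up proxy settings from environment variables such as HTTP_PROXY and HTTPS_PROXY, and the standard library exposes the same lookup. The sketch below only demonstrates that lookup; the proxy URL is hypothetical, and which values a given network needs is not stated in this thread.

```python
import os
import urllib.request

# Hypothetical proxy URL; replace it with the proxy of your own network.
os.environ["HTTPS_PROXY"] = "http://proxy.example.internal:8080"
# A lowercase https_proxy variable would take precedence, so clear it for the demo.
os.environ.pop("https_proxy", None)

# Collect the proxy settings visible to this process from *_proxy variables.
proxies = urllib.request.getproxies_environment()
print(proxies.get("https"))
```

If the variables are missing or wrong inside the flow-run process, the client may fail to reach the proxy even though the same code works from an interactive shell.
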
Anna Geller

03/04/2022, 3:50 PM

regarding your env variable issue, I answered exactly the same question today to another user 🙂 it’s here: https://prefect-community.slack.com/archives/CL09KU1K7/p1646388377852089?thread_ts=1646355042.553639&cid=CL09KU1K7

Anna Geller

03/04/2022, 3:53 PM

ok, I read more and you’re right - setting env variables should indeed fix your issue. @Sen you can do that as explained in the post above by attaching the config read with the python-dotenv package to your UniversalRun run config
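
A rough sketch of that idea: read KEY=VAL pairs from a .env file into a dict and attach the dict to the run config. To keep the example runnable without extra packages, a small hand-written parser stands in for python-dotenv here. The helper name read_env_file and the file contents are hypothetical, and the Prefect lines are left as comments that assume Prefect 1.x.

```python
import os
import tempfile


def read_env_file(path):
    """Parse simple KEY=VAL lines into a dict, skipping blanks and # comments.

    A stand-in for python-dotenv that handles only the simplest case.
    """
    env = {}
    with open(path, encoding="utf-8") as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            env[key.strip()] = value.strip().strip('"').strip("'")
    return env


# Demo with a throwaway .env file containing hypothetical values.
with tempfile.TemporaryDirectory() as tmp:
    env_path = os.path.join(tmp, ".env")
    with open(env_path, "w", encoding="utf-8") as fh:
        fh.write("# proxy settings\n")
        fh.write("HTTP_PROXY=http://proxy.example.internal:8080\n")
        fh.write('HTTPS_PROXY="http://proxy.example.internal:8080"\n')
    env = read_env_file(env_path)

print(env)

# With Prefect 1.x installed, the dict could then be attached to the flow
# (assumption: the local agent was started with --label GPU):
#   from prefect.run_configs import UniversalRun
#   flow.run_config = UniversalRun(env=env, labels=["GPU"])
```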

Sen

03/04/2022, 3:57 PM

yeah.. tried the dotenv and it works now.. shoo.. 👏👏well done @Anna Geller @Kevin Kho.. I don't think I get this kind of support from anywhere else.. 🤩

Sen

03/04/2022, 7:30 PM

How do I control the number of parallel runs on this local agent? Because it is using the GPUs, I get the RESOURCE_EXHAUSTED error when two runs happen in parallel. Can this be controlled in any way? Or do I need to control this on my side?

Task 'calculate_bleurt_score': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    logger=self.logger,
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs) # type: ignore
  File "on_prem_translationevaluator_flow/flow_local.py", line 355, in calculate_bleurt_score
    scorer = score.BleurtScorer(checkpoint)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 173, in __init__
    self._predictor.initialize()
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 63, in initialize
    imported = tf.saved_model.load(self.checkpoint)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 936, in load
    result = load_internal(export_dir, tags, options)["root"]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 994, in load_internal
    root = load_v1_in_v2.load(export_dir, tags)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 282, in load
    result = loader.load(tags=tags)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 230, in load
    self.restore_variables(wrapped, restore_from_saver)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 114, in restore_variables
    constant_op.constant(self._variables_path))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1601, in __call__
    return self._call_impl(args, kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 244, in _call_impl
    args, kwargs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1619, in _call_impl
    return self._call_with_flat_signature(args, kwargs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1668, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1854, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 504, in call
    ctx=ctx)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:
2 root error(s) found.
  (0) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
     [[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
     [[save/RestoreV2/_403]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
  (1) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
     [[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored. [Op:__inference_pruned_13295]

Sen

03/04/2022, 7:33 PM

No worries.. I read it in the documentation https://docs.prefect.io/orchestration/flow-runs/concurrency-limits.html#flow-run-limits It looks like this is a premium feature.
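
If flow run concurrency limits are not available, one way to control this on your own side, assuming all flow runs land on the same Linux machine, is an advisory file lock around the GPU-bound work so that only one process uses the GPU at a time. This is a sketch of that general technique, not a Prefect feature; the lock path and the function names gpu_lock and calculate_score are hypothetical.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock file shared by all flow-run processes on this machine.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "gpu_task.lock")


@contextmanager
def gpu_lock(path=LOCK_PATH):
    """Hold an exclusive advisory lock; other processes wait until it is released."""
    with open(path, "w") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks while another process holds the lock
        try:
            yield
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def calculate_score():
    # Placeholder for the GPU-bound work, e.g. loading the BLEURT checkpoint.
    return 0.0


# Inside the task, wrap only the GPU-bound section.
with gpu_lock():
    result = calculate_score()
print(result)
```

This serializes the GPU section, so a second run waits instead of failing; it does not cap the number of flow runs the agent starts.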
