Sen
03/04/2022, 5:30 AM
https://pasteboard.co/kpsNsqpN3aXP.png
docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia nvidia_docker_agent:latest
docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia nvidia_docker_agent:latest nvidia-smi
https://pasteboard.co/j4OV6UW4YoY5.png
Anna Geller
03/04/2022, 8:52 AM
Sen
03/04/2022, 8:55 AM
Anna Geller
03/04/2022, 8:57 AM
prefect agent docker start --label GPU --key CLOUD_API_KEY
and then check if the flow run containers that get spun up can utilize the GPU resources
Sen
03/04/2022, 8:58 AM
nvidia/cuda:11.4.0-runtime-ubuntu20.04
Anna Geller
03/04/2022, 9:02 AM
Sen
03/04/2022, 9:05 AM
# way - 1
prefect agent docker start --label GPU
# way - 2
prefect agent local start --label GPU
Anna Geller
03/04/2022, 9:07 AM
prefect agent local start --help
Sen
03/04/2022, 9:08 AM
Anna Geller
03/04/2022, 9:08 AM
Sen
03/04/2022, 9:09 AM
Anna Geller
03/04/2022, 9:13 AM
host_config args on the DockerRun to make that work, e.g. here I can see a config called device_requests which you may have to configure, and there may be some extra work to make that work. I don't have enough experience with GPU-based workloads to say for sure
Sen
03/04/2022, 9:27 AM
Anna Geller
03/04/2022, 11:27 AM
Sen
03/04/2022, 11:52 AM
Anna Geller
03/04/2022, 12:08 PM
nohup, even though we usually recommend using supervisor.
For supervisor, we have docs on how to implement that here.
Basically, run this command:
prefect agent local install --key API_KEY --label YOUR_AGENT_LABEL > supervisord.conf
and it generates the command you can use to start the process:
supervisord -c ./supervisord.conf
The install command above writes supervisord.conf with this content:
[unix_http_server]
file=/tmp/supervisor.sock ; the path to the socket file
[supervisord]
loglevel=debug ; log level; default info; others: debug,warn,trace
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL for a unix socket
[program:prefect-agent]
command=/Users/yourName/opt/anaconda3/envs/yourVenvName/bin/prefect agent local start --key API_KEY --label YOUR_AGENT_LABEL
and then again, start it using:
supervisord -c ./supervisord.conf
Note that you may need to adjust the permissions because the user who starts the supervisor process needs to be able to run this and write to supervisord.log.
So you could e.g. start the supervisor process as the root user using -u root, and you can also specify the location of your process logs with -l.
Sen
03/04/2022, 12:14 PM
Anna Geller
03/04/2022, 12:14 PM
crontab that starts this command any time you start or restart your VM. This may be useful if you ever shut down the VM (e.g. if you stop it for the night to save costs in your cloud bill). Here is how I recently did it on an Azure VM:
echo "@reboot root supervisord -c /home/azureuser/supervisord.conf -l /home/azureuser/supervisord.log -u root" >> /etc/crontab
Sen
03/04/2022, 12:33 PM
Anna Geller
03/04/2022, 12:52 PM
Sen
03/04/2022, 12:53 PM
Kevin Kho
03/04/2022, 2:39 PM
Sen
03/04/2022, 3:23 PM
Kevin Kho
03/04/2022, 3:30 PM
Sen
03/04/2022, 3:37 PM
Kevin Kho
03/04/2022, 3:44 PM
Sen
03/04/2022, 3:45 PM
prefect agent local start -e env_dict_file
Traceback (most recent call last):
File "/home/sen/anaconda3/envs/prefect_py37/bin/prefect", line 8, in <module>
sys.exit(cli())
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 182, in start
start_agent(LocalAgent, import_paths=list(import_paths), **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 130, in start_agent
env_vars = dict(e.split("=", 1) for e in env)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
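The last frame of the traceback shows the agent building its environment with dict(e.split("=", 1) for e in env), which suggests each -e value is expected to be a KEY=VALUE string rather than a file name. The sketch below reproduces that failure and shows a stricter parser; parse_env_options is a hypothetical helper for illustration, not part of Prefect, and the proxy URL is a placeholder.

```python
def parse_env_options(env):
    """Turn KEY=VALUE strings into a dict.

    Mirrors the split on the first "=" seen in the traceback, but names the
    malformed entry instead of failing inside dict().
    """
    env_vars = {}
    for entry in env:
        key, sep, value = entry.partition("=")
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {entry!r}")
        env_vars[key] = value
    return env_vars


def reproduce_original_error(env):
    """Run the same expression as the traceback and return the error text, if any."""
    try:
        dict(e.split("=", 1) for e in env)
    except ValueError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    # A bare file name has no "=", so split() yields a 1-element list.
    print(reproduce_original_error(["env_dict_file"]))
    # Placeholder values; only the first "=" separates key from value.
    print(parse_env_options(["HTTPS_PROXY=http://proxy.example.invalid:3128"]))
```

Under that reading, passing one -e KEY=VALUE per variable should avoid the ValueError, though the agent's --help output is the place to confirm the exact option syntax.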
Anna Geller
03/04/2022, 3:47 PM
Sen
03/04/2022, 3:49 PM
Task 'task_authenticate_and_get_keys': Exception encountered during task execution!
Traceback (most recent call last):
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
logger=self.logger,
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
return task.run(*args, **kwargs) # type: ignore
File "on_prem_translationevaluator_flow/flow_local.py", line 171, in task_authenticate_and_get_keys
secrets["COSMOS_WRITE_URL"] = project_odin_key_vault_client.get_secret("COSMOS-WRITE-URL").value
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
return func(*args, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_client.py", line 72, in get_secret
**kwargs
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/_operations_mixin.py", line 1475, in get_secret
return mixin_instance.get_secret(vault_base_url, secret_name, secret_version, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/v7_1/operations/_key_vault_client_operations.py", line 276, in get_secret
pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 211, in run
return first_node.send(pipeline_request) # type: ignore
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
response = self.next.send(request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
response = self.next.send(request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
response = self.next.send(request)
[Previous line repeated 2 more times]
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_redirect.py", line 158, in send
response = self.next.send(request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 457, in send
raise err
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 435, in send
response = self.next.send(request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_shared/challenge_auth_policy.py", line 104, in send
challenger = self.next.send(challenge_request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
response = self.next.send(request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
response = self.next.send(request)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
response = self.next.send(request)
[Previous line repeated 1 more time]
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 103, in send
self._sender.send(request.http_request, **request.context.options),
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 285, in send
raise error
azure.core.exceptions.ServiceRequestError: ('Cannot connect to proxy.', OSError('Tunnel connection failed: 502 Proxy Error ( Host was not found )'))
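The "Cannot connect to proxy" error comes from the HTTP transport, so one thing worth checking is which proxy settings the flow run process actually inherits. The diagnostic sketch below uses only the standard library; it assumes (this is not confirmed in the thread) that the proxy is configured through the usual *_proxy environment variables, and describe_proxy_settings and report are hypothetical helper names.

```python
import urllib.request


def describe_proxy_settings():
    """Return the proxy URLs this Python process would pick up, keyed by scheme."""
    # getproxies() reads variables such as HTTP_PROXY / HTTPS_PROXY
    # (and falls back to system settings on some platforms).
    return urllib.request.getproxies()


def report(proxies):
    """Format the proxy mapping as one readable line."""
    if not proxies:
        return "no proxy configured for this process"
    return ", ".join(f"{scheme} -> {url}" for scheme, url in sorted(proxies.items()))


if __name__ == "__main__":
    print(report(describe_proxy_settings()))
```

Running this once in the shell where the flow works and once inside a task started by the agent would show whether the two environments see different proxy values.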
Anna Geller
03/04/2022, 3:50 PM
UniversalRun run config
Sen
03/04/2022, 3:57 PM
Task 'calculate_bleurt_score': Exception encountered during task execution!
Traceback (most recent call last):
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
logger=self.logger,
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
return task.run(*args, **kwargs) # type: ignore
File "on_prem_translationevaluator_flow/flow_local.py", line 355, in calculate_bleurt_score
scorer = score.BleurtScorer(checkpoint)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 173, in __init__
self._predictor.initialize()
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 63, in initialize
imported = tf.saved_model.load(self.checkpoint)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 936, in load
result = load_internal(export_dir, tags, options)["root"]
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 994, in load_internal
root = load_v1_in_v2.load(export_dir, tags)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 282, in load
result = loader.load(tags=tags)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 230, in load
self.restore_variables(wrapped, restore_from_saver)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 114, in restore_variables
constant_op.constant(self._variables_path))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1601, in __call__
return self._call_impl(args, kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 244, in _call_impl
args, kwargs, cancellation_manager)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1619, in _call_impl
return self._call_with_flat_signature(args, kwargs, cancellation_manager)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1668, in _call_with_flat_signature
return self._call_flat(args, self.captured_inputs, cancellation_manager)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1854, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 504, in call
ctx=ctx)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 55, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
[[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[save/RestoreV2/_403]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor. Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
[[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations.
0 derived errors ignored. [Op:__inference_pruned_13295]
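RESOURCE_EXHAUSTED while restoring variables onto GPU:0 usually means the GPU could not allocate the requested memory, for example because another process already holds most of it. One possible mitigation, sketched below under that assumption, is to ask TensorFlow to grow its GPU allocation on demand instead of reserving memory up front. enable_gpu_memory_growth is a hypothetical helper name; the thread does not confirm that this resolves the error.

```python
def enable_gpu_memory_growth():
    """Enable on-demand GPU memory allocation for every visible GPU.

    Returns the number of GPUs configured, or 0 when TensorFlow is not
    installed or no GPU is visible.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return 0

    gpus = tf.config.list_physical_devices("GPU")
    for gpu in gpus:
        # Must run before TensorFlow initializes the GPU, i.e. before
        # any model is loaded; otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)


if __name__ == "__main__":
    print(f"memory growth enabled on {enable_gpu_memory_growth()} GPU(s)")
```

In the flow above this would have to be called at the start of the task, before score.BleurtScorer(checkpoint) loads the saved model. Checking free GPU memory with nvidia-smi while the agent is running would help confirm whether memory pressure is the actual cause.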