Sen
03/04/2022, 5:30 AM
https://pasteboard.co/kpsNsqpN3aXP.png
Sen
03/04/2022, 5:31 AM
Sen
03/04/2022, 5:52 AM
docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia nvidia_docker_agent:latest
Sen
03/04/2022, 5:55 AM
docker run --rm --gpus all --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock --env-file ./.env --runtime=nvidia nvidia_docker_agent:latest nvidia-smi
https://pasteboard.co/j4OV6UW4YoY5.png
Anna Geller
Sen
03/04/2022, 8:55 AM
Anna Geller
prefect agent docker start --label GPU --key CLOUD_API_KEY
and then check if the flow run containers that get spun up can utilize the GPU resources
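Not from the thread, just a minimal sketch of such a check, assuming Prefect 1.x and an image that has nvidia-smi on the PATH (the flow name is a placeholder):

import subprocess
from prefect import task, Flow

@task(log_stdout=True)
def check_gpu():
    # Print whatever GPUs are visible inside the flow run container
    print(subprocess.check_output(["nvidia-smi"]).decode())

with Flow("gpu-check") as flow:
    check_gpu()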
Sen
03/04/2022, 8:58 AM
nvidia/cuda:11.4.0-runtime-ubuntu20.04
Sen
03/04/2022, 8:59 AM
Anna Geller
Sen
03/04/2022, 9:05 AM
# way - 1
prefect agent docker start --label GPU
# way - 2
prefect agent local start --label GPU
Sen
03/04/2022, 9:07 AM
Anna Geller
prefect agent local start --help
Sen
03/04/2022, 9:08 AM
Anna Geller
Sen
03/04/2022, 9:09 AM
Anna Geller
host_config args on the DockerRun to make that work - e.g. here I can see a config called device_requests which you may have to configure + there may be some extra work to make that work - I don't have enough experience with GPU-based workloads to say for sure
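For illustration only, a rough sketch of what that might look like, assuming Prefect 1.x and a docker-py version that supports device_requests (the image, label, and flow object are placeholders):

import docker
from prefect.run_configs import DockerRun

# Hypothetical run config attached to an existing `flow` object
flow.run_config = DockerRun(
    image="nvidia/cuda:11.4.0-runtime-ubuntu20.04",
    labels=["GPU"],
    host_config={
        # Ask Docker to expose all available GPUs to the flow run container
        "device_requests": [docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])]
    },
)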
Sen
03/04/2022, 9:27 AM
Anna Geller
Sen
03/04/2022, 11:52 AM
Anna Geller
nohup, even though we usually recommend using supervisor.
For supervisor, we have docs on how to implement that here.
Basically, run this command:
prefect agent local install --key API_KEY --label YOUR_AGENT_LABEL > supervisord.conf
and it generates the command you can use to start the process:
supervisord -c ./supervisord.conf
Anna Geller
supervisord.conf with this content:
[unix_http_server]
file=/tmp/supervisor.sock ; the path to the socket file
[supervisord]
loglevel=debug ; log level; default info; others: debug,warn,trace
[rpcinterface:supervisor]
supervisor.rpcinterface_factory = supervisor.rpcinterface:make_main_rpcinterface
[supervisorctl]
serverurl=unix:///tmp/supervisor.sock ; use a unix:// URL for a unix socket
[program:prefect-agent]
command=/Users/yourName/opt/anaconda3/envs/yourVenvName/bin/prefect agent local start --key API_KEY --label YOUR_AGENT_LABEL
and then again, start it using:
supervisord -c ./supervisord.conf
Note that you may need to adjust the permissions because the user who starts the supervisor process needs to be able to run this and write to supervisord.log.
So you could e.g. start the supervisor process as the root user using -u root, and you can also specify the location of your process logs with -l.
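For example, a sketch of such an invocation (the log path is a placeholder):

supervisord -c ./supervisord.conf -u root -l /var/log/supervisord.log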
Sen
03/04/2022, 12:14 PM
Anna Geller
Anna Geller
crontab that starts this command any time you start or restart your VM - this may be useful if you ever shut down the VM (e.g. if you stop it for the night to save costs on your cloud bill) - here is how I recently did it on an Azure VM:
echo "@reboot root supervisord -c /home/azureuser/supervisord.conf -l /home/azureuser/supervisord.log -u root" >> /etc/crontab
Anna Geller
Sen
03/04/2022, 12:33 PM
Anna Geller
Sen
03/04/2022, 12:53 PM
Kevin Kho
Kevin Kho
Sen
03/04/2022, 3:23 PM
Kevin Kho
Kevin Kho
Sen
03/04/2022, 3:37 PM
Sen
03/04/2022, 3:38 PM
Kevin Kho
Sen
03/04/2022, 3:45 PM
Sen
03/04/2022, 3:45 PM
Sen
03/04/2022, 3:45 PM
Sen
03/04/2022, 3:47 PM
prefect agent local start -e env_dict_file
Sen
03/04/2022, 3:47 PM
Traceback (most recent call last):
File "/home/sen/anaconda3/envs/prefect_py37/bin/prefect", line 8, in <module>
sys.exit(cli())
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1128, in __call__
return self.main(*args, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1053, in main
rv = self.invoke(ctx)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1659, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 1395, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/click/core.py", line 754, in invoke
return __callback(*args, **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 182, in start
start_agent(LocalAgent, import_paths=list(import_paths), **kwargs)
File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/cli/agent.py", line 130, in start_agent
env_vars = dict(e.split("=", 1) for e in env)
ValueError: dictionary update sequence element #0 has length 1; 2 is required
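For reference, -e expects KEY=VALUE pairs rather than a path to a file (the agent splits each value on =), so something like this, with placeholder names, should work:

prefect agent local start --label GPU -e KEY1=value1 -e KEY2=value2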
Sen
03/04/2022, 3:47 PM
Anna Geller
Sen
03/04/2022, 3:49 PM
Task 'task_authenticate_and_get_keys': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    logger=self.logger,
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "on_prem_translationevaluator_flow/flow_local.py", line 171, in task_authenticate_and_get_keys
    secrets["COSMOS_WRITE_URL"] = project_odin_key_vault_client.get_secret("COSMOS-WRITE-URL").value
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/tracing/decorator.py", line 83, in wrapper_use_tracer
    return func(*args, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_client.py", line 72, in get_secret
    **kwargs
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/_operations_mixin.py", line 1475, in get_secret
    return mixin_instance.get_secret(vault_base_url, secret_name, secret_version, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_generated/v7_1/operations/_key_vault_client_operations.py", line 276, in get_secret
    pipeline_response = self._client._pipeline.run(request, stream=False, **kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 211, in run
    return first_node.send(pipeline_request)  # type: ignore
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 2 more times]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_redirect.py", line 158, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 457, in send
    raise err
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/policies/_retry.py", line 435, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/keyvault/secrets/_shared/challenge_auth_policy.py", line 104, in send
    challenger = self.next.send(challenge_request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 71, in send
    response = self.next.send(request)
  [Previous line repeated 1 more time]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/_base.py", line 103, in send
    self._sender.send(request.http_request, **request.context.options),
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/azure/core/pipeline/transport/_requests_basic.py", line 285, in send
    raise error
azure.core.exceptions.ServiceRequestError: ('Cannot connect to proxy.', OSError('Tunnel connection failed: 502 Proxy Error ( Host was not found )'))
Anna Geller
Anna Geller
UniversalRun run config
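A minimal sketch of that, assuming Prefect 1.x (the proxy URL, label, and flow object are placeholders):

from prefect.run_configs import UniversalRun

# Hypothetical run config attached to an existing `flow` object
flow.run_config = UniversalRun(
    labels=["GPU"],
    # Pass the proxy settings through to the flow run environment
    env={
        "HTTP_PROXY": "http://your-proxy:3128",
        "HTTPS_PROXY": "http://your-proxy:3128",
    },
)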
Sen
03/04/2022, 3:57 PM
Sen
03/04/2022, 7:30 PM
Task 'calculate_bleurt_score': Exception encountered during task execution!
Traceback (most recent call last):
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/engine/task_runner.py", line 880, in get_task_run_state
    logger=self.logger,
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/prefect/utilities/executors.py", line 467, in run_task_with_timeout
    return task.run(*args, **kwargs)  # type: ignore
  File "on_prem_translationevaluator_flow/flow_local.py", line 355, in calculate_bleurt_score
    scorer = score.BleurtScorer(checkpoint)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 173, in __init__
    self._predictor.initialize()
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/bleurt/score.py", line 63, in initialize
    imported = tf.saved_model.load(self.checkpoint)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 936, in load
    result = load_internal(export_dir, tags, options)["root"]
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load.py", line 994, in load_internal
    root = load_v1_in_v2.load(export_dir, tags)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 282, in load
    result = loader.load(tags=tags)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 230, in load
    self.restore_variables(wrapped, restore_from_saver)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/saved_model/load_v1_in_v2.py", line 114, in restore_variables
    constant_op.constant(self._variables_path))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1601, in __call__
    return self._call_impl(args, kwargs)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/wrap_function.py", line 244, in _call_impl
    args, kwargs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1619, in _call_impl
    return self._call_with_flat_signature(args, kwargs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1668, in _call_with_flat_signature
    return self._call_flat(args, self.captured_inputs, cancellation_manager)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1854, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 504, in call
    ctx=ctx)
  File "/home/sen/anaconda3/envs/prefect_py37/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 55, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: Graph execution error:
2 root error(s) found.
(0) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor.
Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
[[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
[[save/RestoreV2/_403]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
(1) RESOURCE_EXHAUSTED: SameWorkerRecvDone unable to allocate output tensor.
Key: /job:localhost/replica:0/task:0/device:CPU:0;8ccf64172d29d439;/job:localhost/replica:0/task:0/device:GPU:0;edge_267_save/RestoreV2;0:0
[[{{node save/RestoreV2/_262}}]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info. This isn't available when running in Eager mode.
0 successful operations. 0 derived errors ignored. [Op:__inference_pruned_13295]
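That ResourceExhaustedError means the GPU ran out of memory while restoring the BLEURT checkpoint. One common mitigation, not from this thread and only a sketch assuming TensorFlow 2.x, is to enable memory growth before loading the model so TensorFlow allocates GPU memory on demand:

import tensorflow as tf

# Allocate GPU memory incrementally instead of reserving it all up front
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)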
Sen
03/04/2022, 7:33 PM