
Tim Galvin

11/07/2022, 9:12 AM
What is the correct way of specifying a 'dummy' storage block for a deployment? The code that I have wrapped in Prefect tasks correctly deals with absolute paths. Additionally, the data I am working with are on the order of 2TB, and are not well suited to the copying done by the default local file system logic when
--storage-block
is unset. I thought I could set up a 'dummy' local file system block in the Orion UI (on my own managed server, not Prefect Cloud), however the
prefect deployment build
command says
'github', 's3',  'gcs', 'azure', 'smb'
are supported types. TL;DR - I need to set a
--storage-block
in my deployment, and I am reasonably certain that in my situation I do not want to be copying anything to or from different file systems and blocks. I have a common underlying filesystem at the HPC center, and my data are pretty large -- large enough that I cannot reasonably expect copying to and from the disk to be feasible.
Just to tack on, I might have a workaround, which is building the deployment in an empty directory, so that the default local file system does not need to copy any large amounts of data. However, it feels like a bit of a dirty workaround, and I would prefer to have something more explicit in the deployment yaml to make it clearer what is going on. Also, letting things run, I am not seeing the expected log output captured and presented in the Orion Web UI. And I have just received a httpx timeout error:
Traceback (most recent call last):
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/agent.py", line 154, in get_and_submit_flow_runs
    queue_runs = await self.client.get_runs_in_work_queue(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/client/orion.py", line 759, in get_runs_in_work_queue
    response = await self._client.post(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1842, in post
    return await self.request(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1527, in request
    return await self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/prefect/client/base.py", line 159, in send
    await super().send(*args, **kwargs)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1614, in send
    response = await self._send_handling_auth(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1642, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1679, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_client.py", line 1716, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 353, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/group/askap/miniconda3/envs/acesprefect2/lib/python3.9/site-packages/httpx/_transports/default.py", line 77, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.RemoteProtocolError: Server disconnected without sending a response.
One step forward -- I have tweaked a setting or two, rebuilt and rerun, and I think I have something that is working. However, is there a way in the
Process
infrastructure to change the working directory before execution? I think this is the main bit that caught me out, in combination with needing to be careful with how the default Local File System storage block behaves.
@Anna Geller -- In short:
• Is there a neat, supported way of setting no LFS storage for a deployment? My current workaround is to create the
prefect deployment build
in an empty folder so nothing is copied when I
prefect deployment run
, which for my datasets is a bit of a struggle.
• In relation to the previous point, my typical datasets are ~2TB, but can get as large as 50TB, and copying the data a lot is not feasible. I am planning on setting up these deployments and scripts on some HPC facilities that have plans to use just a
Process
type infrastructure component. Is there a way for the subprocess that is started to change directories to a path I set, ideally via the deployment yaml file so that I can be explicit? The current behaviour seems to be starting the flow in a
/tmp
space.

Kalise Richmond

11/07/2022, 5:55 PM
Hey @Tim Galvin, have you tried using
--skip-upload
on the
prefect deployment build
command? Also there is a
command
field on the Process infra that you could override

Tim Galvin

11/08/2022, 1:17 AM
Hi @Kalise Richmond - thanks for the reply 🙂 I have seen that
--skip-upload
argument. Although I have not tried it, the output of
prefect deployment build
said something like "There are no files to upload. Use
--skip-upload
to suppress this warning". I took that to mean it might not be the answer, as it seems like the LFS copying is copying directories on the compute infrastructure after the agent receives the flow run but before it starts the flow run. I know the data are not being uploaded to my Orion instance, simply because they are too big for the disk 😛 In any case, I will try this option and report back. About the command override: I did see that, and I have previously overridden it when debugging some other silly Tim error. You are right there, and that seems like a good idea. I could do something cheeky like
command:
- cd
- /path/to/workspace
- &&
- python
- -m
- prefect.engine
Thanks for the suggestion!
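A note on the list above: if the Process infrastructure hands the `command` list straight to a subprocess without a shell (an assumption, not confirmed in this thread for this Prefect version), then `cd` would be looked up as a program and `&&` passed as a plain argument, not treated as shell syntax. A minimal sketch of one way around that, in plain Python with no Prefect calls, wraps the real command in `sh -c`. The helper names `wrap_with_cd` and `run_and_capture` are hypothetical, and the sketch assumes a POSIX `sh` is available.

```python
import shlex
import subprocess
import sys
import tempfile


def wrap_with_cd(workdir: str, argv: list) -> list:
    """Return an argv that changes into workdir, then runs argv via `sh -c`."""
    # shlex.quote / shlex.join guard against spaces and shell metacharacters.
    shell_line = f"cd {shlex.quote(workdir)} && exec {shlex.join(argv)}"
    return ["sh", "-c", shell_line]


def run_and_capture(argv: list) -> str:
    """Run argv without a shell (list form) and return its stripped stdout."""
    done = subprocess.run(argv, capture_output=True, text=True, check=True)
    return done.stdout.strip()


if __name__ == "__main__":
    # Stand-in for `python -m prefect.engine`: just report the working directory.
    report_cwd = [sys.executable, "-c", "import os; print(os.getcwd())"]
    with tempfile.TemporaryDirectory() as workspace:
        wrapped = wrap_with_cd(workspace, report_cwd)
        print(wrapped)
        print(run_and_capture(wrapped))
```

Under that assumption, the three-element list returned by `wrap_with_cd("/path/to/workspace", ["python", "-m", "prefect.engine"])` is what would go in the `command` field in place of the list above.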

Anna Geller

11/09/2022, 12:58 PM
To explain: there is a good reason for executing flow runs in a temp directory (no issues with resources that are not cleaned up, etc.). If you need the run to be executed in a specific directory, your trick could be an option, but you could also provide explicit paths in your flow, or install the required modules as a package with setup.py.
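One way to read "provide explicit paths in your flow" is to leave the run in its temp directory and either build absolute paths from a known root, or switch directory only around the code that needs it. A minimal sketch in plain Python (no Prefect calls); `working_directory` and `resolve_under` are hypothetical helper names, and their bodies could be called from inside a flow or task function.

```python
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def working_directory(path):
    """Temporarily chdir into path, restoring the previous directory on exit."""
    previous = Path.cwd()
    os.chdir(path)
    try:
        yield Path.cwd()
    finally:
        # Always restore, even if the wrapped code raises.
        os.chdir(previous)


def resolve_under(root, relative):
    """Build an absolute path under root so code does not depend on the cwd."""
    return (Path(root) / relative).resolve()
```

Note that `os.chdir` changes the directory for the whole process, so with concurrently running threads the absolute-path approach in `resolve_under` is likely the safer of the two.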

Tim Galvin

11/10/2022, 2:10 AM
Thanks Anna -- it does make sense to me to put some things in the
/tmp
store. It was just one of these things when trying to get the deployment running initially that was not very intuitive or clear.
🙌 1