
Aaron Gonzalez

02/16/2023, 10:53 PM
👋 everyone! QQ. I am getting ready to build a little deployment that gsutil rsyncs a bunch of data from S3 to GCS. Unfortunately the data in S3 doesn't really follow a nice partition hierarchy, so I can't just
gsutil rsync s3://some-key/dt=yyyy-mm-dd/ gs://some-key/dt=yyyy-mm-dd/
😢 I am going to give prefect-shell a try for the first time and want to know if people have had a lot of experience with it. For my use case I have about 12K different rsyncs I am going to need to run, and I don't know which of these patterns is preferable:
for src in s3_sources_12k:
    dest = f'gs://some-dest/{src}'
    ShellOperation(
        commands=[f"gsutil rsync -r {src} {dest}"],
        env=env_var_map,
    ).run()
or
with ShellOperation(
    commands=[
        "gsutil rsync -r src1 dest1",
        "gsutil rsync -r src2 dest2",
        "gsutil rsync -r src3 dest3",
        ...
        "gsutil rsync -r src12k dest12k",
    ],
    env=env_var_map,
) as shell_operation:
    shell_process = shell_operation.trigger()
    shell_process.wait_for_completion()
    shell_output = shell_process.fetch_result()
The docs state:
"For long-lasting operations, use the trigger method and utilize the block as a context manager for automatic closure of processes when context is exited."
But I don't know whether "long-lasting" refers to the total number of iterations I might need to make or to how long each one takes (probably not that long, because the data is pretty small).

Andrew Huang

02/16/2023, 11:55 PM
Thanks for bringing this up! I think the first one with each op wrapped in a task would work.
@task
def sync_src(src):
    dest = f'gs://some-dest/{src}'
    ShellOperation(
        commands=[f"gsutil rsync -r {src} {dest}"],
        env=env_var_map,
    ).run()
    return

for src in s3_sources_12k:
    sync_src(src)
Perhaps you can submit an issue regarding this so we can have others jump into the discussion?
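A minimal end-to-end sketch of that approach, assuming Prefect 2.x and prefect-shell; the sync_all flow name, retry count, and placeholder s3_sources_12k / env_var_map values are illustrative, not from the thread:
from prefect import flow, task
from prefect.task_runners import ConcurrentTaskRunner
from prefect_shell import ShellOperation

env_var_map: dict[str, str] = {}   # placeholder: credentials / config for gsutil
s3_sources_12k: list[str] = []     # placeholder: the ~12K S3 prefixes to copy

@task(retries=2)
def sync_src(src: str):
    # one rsync per task run, so visibility and retries are per-source
    dest = f"gs://some-dest/{src}"
    ShellOperation(
        commands=[f"gsutil rsync -r {src} {dest}"],
        env=env_var_map,
    ).run()

@flow(task_runner=ConcurrentTaskRunner())
def sync_all():
    for src in s3_sources_12k:
        sync_src.submit(src)   # submit (rather than call) to run the tasks concurrently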

Aaron Gonzalez

02/17/2023, 2:50 PM
Thanks for the suggestion @Andrew Huang. In my original snippet I didn't share that I already have the entire for loop inside of a single task call. Do you think creating ~12K tasks would be better than having one task that iterates through the list?

Andrew Huang

02/17/2023, 8:16 PM
Depends on whether you want visibility/retries on the tasks and how long each sync takes. If each sync takes milliseconds, then using tasks will probably incur a large overhead.
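A middle ground, sketched here with a hypothetical sync_batch task and batch_size parameter (not from the thread), is to chunk the sources so each task run covers a batch of rsyncs, trading per-source visibility for less per-task overhead:
from prefect import flow, task
from prefect_shell import ShellOperation

env_var_map: dict[str, str] = {}   # placeholder env vars for gsutil

@task
def sync_batch(srcs: list[str]):
    # one task run per batch: far fewer task runs, but a retry re-runs the whole batch
    ShellOperation(
        commands=[f"gsutil rsync -r {src} gs://some-dest/{src}" for src in srcs],
        env=env_var_map,
    ).run()

@flow
def sync_all_batched(s3_sources_12k: list[str], batch_size: int = 100):
    for i in range(0, len(s3_sources_12k), batch_size):
        sync_batch.submit(s3_sources_12k[i : i + batch_size])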

Aaron Gonzalez

02/17/2023, 8:27 PM
good point 🤔