    Josh

    1 year ago
    I’m running a Prefect flow that moves a large number of files from S3 to Google Cloud Storage. When there is a relatively small number of files to transfer, it’s quick. But as the number of files grows, and especially if the file sizes are large, the Prefect process takes up increasingly large amounts of memory. I am using the S3Download and GCSUpload tasks. My suspicion is that the flow is not releasing the memory of the files being transferred. Is there any way to ensure the file contents are being released from memory?
    nicholas

    1 year ago
    Hi @Josh - can you give some information about how you're running your flow and on what version of Prefect?
    Spencer

    1 year ago
    I would suggest not using the S3Download task, as it loads the file into memory. Write a simple @task that calls boto3.client('s3').download_file() to write the object to disk and then blob.upload_from_filename() to push it to GCS. Highly suggest using the built-in tempfile library to simplify management of disk space:
    with tempfile.NamedTemporaryFile('wb+') as f:
        # stream the object straight to disk instead of holding it in memory
        s3_client.download_fileobj(bucket, key, f)
        f.flush()

        # upload from the temp file's path; the file is cleaned up when the block exits
        blob.upload_from_filename(f.name)
    For these types of flows, I find that it works best to do the entire transfer (per file) in a single task rather than relying on Prefect to pass file objects around as Results.
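    To make that concrete, here is a rough sketch of what a single-task transfer could look like. It is only illustrative: the transfer_file name and its parameters are placeholders, and it assumes boto3, google-cloud-storage, and Prefect are installed.
    import tempfile

    import boto3
    from google.cloud import storage
    from prefect import task


    @task
    def transfer_file(s3_bucket: str, key: str, gcs_bucket: str) -> str:
        """Copy one object from S3 to GCS by staging it on local disk."""
        s3_client = boto3.client('s3')
        gcs_blob = storage.Client().bucket(gcs_bucket).blob(key)

        with tempfile.NamedTemporaryFile('wb+') as f:
            # stream the S3 object to a temp file rather than into memory
            s3_client.download_fileobj(s3_bucket, key, f)
            f.flush()
            # upload straight from the file on disk
            gcs_blob.upload_from_filename(f.name)

        # only the key is returned, so Prefect never stores file contents as a Result
        return key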
    nicholas

    1 year ago
    @Spencer is correct: the S3Download task uses io to stream files in memory; this is usually fine for smaller transfers but can become taxing when dealing with large files. If you want to keep the atomicity of your tasks, you can download the file to disk in one task and return a reference to it (such as the file path) to your downstream task, which can then upload it or perform operations on it as normal.
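    For the two-task version, a minimal sketch could look like the one below; the task names and parameters are placeholders rather than an official recipe, and it assumes boto3 and google-cloud-storage alongside Prefect.
    import os
    import tempfile

    import boto3
    from google.cloud import storage
    from prefect import task


    @task
    def download_from_s3(s3_bucket: str, key: str) -> str:
        # download the object to a named temp file and return its path (the reference)
        fd, path = tempfile.mkstemp()
        os.close(fd)
        boto3.client('s3').download_file(s3_bucket, key, path)
        return path


    @task
    def upload_to_gcs(path: str, gcs_bucket: str, blob_name: str) -> None:
        # upload the file referenced by `path`, then clean it up from disk
        blob = storage.Client().bucket(gcs_bucket).blob(blob_name)
        blob.upload_from_filename(path)
        os.remove(path)
    Only the file path travels between the tasks, so Prefect's Results hold a short string instead of the file contents.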