
Josh

01/22/2021, 10:25 PM
I’m running a Prefect flow that moves a large number of files from S3 to Google Cloud Storage. When there is a relatively small number of files to transfer, it’s quick. But as the number of files grows, and especially when the files are large, the Prefect process takes up increasingly large amounts of memory. I am using the S3Download and GCSUpload tasks. My suspicion is that the flow is not releasing the memory of the files being transferred. Is there any way to ensure the file contents are released from memory?

nicholas

01/22/2021, 10:34 PM
Hi @Josh - can you give some information about how you're running your flow and on what version of Prefect?

Spencer

01/22/2021, 10:35 PM
I would suggest not using the S3Download task, as it loads the file into memory. Write a simple @task that calls boto3.client('s3').download_file() to disk and then blob.upload_from_filename(). I highly suggest using the built-in tempfile library to simplify management of disk space.
with tempfile.NamedTemporaryFile('w+b') as f:  # binary mode: download_fileobj writes bytes
  s3_client.download_fileobj(bucket, key, f)

  # make sure the downloaded bytes are on disk before the upload reads the file
  f.flush()
  f.seek(0)

  blob.upload_from_filename(f.name)
For these types of flows, I find that it works best to do the entire transfer (per file) in a single task rather than relying on Prefect to pass file objects around as Results.
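Roughly, such a single-task transfer could look like the sketch below (untested, and the bucket/key arguments are just placeholders):

import tempfile

import boto3
from google.cloud import storage
from prefect import task


@task
def transfer_file(s3_bucket: str, s3_key: str, gcs_bucket: str, gcs_key: str):
    # Stream the S3 object to a temp file on disk, then upload that file to GCS.
    # The file contents are never returned as a Prefect Result, so nothing large
    # is held in memory between tasks.
    s3_client = boto3.client("s3")
    blob = storage.Client().bucket(gcs_bucket).blob(gcs_key)

    with tempfile.NamedTemporaryFile("w+b") as f:
        s3_client.download_fileobj(s3_bucket, s3_key, f)
        f.flush()
        blob.upload_from_filename(f.name)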

nicholas

01/22/2021, 10:37 PM
@Spencer is correct, the S3Download task uses io to stream files in-memory; this is usually fine for smaller transfers but can become taxing when dealing with large files. If you want to keep the atomicity of your tasks, you can download the file to disk in one task and return a reference to it for your downstream task (which can then upload it or perform operations on it as normal).
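For example, that two-task shape might look something like this rough sketch (the bucket names and keys are placeholders, and it assumes both tasks run somewhere with access to the same local disk):

import boto3
from google.cloud import storage
from prefect import Flow, task


@task
def download_from_s3(bucket: str, key: str) -> str:
    # Download to a local file and return only the path as the task's Result.
    local_path = f"/tmp/{key.replace('/', '_')}"
    boto3.client("s3").download_file(bucket, key, local_path)
    return local_path


@task
def upload_to_gcs(local_path: str, bucket: str, key: str) -> None:
    blob = storage.Client().bucket(bucket).blob(key)
    blob.upload_from_filename(local_path)


with Flow("s3-to-gcs") as flow:
    path = download_from_s3("my-s3-bucket", "data/file.csv")
    upload_to_gcs(path, "my-gcs-bucket", "data/file.csv")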