    Josh

    1 year ago
    I’m running a Prefect flow that moves a large number of files from S3 to Google Cloud Storage. When there is a relatively small number of files to transfer, it’s quick. But as the number of files grows, and especially if the file sizes are large, the Prefect process takes up increasingly large amounts of memory. I am using the S3Download and GCSUpload tasks. My suspicion is that the flow is not releasing the memory of the files being transferred. Is there any way to ensure the file contents are being released from memory?
    nicholas

    1 year ago
    Hi @Josh - can you give some information about how you're running your flow and on what version of Prefect?
    Spencer

    1 year ago
    I would suggest not using the S3Download task, as it loads the file into memory. Write a simple @task that calls boto3.client('s3').download_file() to write the object to disk and then blob.upload_from_filename() to push it to GCS. Highly suggest using the built-in tempfile library to simplify management of disk space:
    with tempfile.NamedTemporaryFile('wb+') as f:
        # stream the object straight to disk instead of holding it in memory
        s3_client.download_fileobj(bucket, key, f)
        f.flush()

        # upload from the temp file's path; the file is cleaned up when the block exits
        blob.upload_from_filename(f.name)
    For these types of flows, I find that it works best to do the entire transfer (per file) in a single task rather than relying on Prefect to pass file objects around as Results.
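    To make that concrete, here is a rough sketch of what a single-task transfer could look like. It is only illustrative: the transfer_file name and its parameters are placeholders, and it assumes boto3, google-cloud-storage, and Prefect are installed.
    import tempfile

    import boto3
    from google.cloud import storage
    from prefect import task


    @task
    def transfer_file(s3_bucket: str, key: str, gcs_bucket: str) -> str:
        """Copy one object from S3 to GCS by staging it on local disk."""
        s3_client = boto3.client('s3')
        gcs_blob = storage.Client().bucket(gcs_bucket).blob(key)

        with tempfile.NamedTemporaryFile('wb+') as f:
            # stream the S3 object to a temp file rather than into memory
            s3_client.download_fileobj(s3_bucket, key, f)
            f.flush()
            # upload straight from the file on disk
            gcs_blob.upload_from_filename(f.name)

        # only the key is returned, so Prefect never stores file contents as a Result
        return key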
    nicholas

    1 year ago
    @Spencer is correct: the S3Download task uses io to stream files in memory; this is usually fine for smaller transfers but can become taxing when dealing with large files. If you want to keep the atomicity of your tasks, you can download the file to disk in one task and return a reference to it (such as the file path) to your downstream task, which can then upload it or perform operations on it as normal.
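    For the two-task version, a minimal sketch could look like the one below; the task names and parameters are placeholders rather than an official recipe, and it assumes boto3 and google-cloud-storage alongside Prefect.
    import os
    import tempfile

    import boto3
    from google.cloud import storage
    from prefect import task


    @task
    def download_from_s3(s3_bucket: str, key: str) -> str:
        # download the object to a named temp file and return its path (the reference)
        fd, path = tempfile.mkstemp()
        os.close(fd)
        boto3.client('s3').download_file(s3_bucket, key, path)
        return path


    @task
    def upload_to_gcs(path: str, gcs_bucket: str, blob_name: str) -> None:
        # upload the file referenced by `path`, then clean it up from disk
        blob = storage.Client().bucket(gcs_bucket).blob(blob_name)
        blob.upload_from_filename(path)
        os.remove(path)
    Only the file path travels between the tasks, so Prefect's Results hold a short string instead of the file contents.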