Hi everyone. Does anyone have experience/examples of getting a large response from an HTTP GET call and writing it to S3? Prefect has the map functionality, which seems like a good way to do this. If I were not using Prefect, I'd do something like this:
import boto3
import requests

s3_client = boto3.client('s3')
with requests.request("GET", url, stream=True) as r:
    r.raise_for_status()
    # Stream the raw body straight into S3; the whole response
    # never has to fit in memory on the worker.
    s3_client.upload_fileobj(r.raw, Bucket='xxx', Key='xxx')
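(Note: upload_fileobj performs a managed multipart upload under the hood, configurable with boto3.s3.transfer.TransferConfig, so the upload side is already parallelized across threads; the download itself is still a single stream.)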
Sébastien
12/09/2020, 2:18 PM
If you want to fully parallelize, you'll need to fetch the size beforehand, split it into chunks, and run a map on the chunk parts (create a separate stream url<>S3 for each chunk). The requests.request is a single streaming request; that single stream can't be split into pieces without limiting yourself to the I/O throughput of the initial worker (AFAIK).
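A minimal sketch of that fully parallel approach, assuming the server honors Range requests and reports Content-Length; the chunk size, the 'xxx' bucket/key placeholders, and the plan_chunks/upload_chunk/transfer helpers are illustrative, not from the thread. Each upload_chunk call is independent, so it is the natural unit for a Prefect map, and S3's multipart API reassembles the parts by PartNumber:

import boto3
import requests

s3 = boto3.client('s3')

def plan_chunks(url, chunk_size=50 * 1024 * 1024):
    # HEAD the object to learn its total size, then split it into byte ranges.
    size = int(requests.head(url, allow_redirects=True).headers['Content-Length'])
    return [(i + 1, start, min(start + chunk_size, size) - 1)
            for i, start in enumerate(range(0, size, chunk_size))]

def upload_chunk(url, bucket, key, upload_id, part_number, start, end):
    # Fetch one byte range and push it to S3 as one multipart-upload part.
    r = requests.get(url, headers={'Range': f'bytes={start}-{end}'})
    r.raise_for_status()
    part = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=r.content)
    return {'PartNumber': part_number, 'ETag': part['ETag']}

def transfer(url, bucket='xxx', key='xxx'):
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']
    # In Prefect, this comprehension would become a .map() over the chunk list.
    parts = [upload_chunk(url, bucket, key, upload_id, n, start, end)
             for n, start, end in plan_chunks(url)]
    # S3 stitches the parts back together in PartNumber order.
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={'Parts': parts})

(One constraint to keep in mind: S3 multipart parts other than the last must be at least 5 MB, so the chunk size can't be arbitrarily small.)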
If you want to keep it simpler yet still speed it up, you should be able to use requests-futures to make it async and run on multiple OS threads (which, in turn, would not guarantee chunk ordering, so make sure you reconstruct it properly once the whole object is streamed).
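A rough sketch of that requests-futures variant, assuming the same Range support and a size learned from an earlier HEAD (fetch_parallel and its parameters are illustrative): keeping the futures in range order and resolving them in that same order is one simple way to handle the reconstruction. Note this buffers the whole object in memory on one worker, so it suits medium-sized payloads:

from requests_futures.sessions import FuturesSession

def fetch_parallel(url, size, chunk_size=8 * 1024 * 1024):
    # Each ranged GET runs on a thread from the session's pool.
    session = FuturesSession(max_workers=8)
    futures = [session.get(url, headers={
                   'Range': f'bytes={start}-{min(start + chunk_size, size) - 1}'})
               for start in range(0, size, chunk_size)]
    # The downloads run concurrently, but resolving the futures in the
    # order they were created reconstructs the byte order for free.
    parts = []
    for future in futures:
        r = future.result()
        r.raise_for_status()
        parts.append(r.content)
    return b''.join(parts)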
Marc Lipoff
12/09/2020, 2:27 PM
ok awesome. what would be the best way to reconstruct it?
Sébastien
12/09/2020, 2:28 PM
Manually, step by step, by first modeling your solution before coding it and making sure it fulfills your needs.