Hi everyone. Does anyone have experience/examples of getting a large response from an HTTP GET call and writing it to S3? Prefect has the map functionality, which seems like a good way to do this. If I were not using Prefect, I'd do something like this:
import boto3
import requests

s3_client = boto3.client('s3')
with requests.request("GET", url, stream=True) as r:
    r.raise_for_status()
    # Stream the raw body straight into S3; the whole response
    # never has to fit in memory on the worker.
    s3_client.upload_fileobj(r.raw, Bucket='xxx', Key='xxx')
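(Note: upload_fileobj performs a managed multipart upload under the hood, configurable with boto3.s3.transfer.TransferConfig, so the upload side is already parallelized across threads; the download itself is still a single stream.)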
Sébastien
12/09/2020, 2:18 PM
If you want to fully parallelize, you'll need to fetch the size beforehand, split it into chunks, and run a map on the chunk parts (create a separate stream url<>S3 for each chunk). The requests.request is a single streaming request; that single stream can't be split into pieces without limiting yourself to the I/O throughput of the initial worker (AFAIK).
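A minimal sketch of that fully parallel approach, assuming the server honors Range requests and reports Content-Length; the chunk size, the 'xxx' bucket/key placeholders, and the plan_chunks/upload_chunk/transfer helpers are illustrative, not from the thread. Each upload_chunk call is independent, so it is the natural unit for a Prefect map, and S3's multipart API reassembles the parts by PartNumber:

import boto3
import requests

s3 = boto3.client('s3')

def plan_chunks(url, chunk_size=50 * 1024 * 1024):
    # HEAD the object to learn its total size, then split it into byte ranges.
    size = int(requests.head(url, allow_redirects=True).headers['Content-Length'])
    return [(i + 1, start, min(start + chunk_size, size) - 1)
            for i, start in enumerate(range(0, size, chunk_size))]

def upload_chunk(url, bucket, key, upload_id, part_number, start, end):
    # Fetch one byte range and push it to S3 as one multipart-upload part.
    r = requests.get(url, headers={'Range': f'bytes={start}-{end}'})
    r.raise_for_status()
    part = s3.upload_part(Bucket=bucket, Key=key, UploadId=upload_id,
                          PartNumber=part_number, Body=r.content)
    return {'PartNumber': part_number, 'ETag': part['ETag']}

def transfer(url, bucket='xxx', key='xxx'):
    upload_id = s3.create_multipart_upload(Bucket=bucket, Key=key)['UploadId']
    # In Prefect, this comprehension would become a .map() over the chunk list.
    parts = [upload_chunk(url, bucket, key, upload_id, n, start, end)
             for n, start, end in plan_chunks(url)]
    # S3 stitches the parts back together in PartNumber order.
    s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=upload_id,
                                 MultipartUpload={'Parts': parts})

(One constraint to keep in mind: S3 multipart parts other than the last must be at least 5 MB, so the chunk size can't be arbitrarily small.)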
If you want to keep it simpler yet still speed it up, you should be able to use requests-futures to make it async and run on multiple OS threads (which, in turn, would not guarantee chunk ordering, so make sure you reconstruct it properly once the whole object is streamed).
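A rough sketch of that requests-futures variant, assuming the same Range support and a size learned from an earlier HEAD (fetch_parallel and its parameters are illustrative): keeping the futures in range order and resolving them in that same order is one simple way to handle the reconstruction. Note this buffers the whole object in memory on one worker, so it suits medium-sized payloads:

from requests_futures.sessions import FuturesSession

def fetch_parallel(url, size, chunk_size=8 * 1024 * 1024):
    # Each ranged GET runs on a thread from the session's pool.
    session = FuturesSession(max_workers=8)
    futures = [session.get(url, headers={
                   'Range': f'bytes={start}-{min(start + chunk_size, size) - 1}'})
               for start in range(0, size, chunk_size)]
    # The downloads run concurrently, but resolving the futures in the
    # order they were created reconstructs the byte order for free.
    parts = []
    for future in futures:
        r = future.result()
        r.raise_for_status()
        parts.append(r.content)
    return b''.join(parts)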
Marc Lipoff
12/09/2020, 2:27 PM
ok awesome. what would be the best way to reconstruct it?
Sébastien
12/09/2020, 2:28 PM
Manually, step by step, by first modeling your solution before coding it and making sure it fulfills your needs.