# ask-community
Jacob Wilson
Hi all, I have written a Flow to perform an ETL process. Over the course of the Flow, thousands of API requests are made (using the `.map` function). Consequently, my Flow takes over 24 hours to make the API calls and parse all the data. The Flow uses a local Dask executor and runs in an ECS container. Does anyone have ideas on how I might shave some time off my Flow's execution?
Kevin Kho
Hi @Jacob Wilson! What is the bottleneck? Parsing or the API calls?
Jacob Wilson
@Kevin Kho The bottleneck is in the parsing.
```python
results_json = get_results_json.map(id_list, unmapped(BEARER_TOKEN), unmapped(BASE_URL), unmapped(HEADERS))
results = parse_rubric_results_json(results_json)
```
I have two functions here: `get_rubric_results`, which takes about an hour to complete, and `parse_rubric_results`, which takes about 15 hours. `parse_rubric_results` takes in a list of JSON objects, iterates through the list, picks out the specific fields we are looking for in each object, appends those fields to a data frame, and finally returns the data frame.
Kevin Kho
Just throwing out ideas, but I think you can split these into two separate flows orchestrated by a bigger flow. That way you can give each flow its own hardware spec and devote more hardware to the parsing subflow. Of course, this approach requires some rewriting to pass the data across flows.
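Something like this for the parent flow (a minimal Prefect 1.x sketch; the flow and project names are placeholders, and each child flow would carry its own run config, e.g. an `ECSRun` with more CPU/memory for the parsing flow):
```python
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# Placeholder flow/project names -- point these at your own registered flows
extract_flow = StartFlowRun(flow_name="get-rubric-results", project_name="etl", wait=True)
parse_flow = StartFlowRun(flow_name="parse-rubric-results", project_name="etl", wait=True)

with Flow("rubric-etl-parent") as parent:
    extract_run = extract_flow()
    # The extract flow would need to persist its results somewhere shared
    # (e.g. S3) for the parse flow to pick up
    parse_run = parse_flow(upstream_tasks=[extract_run])
```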
This is assuming your flow benefits from parallelization. Are you using a Pandas DataFrame? I believe appending is a slow operation that might be the bottleneck. How are you doing it?
Jacob Wilson
You're right, I am appending to a pandas DataFrame. Maybe there's a better way to do this.
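Roughly, it looks like this (simplified; `id` and `score` stand in for the real fields we extract):
```python
import pandas as pd

def parse_rubric_results(results_json):
    df = pd.DataFrame()
    for obj in results_json:
        row = {"id": obj["id"], "score": obj["score"]}
        # Appends one row at a time, copying the whole frame each iteration
        df = df.append(row, ignore_index=True)
    return df
```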
Kevin Kho
It might be a lot faster for pandas if you create one list per column, append the values to those lists, and create the DataFrame from the lists at the end: `pd.DataFrame({'col1': list1, 'col2': list2})`. Running `df1 = df1.append(df2)` inside a loop is painfully slow.
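For example (same placeholder fields as your sketch; as a side note, `df.append` was deprecated in pandas 1.4 and removed in 2.0, which is another reason to avoid it):
```python
import pandas as pd

# Placeholder sample of the mapped task's output
results_json = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]

# Collect values into per-column lists...
ids, scores = [], []
for obj in results_json:
    ids.append(obj["id"])
    scores.append(obj["score"])

# ...then build the DataFrame once, instead of once per row
df = pd.DataFrame({"id": ids, "score": scores})
```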
Maybe `pd.read_json` can speed up your operation as well, by reading the data in all at once and then doing vectorized operations on the DataFrame.
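Something along these lines (a sketch with the same placeholder fields; if the JSON is nested, `pd.json_normalize` is another option):
```python
import io
import json
import pandas as pd

# Placeholder sample of the mapped task's output
results_json = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]

# Load everything at once, then work on whole columns instead of looping
df = pd.read_json(io.StringIO(json.dumps(results_json)))
high_scores = df[df["score"] > 0.8]  # vectorized filter
```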