# ask-community
Jacob Wilson
Hi all, I have written a Flow to perform an ETL process. Over the course of the Flow, thousands of API requests are made (using the `.map` function). Consequently, my Flow takes over 24 hours to make the API calls and parse all the data. The Flow uses a local Dask executor and runs in an ECS container. Does anyone have ideas on how I might shave some time off my Flow's execution?
Kevin Kho
Hi @Jacob Wilson! What is the bottleneck? Parsing or the API calls?
Jacob Wilson
@Kevin Kho The bottleneck is in the parsing.
```python
results_json = get_results_json.map(id_list, unmapped(BEARER_TOKEN), unmapped(BASE_URL), unmapped(HEADERS))
results = parse_rubric_results_json(results_json)
```
I have two functions here: `get_rubric_results`, which takes about an hour to complete, and `parse_rubric_results`, which takes about 15 hours. `parse_rubric_results` takes in a list of JSON objects, iterates through the list, picks out the specific fields we are looking for in each object, appends those fields to a data frame, and finally returns the data frame.
Kevin Kho
Just throwing out ideas, but I think you can split these into two separate flows orchestrated by a bigger flow. That way you can give each flow its own hardware spec and devote more hardware to the parsing subflow. Of course, this approach requires some rewriting to pass the data across flows.
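Something like this for the parent flow (a minimal Prefect 1.x sketch; the flow and project names are placeholders, and each child flow would carry its own run config, e.g. an `ECSRun` with more CPU/memory for the parsing flow):
```python
from prefect import Flow
from prefect.tasks.prefect import StartFlowRun

# Placeholder flow/project names -- point these at your own registered flows
extract_flow = StartFlowRun(flow_name="get-rubric-results", project_name="etl", wait=True)
parse_flow = StartFlowRun(flow_name="parse-rubric-results", project_name="etl", wait=True)

with Flow("rubric-etl-parent") as parent:
    extract_run = extract_flow()
    # The extract flow would need to persist its results somewhere shared
    # (e.g. S3) for the parse flow to pick up
    parse_run = parse_flow(upstream_tasks=[extract_run])
```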
This is assuming your flow benefits from parallelization. Are you using a Pandas DataFrame? I believe appending is a slow operation that might be the bottleneck. How are you doing it?
Jacob Wilson
You're right, I am appending to a pandas DataFrame. Maybe there's a better way to do this.
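Roughly, it looks like this (simplified; `id` and `score` stand in for the real fields we extract):
```python
import pandas as pd

def parse_rubric_results(results_json):
    df = pd.DataFrame()
    for obj in results_json:
        row = {"id": obj["id"], "score": obj["score"]}
        # Appends one row at a time, copying the whole frame each iteration
        df = df.append(row, ignore_index=True)
    return df
```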
Kevin Kho
It might be a lot faster for pandas if you create one list per column, append the values to those lists, and create the DataFrame from the lists at the end: `pd.DataFrame({'col1': list1, 'col2': list2})`. Running `df1 = df1.append(df2)` inside a loop is painfully slow.
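For example (same placeholder fields as your sketch; as a side note, `df.append` was deprecated in pandas 1.4 and removed in 2.0, which is another reason to avoid it):
```python
import pandas as pd

# Placeholder sample of the mapped task's output
results_json = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]

# Collect values into per-column lists...
ids, scores = [], []
for obj in results_json:
    ids.append(obj["id"])
    scores.append(obj["score"])

# ...then build the DataFrame once, instead of once per row
df = pd.DataFrame({"id": ids, "score": scores})
```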
Maybe `pd.read_json` can speed up your operation as well, by reading the data in all at once and then doing vectorized operations on the DataFrame.
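Something along these lines (a sketch with the same placeholder fields; if the JSON is nested, `pd.json_normalize` is another option):
```python
import io
import json
import pandas as pd

# Placeholder sample of the mapped task's output
results_json = [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.7}]

# Load everything at once, then work on whole columns instead of looping
df = pd.read_json(io.StringIO(json.dumps(results_json)))
high_scores = df[df["score"] > 0.8]  # vectorized filter
```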