I need to split a dataframe into N parts using numpy.array_split prior to loading into a database. Is this best done with a loop in the load task or in the flow itself?
Kevin Kho
12/09/2021, 4:14 PM
If your load task returns a list of DataFrames and then you want to use map on each part, then it seems to make sense in the load task.
Kevin Kho
12/09/2021, 4:14 PM
But I don’t think it matters too much one way or the other
Jason Motley
12/09/2021, 4:15 PM
Thank you!
Jason Motley
12/09/2021, 4:18 PM
Could you provide a quick example?
Kevin Kho
12/09/2021, 4:31 PM
Are you splitting on something or just even splits?
Jason Motley
12/09/2021, 4:32 PM
I believe it will be even. This would be for a weekly "full table refresh" that may be too slow for individual loads.
Kevin Kho
12/09/2021, 4:39 PM
Should just be:
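import numpy as np
import pandas as pd
df = pd.DataFrame({"a":[1,2,3,4,5,6],"b":[1,2,3,4,5,6]})
N = 3
# list of N DataFrames with (nearly) equal row counts, in the original row order
list_dfs = np.array_split(df, N)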
Kevin Kho
12/09/2021, 4:40 PM
Or you can make a new column to control the split like this:
df = pd.DataFrame({"a":[1,2,3,4,5,6],"b":[1,2,3,4,5,6]})
N = 3
df = df.assign(new_col=np.mod(np.arange(df.shape[0]),N))
list_dfs = []
for n in range(N):
    list_dfs.append(df.loc[df["new_col"] == n])
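For the even-split case, here is a minimal sketch of how the pieces could fit together inside a load step. It splits the row positions with np.array_split and slices with iloc, so the DataFrame itself is never passed through NumPy. The names split_frame and load_chunk are hypothetical, and load_chunk is only a stand-in for whatever actually writes to the database.

```python
import numpy as np
import pandas as pd


def split_frame(df: pd.DataFrame, n_parts: int) -> list:
    """Split df into n_parts nearly equal row chunks, preserving row order."""
    # Split the row positions rather than the frame itself, then slice with iloc.
    position_groups = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[positions] for positions in position_groups]


def load_chunk(chunk: pd.DataFrame) -> int:
    """Hypothetical stand-in for the database write; returns the row count it was given."""
    return len(chunk)


df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6, 7], "b": [1, 2, 3, 4, 5, 6, 7]})
chunks = split_frame(df, 3)

# Plain loop over the parts; in a flow, this is where a mapped load task
# over the list of chunks would go instead.
rows_loaded = [load_chunk(chunk) for chunk in chunks]
```

When the row count does not divide evenly by N, the earlier chunks get the extra rows, so the parts differ in size by at most one row.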