I need to split a dataframe into N parts using num...
# ask-community
j
I need to split a dataframe into N parts using numpy.array_split prior to loading into a database. Is this best done with a loop in the load task or in the flow itself?
k
If your load task returns a list of DataFrames and you then want to use map on each part, it seems to make sense to do the split in the load task.
But I don’t think it matters too much one way or the other
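A rough sketch of that pattern, for illustration only. The names split_frame and load_part are hypothetical, and Python's built-in map stands in for whatever mapping the orchestrator provides; the real load step would write each part to the database instead of counting rows.

```python
import pandas as pd


def split_frame(df: pd.DataFrame, n_parts: int) -> list:
    """Hypothetical helper: cut df into n_parts contiguous row slices.

    Slice sizes differ by at most one row; earlier slices get the extras.
    """
    base, extra = divmod(len(df), n_parts)
    parts = []
    start = 0
    for i in range(n_parts):
        stop = start + base + (1 if i < extra else 0)
        parts.append(df.iloc[start:stop])
        start = stop
    return parts


def load_part(part: pd.DataFrame) -> int:
    """Hypothetical stand-in for the database load; returns the row count."""
    return len(part)


df = pd.DataFrame({"a": range(10), "b": range(10)})
parts = split_frame(df, 3)

# Built-in map used here as a placeholder for a mapped load step.
rows_loaded = list(map(load_part, parts))
```

Returning the list from one step and mapping over it keeps each load independent, so a failed part can be retried without redoing the others.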
j
Thank you!
Could you provide a quick example?
k
Are you splitting on something or just even splits?
j
I believe it will be even. This would be for a weekly "full table refresh" that may be too slow for individual loads.
k
Should just be:
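import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
N = 3
# list of N DataFrames, each a contiguous block of rows
list_dfs = np.array_split(df, N)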
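If the row count does not divide evenly by N, array_split still returns N parts, with the earlier parts one row longer. A small sketch of that, splitting the row positions rather than the frame itself (a choice made here so the example does not depend on how NumPy handles a DataFrame directly):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": range(7), "b": range(7)})
N = 3

# Split the integer row positions 0..6 into N nearly equal groups.
position_chunks = np.array_split(np.arange(len(df)), N)

# Select each group of positions from the frame.
list_dfs = [df.iloc[idx] for idx in position_chunks]

sizes = [len(part) for part in list_dfs]
```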
Or you can make a new column to control the split like this:
df = pd.DataFrame({"a":[1,2,3,4,5,6],"b":[1,2,3,4,5,6]})
N = 3

df = df.assign(new_col=np.mod(np.arange(df.shape[0]),N))

list_dfs = []
for n in range(N):
    list_dfs.append(df.loc[df["new_col"] == n])
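The same round-robin split can probably be written without the helper column by grouping on the modulo array directly. This is only a sketch of an alternative, not necessarily better than the loop above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
N = 3

# Row i goes to group i % N; groupby accepts an array as long as the frame.
group_ids = np.arange(len(df)) % N

# Iterating a groupby yields (key, sub-frame) pairs in sorted key order.
list_dfs = [part for _, part in df.groupby(group_ids)]
```

Each part keeps its original index labels, so the pieces can be traced back to rows in df.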
j
very good, thank you