I need to split a dataframe into N parts using numpy.array_split prior to loading into a database. Is...

Jason Motley

12/09/2021, 4:07 PM

I need to split a dataframe into N parts using numpy.array_split prior to loading into a database. Is this best done with a loop in the load task or in the flow itself?

Kevin Kho

12/09/2021, 4:14 PM

If your load task returns a list of DataFrames and then you want to use map on each part, then it seems to make sense in the load task.

Kevin Kho

12/09/2021, 4:14 PM

But I don’t think it matters too much one way or the other
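
A minimal sketch of the "split first, then handle each part" pattern described above. It is plain Python rather than Prefect code: the list comprehension stands in for mapping a task over the parts. The helper names split_frame and load_part, the table name weekly_refresh, and the in-memory SQLite connection are all assumptions made for illustration.

```python
import sqlite3

import numpy as np
import pandas as pd


def split_frame(df: pd.DataFrame, n_parts: int) -> list:
    """Split df into n_parts contiguous chunks by row position (hypothetical helper)."""
    # Split the row positions, then slice the frame with iloc.
    position_chunks = np.array_split(np.arange(len(df)), n_parts)
    return [df.iloc[chunk] for chunk in position_chunks]


def load_part(part: pd.DataFrame, conn: sqlite3.Connection, table: str) -> int:
    """Append one chunk to the target table and return the number of rows sent."""
    part.to_sql(table, conn, if_exists="append", index=False)
    return len(part)


df = pd.DataFrame({"a": range(10), "b": range(10)})
conn = sqlite3.connect(":memory:")  # stand-in for the real database

parts = split_frame(df, 3)
# In an orchestrator, this loop is where a mapped load task would run once per part.
rows_loaded = [load_part(part, conn, "weekly_refresh") for part in parts]
total_rows = conn.execute("SELECT COUNT(*) FROM weekly_refresh").fetchone()[0]
```

Keeping the split in its own step means the load step only ever sees one chunk, so it makes little difference whether the chunks are produced inside a task or in the flow.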

Jason Motley

12/09/2021, 4:15 PM

Thank you!

Jason Motley

12/09/2021, 4:18 PM

Could you provide a quick example?

Kevin Kho

12/09/2021, 4:31 PM

Are you splitting on something or just even splits?

Jason Motley

12/09/2021, 4:32 PM

I believe it will be even. This would be for a weekly "full table refresh" that may be too slow for individual loads.

Kevin Kho

12/09/2021, 4:39 PM

Should just be:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
N = 3
list_dfs = np.array_split(df, N)  # N roughly equal, contiguous chunks
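
For a table whose row count is not a multiple of N, it may help to see how the chunk sizes come out. The sketch below uses row positions of a hypothetical 7-row frame rather than a real DataFrame; it contrasts np.array_split with np.split, which requires an exact division.

```python
import numpy as np

positions = np.arange(7)  # row positions of a hypothetical 7-row frame

# array_split tolerates a length that N does not divide evenly:
# the leading chunks each get one extra element.
chunks = np.array_split(positions, 3)
chunk_sizes = [len(c) for c in chunks]

# np.split insists on equal parts and raises ValueError otherwise.
try:
    np.split(positions, 3)
    strict_split_ok = True
except ValueError:
    strict_split_ok = False
```

Here chunk_sizes is [3, 2, 2], so an "even" split in this sense means sizes that differ by at most one row.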

Kevin Kho

12/09/2021, 4:40 PM

Or you can make a new column to control the split like this:

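import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
N = 3

# Assign each row a bucket 0..N-1 by its position, cycling through the buckets.
df = df.assign(new_col=np.mod(np.arange(df.shape[0]), N))

# One DataFrame per bucket.
list_dfs = []
for n in range(N):
    list_dfs.append(df.loc[df["new_col"] == n])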
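
A possible variant of the same idea, sketched with groupby so that no helper column ends up in the data being loaded. The name bucket is an assumption for illustration. Unlike the contiguous chunks from array_split, this split is interleaved: rows 0 and 3 land in the first part, rows 1 and 4 in the second, and so on.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6], "b": [1, 2, 3, 4, 5, 6]})
N = 3

# Bucket label per row, cycling 0, 1, 2, 0, 1, 2 by row position.
bucket = np.arange(len(df)) % N

# groupby accepts an array of labels; each group becomes one part.
list_dfs = [part for _, part in df.groupby(bucket)]
first_part_values = list_dfs[0]["a"].tolist()
```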

Jason Motley

12/09/2021, 4:40 PM

very good, thank you
