Dataframes are a fundamental data structure in data science, and Pandas is one of the most widely used libraries for data manipulation in Python. Splitting a Pandas dataframe can be a crucial step in data analysis, and there are several ways to achieve this.

In this article, we will cover various techniques to split a Pandas DataFrame effectively, enhancing your data analysis skills and providing you with the tools to tackle real-world problems.

Why Split a Pandas DataFrame?

Splitting a DataFrame can be useful in many scenarios, such as:

  1. Data cleaning and preprocessing: Dividing the dataset into smaller, more manageable pieces to efficiently apply data transformations.
  2. Feature engineering: Creating new features based on subsets of the data.
  3. Model training and evaluation: Partitioning the data into training, validation, and test sets for machine learning algorithms.

Loading a Sample Dataframe

To follow along, load the example Pandas dataframe given below. You can also use your own data if you have it, although some of the examples may need adjusting to fit your situation.

Start the process by loading some data!

				
import pandas as pd

# Create the example dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 30, 35, 40],
        'Gender': ['M', 'F', 'M', 'F'],
        'Country': ['USA', 'Canada', 'UK', 'Australia']}

df = pd.DataFrame(data)

# Print the dataframe
print(df)

				
			

Output:

				
#     Name  Age Gender    Country
# 0   John   25      M        USA
# 1  Alice   30      F     Canada
# 2    Bob   35      M         UK
# 3  Emily   40      F  Australia

				
			

This will create a dataframe with four columns: “Name”, “Age”, “Gender”, and “Country”, and four rows with some example data.

Method 1: Split a Pandas Dataframe by Column Value

To split a Pandas dataframe by column value and store the resulting dataframes in a dictionary:

				
import pandas as pd

# Create example dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 30, 35, 40],
        'Gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

# Split dataframe by Gender and store resulting dataframes in a dictionary
dfs = {k: v for k, v in df.groupby('Gender')}

# Print resulting dataframes
for k, v in dfs.items():
    print(f"Gender: {k}")
    print(v)

				
			

Output:

				
# Gender: F
#     Name  Age Gender
# 1  Alice   30      F
# 3  Emily   40      F
# Gender: M
#    Name  Age Gender
# 0  John   25      M
# 2   Bob   35      M
				
			

This code will create a dictionary dfs where each key is a unique value in the “Gender” column and the value is a dataframe containing all the rows with that value. It then loops through the dictionary and prints out each dataframe along with its corresponding key.
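
When only one or two subsets are needed, a boolean mask can be a lighter alternative to building the full dictionary. The following is a minimal sketch on the same example data; the names is_female, females, and others are illustrative and do not appear in the example above.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily'],
    'Age': [25, 30, 35, 40],
    'Gender': ['M', 'F', 'M', 'F'],
})

# Boolean mask: True for rows where Gender is 'F'
is_female = df['Gender'] == 'F'

females = df[is_female]   # rows where the mask is True
others = df[~is_female]   # rows where the mask is False

print(females)
print(others)
```

The two pieces never overlap, because every row falls on exactly one side of the mask.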

Method 2: Get All Groups of a Dataframe by Value

To get all groups of a Pandas dataframe by a specific column value and print out the resulting dataframes, all in one block of code:

				
import pandas as pd

# Create example dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 30, 35, 40],
        'Gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

# Group dataframe by Gender
grouped = df.groupby('Gender')

# Loop through all groups and print resulting dataframes
for name, group in grouped:
    print(f"Gender: {name}")
    print(group)

				
			

Output:

				
# Gender: F
#     Name  Age Gender
# 1  Alice   30      F
# 3  Emily   40      F
# Gender: M
#    Name  Age Gender
# 0  John   25      M
# 2   Bob   35      M
				
			

This code groups the dataframe by the “Gender” column using the groupby() method and then iterates over the resulting groups with a for loop. Each iteration yields the group’s key (the unique value in the “Gender” column) and a dataframe containing all the rows with that value. Unlike Method 1, the groups are not stored in a dictionary; they are produced one at a time as the loop runs.
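
If you need just one group instead of looping through all of them, the GroupBy object can also be queried directly. The sketch below assumes the same example dataframe and uses the groups attribute and the get_group() method; the names keys and males are illustrative.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily'],
    'Age': [25, 30, 35, 40],
    'Gender': ['M', 'F', 'M', 'F'],
})

grouped = df.groupby('Gender')

# The group keys found in the 'Gender' column
keys = list(grouped.groups)
print(keys)

# Fetch a single group as its own dataframe
males = grouped.get_group('M')
print(males)
```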

Method 3: Split a Pandas Dataframe by Position

Here’s an example of how to split a Pandas dataframe by column position and store the resulting dataframes in variables, all in one block of code:

				
import pandas as pd

# Create example dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 30, 35, 40],
        'Gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

# Split dataframe by position
df1 = df.iloc[:, :2]  # Select first two columns
df2 = df.iloc[:, 2:]  # Select last column

# Print resulting dataframes
print("DataFrame 1:")
print(df1)

print("DataFrame 2:")
print(df2)

				
			

Output:

				
# DataFrame 1:
#     Name  Age
# 0   John   25
# 1  Alice   30
# 2    Bob   35
# 3  Emily   40
# DataFrame 2:
#   Gender
# 0      M
# 1      F
# 2      M
# 3      F
				
			

This code creates a dataframe df with three columns and then uses the iloc indexer to select columns by integer position. The slice :2 selects the first two columns, which are stored in a new dataframe df1, and the slice 2: selects the remaining column, which is stored in df2. It then prints out both dataframes using the print() function.
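
The same indexer can also split a dataframe by row position rather than column position. The following sketch is an addition to the example above: it cuts the rows in half and then into fixed-size chunks. The names top, bottom, and chunks are illustrative, and chunk_size is a hypothetical value chosen for the demonstration.

```python
import pandas as pd

# Same example data as above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Emily'],
    'Age': [25, 30, 35, 40],
    'Gender': ['M', 'F', 'M', 'F'],
})

# Split the rows into a top half and a bottom half
half = len(df) // 2
top = df.iloc[:half]
bottom = df.iloc[half:]

# Split the rows into consecutive chunks of at most chunk_size rows
chunk_size = 3  # hypothetical value for illustration
chunks = [df.iloc[start:start + chunk_size]
          for start in range(0, len(df), chunk_size)]

print(top)
print(bottom)
print([len(chunk) for chunk in chunks])
```

The last chunk may be shorter than the others when the number of rows is not a multiple of chunk_size.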

Method 4: Split a Pandas Dataframe into Random Subsets

Here’s an example of how to split a Pandas dataframe into two random subsets and store the resulting dataframes in variables, all in one code block:

				
import pandas as pd

# Create example dataframe
data = {'Name': ['John', 'Alice', 'Bob', 'Emily'],
        'Age': [25, 30, 35, 40],
        'Gender': ['M', 'F', 'M', 'F']}
df = pd.DataFrame(data)

# Shuffle dataframe rows
df = df.sample(frac=1, random_state=42)  # Shuffle rows using a fixed random state for reproducibility

# Split dataframe into two random subsets
frac1 = 0.7  # Fraction of rows to include in first subset
df1 = df.sample(frac=frac1, random_state=42)  # Sample rows without replacement using a fixed random state
df2 = df.drop(df1.index)  # Keep the rows of the shuffled dataframe that are not in df1

# Print resulting dataframes
print("DataFrame 1:")
print(df1)

print("DataFrame 2:")
print(df2)

				
			

Output:

				
# DataFrame 1:
#     Name  Age Gender
# 3  Emily   40      F
# 2    Bob   35      M
# 1  Alice   30      F
# DataFrame 2:
#    Name  Age Gender
# 0  John   25      M
				
			

This code first creates a dataframe df with three columns. It then shuffles the rows of the dataframe using the sample() function with a frac argument of 1 (which means to sample all rows) and a fixed random state of 42 for reproducibility.

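Next, it splits the shuffled dataframe into two random subsets: df1 and df2. df1 is sampled without replacement from the shuffled dataframe using the sample() function with a frac of 0.7 (the frac1 variable) and the same random state of 42. Because the example has only four rows, 70% of the rows (2.8) is rounded to 3, so df1 contains three rows. df2 contains the one remaining row and is created by dropping the rows in df1 from the shuffled dataframe using the drop() function. On larger dataframes the split comes closer to 70% and 30%. Note that drop() removes rows by index label, so this approach assumes the index labels are unique.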

Finally, the code prints out both resulting dataframes using the print() function.
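
For the training, validation, and test partitioning mentioned earlier, one possible approach is to shuffle once and then slice by position. This is a sketch under assumptions, not a definitive recipe: the ten-row dataframe, the 60/20/20 proportions, the random state of 0, and the names train, val, and test are all invented for illustration.

```python
import pandas as pd

# Hypothetical ten-row dataframe used only for this illustration
df = pd.DataFrame({'id': range(10),
                   'value': [x * x for x in range(10)]})

# Shuffle all rows once, with a fixed random state for reproducibility
shuffled = df.sample(frac=1, random_state=0)

# Compute the cut points for a 60/20/20 split
n = len(shuffled)
train_end = round(n * 0.6)
val_end = round(n * 0.8)

# Slice the shuffled rows by position into three non-overlapping parts
train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

print(len(train), len(val), len(test))
```

Slicing by position after a single shuffle guarantees that the three parts do not overlap, even if the index labels are not unique.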

Method 5: Using MultiIndex to Split DataFrames

In some cases, you might need to split a DataFrame based on multiple levels of grouping. Pandas supports this functionality through MultiIndex. Here’s an example of how to use MultiIndex to split a DataFrame:

				
import pandas as pd
# Create a sample DataFrame with multi-level data
data = {'Category1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Category2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data)

# Set MultiIndex for the DataFrame
df = df.set_index(['Category1', 'Category2'])

# Split the DataFrame using MultiIndex
grouped = df.groupby(level=['Category1', 'Category2'])

for name, group in grouped:
    print(f"Group: {name}")
    print(group)

				
			

Output: 

				
# Group: ('A', 'X')
#                      Value
# Category1 Category2       
# A         X              1
#           X              5
# Group: ('A', 'Y')
#                      Value
# Category1 Category2       
# A         Y              2
#           Y              6
# Group: ('B', 'X')
#                      Value
# Category1 Category2       
# B         X              3
#           X              7
# Group: ('B', 'Y')
#                      Value
# Category1 Category2       
# B         Y              4
#           Y              8
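
Once a MultiIndex is set, you can also pull out a single slice without looping over every group. The sketch below assumes the same sample data; it uses xs() to take a cross-section on one level and a dictionary comprehension to keep each group addressable by its key. The names a_rows and parts are illustrative.

```python
import pandas as pd

# Same sample data as above, indexed by both category columns
data = {'Category1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Category2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data).set_index(['Category1', 'Category2'])

# Cross-section: all rows where Category1 is 'A'
# (by default xs() drops the level that was selected on)
a_rows = df.xs('A', level='Category1')
print(a_rows)

# Store every (Category1, Category2) group in a dictionary keyed by tuple
parts = {key: group
         for key, group in df.groupby(level=['Category1', 'Category2'])}
print(parts[('B', 'Y')])
```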

				
			

Method 6: Split-Apply-Combine Pattern

When you need to perform complex operations on subsets of your data, the split-apply-combine pattern comes in handy. This pattern involves splitting the data into groups, applying a function to each group, and then combining the results. Here’s an example using Pandas:

				
import numpy as np
import pandas as pd

# Create a sample DataFrame with multi-level data
data = {'Category1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Category2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data)

# Define a custom aggregation function
def custom_agg(group):
    return np.sum(group['Value']) / len(group)

# Apply the split-apply-combine pattern
result = df.groupby(['Category1', 'Category2']).apply(custom_agg)
print(result)

				
			

Output: 

				
# Category1  Category2
# A          X            3.0
#            Y            4.0
# B          X            5.0
#            Y            6.0
# dtype: float64
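
For common aggregations, a custom function is often unnecessary. As a hedged sketch on the same sample data, the built-in agg() and transform() methods can compute per-group statistics directly. The variable name summary and the column name GroupMean are assumptions made for this illustration.

```python
import pandas as pd

# Same sample data as above
data = {'Category1': ['A', 'A', 'B', 'B', 'A', 'A', 'B', 'B'],
        'Category2': ['X', 'Y', 'X', 'Y', 'X', 'Y', 'X', 'Y'],
        'Value': [1, 2, 3, 4, 5, 6, 7, 8]}
df = pd.DataFrame(data)

# Split by both category columns and select the column to aggregate
grouped = df.groupby(['Category1', 'Category2'])['Value']

# Apply several built-in aggregations at once; the result has one row per group
summary = grouped.agg(['mean', 'sum', 'count'])
print(summary)

# transform() broadcasts each group's mean back onto the original rows
df['GroupMean'] = grouped.transform('mean')
print(df)
```

Here agg() returns one row per group, while transform() returns a result aligned with the original rows, which is convenient when the group statistic is needed as a new column.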
				
			

Wrap up

In this article, we explored various methods to split a Pandas DataFrame into smaller DataFrames based on different criteria, such as column value, column position, random sampling, and multi-level grouping. We also introduced the split-apply-combine pattern, which can help you perform complex operations on grouped data. By mastering these techniques, you’ll be better equipped to handle large datasets and perform advanced data analysis using Pandas.


Thanks for reading. Happy coding!