Pandas is one of the most widely used Python libraries for data analysis and manipulation, offering a broad set of tools for cleaning, preparing, and transforming data. This article covers shuffling, that is, randomizing the order of the rows of a DataFrame. Pandas has no dedicated shuffle() function; the usual approach is the sample() method, and Sci-Kit Learn and NumPy offer alternatives. In this article, you will learn how to shuffle the rows of a Pandas Dataframe with Python.

What is Pandas Shuffle?

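Shuffling in Pandas means randomly reordering the rows of a DataFrame. Pandas does not provide a function named shuffle(); the shuffle is done with the sample() method or with helpers from other libraries. Shuffling matters whenever the order of the data could introduce bias, for example when a dataset is sorted by date or by label before it is split into training and test sets. By shuffling the rows, we remove any dependence on that original order.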

Loading a Sample Pandas Dataframe

The Python code below generates a sample Pandas Dataframe. If you wish to follow this tutorial line by line, feel free to copy the code blocks in order. You can also use your own dataframe, though the results will differ from those in the tutorial.

import pandas as pd

# create a dictionary containing data for the DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}

# create a Pandas DataFrame from the dictionary
df = pd.DataFrame(data)

# print the DataFrame
print(df)

Output:

#      Name  Age      City
#0    Alice   25  New York
#1      Bob   30     Paris
#2  Charlie   35    London
#3    David   40    Berlin
#4      Eva   45    Sydney

Shuffle a Pandas Dataframe with sample

You can shuffle a Pandas DataFrame using the sample() method with the frac parameter set to 1. Because sample() draws without replacement by default, this returns every row exactly once, in random order. Here is an example:

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)

# shuffle the DataFrame
shuffled_df = df.sample(frac=1)

# print the shuffled DataFrame
print(shuffled_df)

Output (your row order will likely differ, since the shuffle is random):

#      Name  Age      City
#3    David   40    Berlin
#1      Bob   30     Paris
#0    Alice   25  New York
#2  Charlie   35    London
#4      Eva   45    Sydney

In this example, we first create a sample DataFrame. We then use the sample() method to shuffle the rows of the DataFrame, with the frac parameter set to 1 to sample all rows. Finally, we print the shuffled DataFrame using print(). The output will be the shuffled DataFrame with the rows in a random order.
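A quick sanity check can confirm that a shuffle only reorders rows and neither drops nor duplicates them. The sketch below is a minimal illustration using the same sample data; the variable names same_length, same_labels, and restored are made up for this example.

```python
import pandas as pd

# same sample data as in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Paris", "London", "Berlin", "Sydney"],
})

shuffled_df = df.sample(frac=1)

# the shuffle keeps every row: same length, each index label exactly once
same_length = len(shuffled_df) == len(df)
same_labels = sorted(shuffled_df.index) == list(df.index)

# sorting by the original index labels undoes the shuffle
restored = shuffled_df.sort_index()

print(same_length, same_labels, restored.equals(df))
```

This works because sample() carries the original index labels along with each row, which is also why the index in the output above looks out of order.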

Notice that the original index labels travel with the shuffled rows. The index can be reset using the reset_index() method, which renumbers the rows from 0 onwards. Let’s see how this looks:

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)

# shuffle the DataFrame
shuffled_df = df.sample(frac=1)

# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)

# print the shuffled and reset DataFrame
print(shuffled_df)

Output (your row order will likely differ, since the shuffle is random):

#      Name  Age      City
#0    Alice   25  New York
#1    David   40    Berlin
#2      Bob   30     Paris
#3  Charlie   35    London
#4      Eva   45    Sydney

In this example, we first create a sample DataFrame. We then use the sample() method to shuffle the rows of the DataFrame, with the frac parameter set to 1 to sample all rows. Next, we use the reset_index() method to reset the index of the shuffled DataFrame, with drop=True so that the old index is discarded rather than kept as a new column. Finally, we print the shuffled and reset DataFrame using print(). The output will be the shuffled DataFrame with a new index numbered from 0 onwards.
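As a possible shortcut, recent versions of Pandas let sample() renumber the index itself through the ignore_index parameter. The sketch below assumes pandas 1.3 or newer, where that parameter is available; on older versions, the reset_index(drop=True) call remains the way to go.

```python
import pandas as pd

# same sample data as in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Paris", "London", "Berlin", "Sydney"],
})

# ignore_index=True (assumes pandas >= 1.3) relabels the result 0, 1, ..., n-1,
# so no separate reset_index() call is needed
shuffled_df = df.sample(frac=1, ignore_index=True)

print(shuffled_df.index.tolist())
```

The rows are still in random order; only the index labels are replaced.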

Reproduce a Shuffled Pandas Dataframe with random_state

If you want to reproduce a shuffled Pandas DataFrame with a specific random seed, you can set the random_state parameter of the sample() method.
Here’s an example:

import pandas as pd

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)

# shuffle the DataFrame with a random seed of 42
shuffled_df = df.sample(frac=1, random_state=42)

# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)

# print the shuffled and reset DataFrame
print(shuffled_df)

Output:

#      Name  Age      City
#0      Bob   30     Paris
#1      Eva   45    Sydney
#2  Charlie   35    London
#3    Alice   25  New York
#4    David   40    Berlin

In this example, we set the random_state parameter of the sample() method to 42. This ensures that the shuffled DataFrame is the same on every run, as long as the random seed and the input DataFrame are the same. If you want to reproduce the same shuffled DataFrame in the future, simply use the same random seed value. The output is the shuffled DataFrame with a new index numbered from 0 onwards, and running the code again with a random seed of 42 produces the same row order.
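One way to convince yourself of this is to shuffle twice with the same seed and compare the results. The sketch below is a minimal check; the names first and second are chosen only for this illustration.

```python
import pandas as pd

# same sample data as in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Paris", "London", "Berlin", "Sydney"],
})

# two independent shuffles that share the same seed
first = df.sample(frac=1, random_state=42).reset_index(drop=True)
second = df.sample(frac=1, random_state=42).reset_index(drop=True)

# equals() compares both the values and the index of the two DataFrames
print(first.equals(second))
```

A different seed value will usually, though not necessarily, give a different row order.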

Shuffle a Pandas Dataframe with Sci-Kit Learn’s shuffle

Another helpful way to shuffle a Pandas Dataframe is the shuffle() utility from Sci-Kit Learn (scikit-learn, imported as sklearn). Because Sci-Kit Learn is a popular machine learning library, using its shuffle() function makes it easy to fit the shuffling step into a machine learning workflow.

Here’s an example:

import pandas as pd
from sklearn.utils import shuffle

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)

# shuffle the DataFrame with Sci-Kit Learn's shuffle function (no seed set)
shuffled_df = shuffle(df)

# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)

# print the shuffled and reset DataFrame
print(shuffled_df)

Output (your row order will likely differ, since no random seed is set):

#      Name  Age      City
#0    Alice   25  New York
#1  Charlie   35    London
#2      Eva   45    Sydney
#3      Bob   30     Paris
#4    David   40    Berlin

If we want to reproduce our results, we can pass the random_state parameter.
Here’s an example:

import pandas as pd
from sklearn.utils import shuffle

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)

# shuffle the DataFrame with Sci-Kit Learn's shuffle function and a random seed of 1
shuffled_df = shuffle(df, random_state=1)

# reset the index
shuffled_df = shuffled_df.reset_index(drop=True)

# print the shuffled and reset DataFrame
print(shuffled_df)

In this example, we pass the value 1 to the random_state parameter of the shuffle() function. This ensures that the shuffled DataFrame will always be the same as long as the random seed is 1. The output will be the shuffled DataFrame with a new index numbered from 0 onwards, and it will be identical every time the code is run.

Shuffle a Pandas Dataframe with Numpy’s random.permutation

Another way to shuffle a Pandas DataFrame is to use NumPy’s random.permutation() function. This function generates a random permutation of a sequence, which can be used to shuffle the rows of a DataFrame.

Here’s an example:

import pandas as pd
import numpy as np

# create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eva'],
        'Age': [25, 30, 35, 40, 45],
        'City': ['New York', 'Paris', 'London', 'Berlin', 'Sydney']}
df = pd.DataFrame(data)

# shuffle the DataFrame with NumPy's random permutation function
shuffled_df = df.iloc[np.random.permutation(len(df))].reset_index(drop=True)

# print the shuffled DataFrame
print(shuffled_df)

Output (your row order will likely differ, since no random seed is set):

#      Name  Age      City
#0  Charlie   35    London
#1      Bob   30     Paris
#2      Eva   45    Sydney
#3    David   40    Berlin
#4    Alice   25  New York

In this example, we use NumPy’s random.permutation() function to generate a random permutation of the row indices of the DataFrame. We then use the iloc method to select the rows of the DataFrame in the shuffled order. Finally, we reset the index to start from 0 using the reset_index() method with the drop=True parameter.

This method can be useful if you are already using NumPy in your project and don’t want to import another library like Sci-Kit Learn just for shuffling a DataFrame. However, random.permutation() generates a new permutation every time it’s called, so if you need to reproduce the same shuffled DataFrame in the future, set a random seed with NumPy’s random.seed() function before calling it.
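As an alternative to seeding NumPy's global state with random.seed(), a local random generator can carry the seed. The sketch below swaps in np.random.default_rng(), which is not the approach shown above and assumes NumPy 1.17 or newer; SEED is an arbitrary illustrative value.

```python
import numpy as np
import pandas as pd

# same sample data as in the tutorial
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Eva"],
    "Age": [25, 30, 35, 40, 45],
    "City": ["New York", "Paris", "London", "Berlin", "Sydney"],
})

SEED = 42  # arbitrary value chosen for this illustration

# a local generator keeps the seed out of NumPy's global random state
rng = np.random.default_rng(SEED)
shuffled_df = df.iloc[rng.permutation(len(df))].reset_index(drop=True)

# a second generator built from the same seed yields the same permutation
rng_again = np.random.default_rng(SEED)
shuffled_again = df.iloc[rng_again.permutation(len(df))].reset_index(drop=True)

print(shuffled_df.equals(shuffled_again))
```

Keeping the generator local means other code that draws random numbers cannot change the shuffle by accident.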

The Fastest Way to Shuffle a Pandas Dataframe

You may be unsure which method to choose at this point. I would suggest picking the one that suits your workflow best. For instance, if you are building a data science pipeline with sklearn, you may wish to use its shuffle utility in that pipeline.

Check this comparison below:

Method | Description | Pros | Cons
--- | --- | --- | ---
Pandas sample(frac=1) | Shuffle the DataFrame using the sample() method, which returns a new shuffled DataFrame | Easy to use, doesn’t change the original DataFrame, supports random_state | Creates a new DataFrame, which can use more memory for large DataFrames
Sci-Kit Learn’s shuffle() | Shuffle the DataFrame using Sci-Kit Learn’s shuffle() function | Easy to use, works with NumPy arrays as well as DataFrames | Requires importing an additional library
NumPy’s random.permutation() | Shuffle the DataFrame using NumPy’s random.permutation() function | Works well if NumPy is already being used in the project | Generates a new permutation every time it’s called, so it is not reproducible without setting a random seed

Note that sample() has no inplace parameter, so there is no in-place variant to compare. To replace the original DataFrame, reassign the result, for example df = df.sample(frac=1). No benchmark is included here: all three approaches pick a random row order and then select the rows, so speed differences tend to be modest and are best measured on your own data.
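Since relative speed depends on the data and on library versions, a small timing harness can settle the question for your own case. The sketch below is one way to do it with the standard timeit module; the column names a and b, the row count, and the repeat counts are all illustrative assumptions, and no timing results are claimed here.

```python
import timeit

import numpy as np
import pandas as pd

# hypothetical test data: 100,000 rows (size chosen only for illustration)
n_rows = 100_000
df = pd.DataFrame({"a": np.arange(n_rows), "b": np.random.rand(n_rows)})

# each candidate is a zero-argument callable that returns a shuffled DataFrame
candidates = {
    "sample(frac=1)": lambda: df.sample(frac=1),
    "iloc + permutation": lambda: df.iloc[np.random.permutation(len(df))],
}

timings = {}
for name, func in candidates.items():
    # 3 repeats of 5 calls each; keep the best repeat, in seconds per call
    best = min(timeit.repeat(func, number=5, repeat=3))
    timings[name] = best / 5
    print(f"{name}: {timings[name]:.6f} s per call")
```

If Sci-Kit Learn is installed, its shuffle() function can be added to the candidates dictionary in the same way.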

Wrap up

You learned how to shuffle a Pandas Dataframe using the Pandas sample() method in this tutorial. The method lets us randomly sample rows, so to shuffle a dataframe we simply take a random sample of the entire dataframe with frac=1. Using the random_state parameter, we can even reproduce our shuffled dataframe.

You also learned how to use the sklearn and numpy libraries to shuffle your dataframe, giving you even more control over how your results are generated. Using sklearn, for instance, enables you to readily incorporate this step into machine learning pipelines.

Check out the official Pandas, Sci-Kit Learn, and NumPy documentation to learn more about the methods outlined in this tutorial.


Thanks for reading. Happy coding!