In the world of data analysis, data preprocessing is a vital step. One common issue that analysts encounter is the presence of duplicate rows in their datasets. In this tutorial, you will learn how to use the drop_duplicates method in the Pandas library to remove duplicate rows from a DataFrame.

Understanding Duplicate Rows in DataFrames

Before diving into the drop_duplicates method, it’s essential to understand what duplicate rows are and why they need to be removed. Duplicate rows are identical rows in a DataFrame, and they can occur due to multiple reasons, such as data entry errors or merging multiple datasets. Removing these duplicate rows is crucial for accurate data analysis and ensuring the integrity of your data.
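Before removing anything, it often helps to check whether a DataFrame contains duplicates at all. Here's a quick sketch using the related .duplicated() method (the sample data here is just for illustration):

```python
import pandas as pd

# a small DataFrame in which the third row repeats the first
df = pd.DataFrame({'Column1': ['A', 'B', 'A'],
                   'Column2': [1, 2, 1]})

# .duplicated() marks every repeat of an earlier row as True
mask = df.duplicated()
print(mask.tolist())   # [False, False, True]
print(mask.sum())      # 1 duplicate row
```

Summing the boolean mask gives a fast count of how many rows would be dropped.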

Let’s first take a look at the Pandas .drop_duplicates() method with its default arguments:

import pandas as pd

data = {'Column1': ['A', 'B', 'A', 'C', 'B', 'D'],
        'Column2': [1, 2, 1, 3, 2, 4]}

df = pd.DataFrame(data)
df_no_duplicates = df.drop_duplicates()

print(df_no_duplicates)

By default, the drop_duplicates() method removes all but the first occurrence of each duplicated row, considering all columns in the DataFrame.

There are some useful parameters that you can use to customize the behavior of the .drop_duplicates() method:

  • subset: Specify one or more columns to consider when identifying duplicates. By default, all columns are considered.

df_no_duplicates = df.drop_duplicates(subset=['Column1'])

  • keep: Determine which duplicate row to keep. The options are 'first' (default), 'last', or False. 'first' keeps the first occurrence, 'last' keeps the last occurrence, and False removes all duplicated rows entirely.

df_no_duplicates = df.drop_duplicates(keep='last')

  • inplace: Modify the DataFrame in place instead of returning a new one. Default is False.

df.drop_duplicates(inplace=True)

  • ignore_index: Reset the index after dropping duplicates. Default is False.

df_no_duplicates = df.drop_duplicates(ignore_index=True)

Using the drop_duplicates() method in combination with these parameters allows you to effectively manage and clean your datasets, ensuring that your data analysis remains accurate and efficient.
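For instance, several of these parameters can be combined in a single call. Here's a sketch, reusing the sample data from above, that deduplicates on one column, keeps the last match, and renumbers the index:

```python
import pandas as pd

data = {'Column1': ['A', 'B', 'A', 'C', 'B', 'D'],
        'Column2': [1, 2, 1, 3, 2, 4]}
df = pd.DataFrame(data)

# deduplicate on Column1 only, keep the last match, and renumber the index
result = df.drop_duplicates(subset=['Column1'], keep='last', ignore_index=True)
print(result)
```

The kept rows stay in their original relative order, so the result lists A, C, B, D with a fresh index of 0 through 3.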

Loading a Sample Pandas DataFrame

I’ve included a sample Pandas DataFrame below so that you can follow along line by line. Simply copy the code and paste it into your preferred code editor. Feel free to use your own DataFrame if you have one, though your results will differ.

import pandas as pd

# Create a dictionary with some data
data = {'Name': ['John', 'Jane', 'Bob', 'Mary'],
        'Age': [25, 30, 21, 27],
        'Gender': ['Male', 'Female', 'Male', 'Female']}

# Convert the dictionary to a Pandas DataFrame
df = pd.DataFrame(data)

# Print the DataFrame
print(df)

Output:

#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female

This will create a simple DataFrame with columns for Name, Age, and Gender, and four rows of data. You can customize the data and column names to fit your needs.

Using Pandas drop_duplicates to Keep the First Row

Because the keep= parameter defaults to 'first', we do not need to modify the method to behave differently. 

Here’s an example:

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# keep only the first occurrence of each row based on all columns
df_first = df.drop_duplicates()

# print the original and modified DataFrames
print("Original DataFrame:")
print(df)
print("\nDataFrame with first occurrence of each row:")
print(df_first)

Output:

# Original DataFrame:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female
# 4  John   25    Male

# DataFrame with first occurrence of each row:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female

In this example, the drop_duplicates method is called on the df DataFrame to remove any duplicate rows, keeping only the first occurrence of each row. The resulting DataFrame is stored in df_first.

Use Pandas drop_duplicates to Check Across Specific Columns

In some cases, you’ll only want to drop duplicate records across specific columns. Let’s see how we can scan only across the ['Name', 'Gender'] columns.

Here’s an example:

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# check for duplicate rows across the 'Name' and 'Gender' columns
df_duplicates = df.drop_duplicates(subset=['Name', 'Gender'])

# print the original and modified DataFrames
print("Original DataFrame:")
print(df)
print("\nDataFrame with duplicate rows removed based on 'Name' and 'Gender':")
print(df_duplicates)

Output:

# Original DataFrame:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female
# 4  John   25    Male

# DataFrame with duplicate rows removed based on 'Name' and 'Gender':
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female

In this example, the drop_duplicates method is called on the df DataFrame, specifying that duplicates should be identified based on both the ‘Name’ and ‘Gender’ columns using the subset parameter. The resulting DataFrame is stored in df_duplicates.

Note that drop_duplicates checks for duplicate rows across all columns by default. By specifying the subset parameter, you can limit the check to specific columns. You can also pass multiple columns as a list to check for duplicates across more than one column.

Using Pandas drop_duplicates to Keep the Last Row

Pandas also enables you to quickly maintain the most recent duplicated record instance.
Passing keep='last' into the method will change this behavior.

Here’s an example:

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# keep only the last occurrence of each row based on all columns
df_last = df.drop_duplicates(keep='last')

# print the original and modified DataFrames
print("Original DataFrame:")
print(df)
print("\nDataFrame with last occurrence of each row:")
print(df_last)

Output:

# Original DataFrame:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female
# 4  John   25    Male

# DataFrame with last occurrence of each row:
#    Name  Age  Gender
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female
# 4  John   25    Male

In this example, the drop_duplicates method is called on the df DataFrame to remove any duplicate rows, keeping only the last occurrence of each row. The resulting DataFrame is stored in df_last.

Note that the keep parameter is set to 'last' to indicate that the last occurrence of each row should be kept. By default, keep is set to 'first', so you don’t need to specify it if you want to keep the first occurrence of each row.

How to Remove All Duplicate Rows in Pandas

Sometimes it’s better to discard every row that has a duplicate rather than keep its first or last occurrence. You can control this by passing keep=False.

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# remove all duplicate rows
df_unique = df.drop_duplicates(keep=False)

# print the original and modified DataFrames
print("Original DataFrame:")
print(df)
print("\nDataFrame with all duplicate rows removed:")
print(df_unique)

Output:

# Original DataFrame:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female
# 4  John   25    Male

# DataFrame with all duplicate rows removed:
#    Name  Age  Gender
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female

In this example, the drop_duplicates method is called on the df DataFrame with the keep parameter set to False to remove all duplicate rows, regardless of whether they appear once or multiple times in the DataFrame. The resulting DataFrame, with only unique rows, is stored in df_unique.

Note that by default, keep is set to 'first', so you don’t need to specify it if you want to keep the first occurrence of each row. If you want to keep the last occurrence of each row, you can set keep to 'last'.

Use Pandas drop_duplicates to Keep Row with Max Value

Pandas doesn’t provide out-of-the-box functionality in drop_duplicates to keep the row with the maximum value in a column. Here’s an example of how to combine the groupby and idxmax methods to keep only the row with the maximum value in a specific column while removing any duplicate rows:

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# keep only the row with the maximum age for each name
df_max_age = df.loc[df.groupby('Name')['Age'].idxmax()]

# print the original and modified DataFrames
print("Original DataFrame:")
print(df)
print("\nDataFrame with row with max age kept for each unique name:")
print(df_max_age)

Output:

# Original DataFrame:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female
# 4  John   25    Male

# DataFrame with row with max age kept for each unique name:
#    Name  Age  Gender
# 2   Bob   21    Male
# 1  Jane   30  Female
# 0  John   25    Male
# 3  Mary   27  Female

In this example, the drop_duplicates method is not used directly. Instead, the loc method is used to select only the rows with the maximum age for each unique name. This is done by grouping the DataFrame by the ‘Name’ column and then using the idxmax method to get the index of the row with the maximum age for each group. The resulting index is used to select the corresponding rows from the DataFrame. The resulting DataFrame, with only one row per unique name, is stored in df_max_age.

Note that this approach works only if you have a unique maximum value for each group. If there are ties, this method will only keep the first occurrence of the maximum value.

Also note that this approach can be adapted to keep the row with the minimum value by using the idxmin method instead of idxmax.
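An alternative that does use drop_duplicates directly is to sort by the value column first and then keep the last row per group. Here's a sketch of that approach, using a small illustrative DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John'],
                   'Age': [25, 30, 28]})

# sort so the max Age per Name comes last, then keep the last duplicate;
# sort_index restores the original row order of the kept rows
df_max = (df.sort_values('Age')
            .drop_duplicates(subset='Name', keep='last')
            .sort_index())
print(df_max)
```

As with idxmax, swapping the sort direction (or using keep='first') adapts this to keep the minimum instead.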

Use Pandas to Remove Duplicate Records In Place

Alternatively, duplicate records may be dropped in place using the Pandas .drop_duplicates() method with inplace=True. In this case the method returns nothing and the original DataFrame is updated directly. In the earlier examples, we instead removed duplicate records by assigning the result to a new variable.

Here’s an example:

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# remove all duplicate rows in place
df.drop_duplicates(inplace=True)

# print the modified DataFrame
print("DataFrame with all duplicate rows removed in place:")
print(df)

Output:

# DataFrame with all duplicate rows removed in place:
#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female

In this example, the drop_duplicates method is called on the df DataFrame with the inplace parameter set to True to remove all duplicate rows and modify the original DataFrame directly. The resulting modified DataFrame, with only unique rows, is printed.

Note that when using inplace=True, the method does not return a new DataFrame. Instead, it modifies the original DataFrame directly, so there is no need to assign the result to a new variable.

Also note that modifying the DataFrame in place can be useful for large datasets where creating a new DataFrame would be memory-intensive. However, it’s always a good idea to make a copy of the original DataFrame before modifying it in case you need to revert to the original data.
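As a sketch of that precaution, you can keep a backup with the .copy() method before modifying a DataFrame in place:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John'],
                   'Age': [25, 30, 25]})

# take an independent copy of the original data before dropping in place
backup = df.copy()
df.drop_duplicates(inplace=True)

print(len(df))       # 2 rows after deduplication
print(len(backup))   # 3 rows - the backup is untouched
```

Because .copy() defaults to a deep copy, later in-place changes to df do not affect backup.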

How to Reset an Index When Dropping Duplicate Records in Pandas

After dropping duplicates, the index of the resulting DataFrame may contain gaps. One way to drop duplicates and reset the index in a single step is to chain the .reset_index() method onto .drop_duplicates().

Here’s an example:

import pandas as pd

# create a sample DataFrame with duplicate rows
df = pd.DataFrame({'Name': ['John', 'Jane', 'Bob', 'Mary', 'John'],
                   'Age': [25, 30, 21, 27, 25],
                   'Gender': ['Male', 'Female', 'Male', 'Female', 'Male']})

# remove all duplicate rows and reset the index
df_unique = df.drop_duplicates().reset_index(drop=True)

# print the modified DataFrame
print("DataFrame with all duplicate rows removed and index reset:")
print(df_unique)

Output:

#    Name  Age  Gender
# 0  John   25    Male
# 1  Jane   30  Female
# 2   Bob   21    Male
# 3  Mary   27  Female

In this example, the drop_duplicates method is called on the df DataFrame to remove all duplicate rows. The resulting modified DataFrame with only unique rows is stored in df_unique.

Note that when you remove duplicate rows, the resulting DataFrame may have gaps in the index if any rows were removed. To reset the index so that it starts at 0 and increments sequentially, you can use the reset_index method with the drop parameter set to True. The drop parameter is set to True to drop the old index column and create a new one with a sequential index.

The resulting DataFrame with the index reset is stored in df_unique, and it is printed to the console to show the modified DataFrame.
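In recent Pandas versions you can get the same result without chaining reset_index, by passing the ignore_index=True parameter mentioned in the parameter list above. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Jane', 'John'],
                   'Age': [25, 30, 25]})

# ignore_index=True renumbers the result from 0, like reset_index(drop=True)
df_unique = df.drop_duplicates(ignore_index=True)
print(df_unique.index.tolist())   # [0, 1]
```

Both spellings produce an identical DataFrame; ignore_index=True is simply more concise.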

Wrap up

In this article, we explored various ways to use the drop_duplicates method in Pandas to eliminate duplicate rows from a DataFrame.

We demonstrated techniques for retaining the first or last instance of each distinct row, identifying duplicates based on specific columns, removing all duplicate rows, and keeping only the row with the highest or lowest value in a particular column.

Additionally, we illustrated how to delete duplicate records in place and how to reorganize the index of a DataFrame after the removal of duplicate rows.

In summary, the drop_duplicates function in Pandas is an effective tool for refining and adjusting data to ensure that each row in a DataFrame is distinct and fulfills your criteria.

To learn more about Pandas, check out the official documentation here:
https://pandas.pydata.org/pandas-docs/stable/reference/series.html


Thanks for reading. Happy coding!