In this article, we will discuss various techniques for dropping one or more columns in Pandas, a widely-used Python library for data analysis and manipulation. By the end of this article, you will have a solid understanding of the different methods available to you for removing unnecessary columns from your DataFrame and how to apply these methods in your data analysis workflow.
Understanding the Basics: DataFrame and Columns
Before diving into the actual process of dropping columns, it’s essential to understand the basic concepts of a DataFrame and its columns. A DataFrame is a two-dimensional tabular data structure in Pandas, consisting of rows and columns. Each column in a DataFrame represents a particular attribute or feature, while rows correspond to individual observations or records.
Why Drop Columns in Pandas?
There are several reasons why you might want to drop columns in a Pandas DataFrame. Some of these reasons include:
- Irrelevant data: Some columns may not be relevant to your analysis or the problem you are trying to solve.
- Redundancy: Some columns may contain duplicate or highly correlated information, which can lead to multicollinearity issues in certain machine learning algorithms.
- Data size reduction: Dropping columns can help reduce the overall size of the dataset, making it easier to work with and store.
Loading a Sample Python Pandas DataFrame
I’ve included a sample Pandas DataFrame below so that you can follow along line by line. Simply copy the code and paste it into your preferred code editor. Feel free to use your own DataFrame if you have one, although your results will differ.
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print(df)
Output:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
This code will create a DataFrame with four columns – ‘Name’, ‘Age’, ‘City’, and ‘Salary’ – and five rows of data. You can modify the values in the ‘data’ dictionary to create your own DataFrame with the desired values.
How to Drop a Pandas Column by Name
To drop a column by name from the DataFrame created in the example code, you can use the drop() method with the axis=1 parameter.
Here’s how you can do it:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop 'City' column by name
df = df.drop('City', axis=1)
# print updated DataFrame
print("DataFrame after dropping 'City' column:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping 'City' column:
# Name Age Salary
# 0 John 25 60000
# 1 Alice 32 80000
# 2 Bob 18 40000
# 3 Karen 47 120000
# 4 Mike 22 55000
In this code, the drop() method is used to remove the ‘City’ column from the DataFrame df. The axis=1 parameter specifies that the operation should be performed on columns. Because drop() returns a new DataFrame by default, the result is assigned back to df. The updated DataFrame is then printed to show that the ‘City’ column has been dropped.
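As a minimal sketch of that default behavior, the snippet below uses a smaller two-row DataFrame (an assumption made for brevity) and stores the result under the illustrative name trimmed. It also uses axis='columns', which Pandas accepts as an alias for axis=1.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
})

# axis='columns' is an alias for axis=1
trimmed = df.drop('City', axis='columns')

print(list(trimmed.columns))  # ['Name', 'Age']
print(list(df.columns))       # ['Name', 'Age', 'City'] (original unchanged)
```

Keeping the result in a separate variable is handy when you still need the original columns later in the analysis.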
How to Drop Pandas Columns by Name In Place
To drop Pandas columns by name in place, you can use the drop() method with the axis=1 parameter and set the inplace parameter to True. The inplace parameter specifies whether to update the DataFrame in place or return a new DataFrame with the dropped columns.
Here’s how you can do it:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop 'Age' column by name in place
df.drop('Age', axis=1, inplace=True)
# print updated DataFrame
print("DataFrame after dropping 'Age' column in place:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping 'Age' column in place:
# Name City Salary
# 0 John New York 60000
# 1 Alice Paris 80000
# 2 Bob London 40000
# 3 Karen Tokyo 120000
# 4 Mike Sydney 55000
In this code, the drop() method is used to remove the ‘Age’ column from the DataFrame df. The axis=1 parameter specifies that the operation should be performed on columns, and inplace=True ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the ‘Age’ column has been dropped.
Note that after dropping columns in place, you don’t need to reassign the result of the drop() method to the DataFrame variable as the original DataFrame has already been modified.
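A short sketch of why reassigning is not just unnecessary but harmful: with inplace=True, drop() returns None. The two-row DataFrame and the variable name result are assumptions made for illustration.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({'Name': ['John', 'Alice'], 'Age': [25, 32]})

# with inplace=True the method returns None, so do not assign the result
result = df.drop('Age', axis=1, inplace=True)

print(result)            # None
print(list(df.columns))  # ['Name']
```

Writing df = df.drop(..., inplace=True) would therefore replace your DataFrame with None.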
How to Drop Multiple Pandas Columns by Names
You can drop several columns at once by passing a list of column names to the Pandas DataFrame.drop() method. When using this strategy, you can either:
1. Pass a list of columns into the labels= parameter and use axis=1.
2. Pass a list of columns into the columns= parameter.
Let’s demonstrate how to use the .drop() method to drop multiple columns by name by dropping the ‘Age’ and ‘City’ columns. The example passes the list as the first positional argument, which fills the labels parameter:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop 'Age' and 'City' columns by name
df.drop(['Age', 'City'], axis=1, inplace=True)
# print updated DataFrame
print("DataFrame after dropping 'Age' and 'City' columns:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping 'Age' and 'City' columns:
# Name Salary
# 0 John 60000
# 1 Alice 80000
# 2 Bob 40000
# 3 Karen 120000
# 4 Mike 55000
In this code, the drop() method is used to remove the ‘Age’ and ‘City’ columns from the DataFrame df. The column names are passed as a list to the drop() method, and the axis=1 parameter specifies that the operation should be performed on columns. inplace=True ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the ‘Age’ and ‘City’ columns have been dropped.
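The two ways of passing a list of columns can be compared side by side. This is a small sketch on a two-row DataFrame (assumed for brevity); the names by_labels and by_columns are illustrative.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
    'Salary': [60000, 80000],
})

# option 1: labels= together with axis=1
by_labels = df.drop(labels=['Age', 'City'], axis=1)

# option 2: columns= (no axis needed)
by_columns = df.drop(columns=['Age', 'City'])

print(by_labels.equals(by_columns))  # True
print(list(by_columns.columns))      # ['Name', 'Salary']
```

The columns= form is arguably easier to read, since the intent is visible without remembering what axis=1 means.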
How to Drop a Pandas Column by Position/Index
By using the .drop() method, a Pandas column can also be dropped according to its position (or index). The drop() method itself works with labels, so the position is first translated into a column label.
The df.columns property, which gives a list-like structure of all the column labels in the DataFrame, is used to do this. From there, you can pick the column you wish to remove by its integer position.
Let’s look at an example where the second column is removed to clarify how this works:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop the second column by index (i.e., 'Age')
df.drop(df.columns[1], axis=1, inplace=True)
# print the updated dataframe
print("DataFrame after dropping second column by index:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping second column by index:
# Name City Salary
# 0 John New York 60000
# 1 Alice Paris 80000
# 2 Bob London 40000
# 3 Karen Tokyo 120000
# 4 Mike Sydney 55000
In this code, the drop() method is used to remove the second column (i.e., ‘Age’) from the DataFrame df. The expression df.columns[1] looks up the label at position 1 (positions start at 0), and that label is passed to drop(). The axis=1 parameter specifies that the operation should be performed on columns, and inplace=True ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the second column has been dropped.
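To make the position-to-label step visible, here is a brief sketch on a two-row DataFrame (assumed for brevity). It also shows that negative positions work, so df.columns[-1] refers to the last column; the names last_label and trimmed are illustrative.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
    'Salary': [60000, 80000],
})

# indexing df.columns returns the label stored at that position
print(df.columns[1])  # Age

# negative positions count from the end, so -1 is the last column
last_label = df.columns[-1]
trimmed = df.drop(last_label, axis=1)

print(last_label)             # Salary
print(list(trimmed.columns))  # ['Name', 'Age', 'City']
```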
How to Drop Multiple Pandas Columns by Position/Index
Dropping multiple columns by position or index works much like the previous section. Indexing the df.columns attribute with a list of positions returns the matching labels, which you can then pass to the .drop() method.
Dropping the second and third columns of our DataFrame will help to better show this:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop the second and third columns by index (i.e., 'Age' and 'City')
df.drop(df.columns[[1, 2]], axis=1, inplace=True)
# print the updated dataframe
print("DataFrame after dropping second and third columns by index:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping second and third columns by index:
# Name Salary
# 0 John 60000
# 1 Alice 80000
# 2 Bob 40000
# 3 Karen 120000
# 4 Mike 55000
In this code, the drop() method is used to remove the second and third columns (i.e., ‘Age’ and ‘City’) from the DataFrame df. The expression df.columns[[1, 2]] looks up the labels at positions 1 and 2, and those labels are passed to drop(). The axis=1 parameter specifies that the operation should be performed on columns, and inplace=True ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the second and third columns have been dropped.
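When the positions are contiguous, a slice of df.columns is a compact alternative to a list of positions. The following is a sketch on a two-row DataFrame (assumed for brevity), with illustrative variable names.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
    'Salary': [60000, 80000],
})

# the slice 1:3 covers positions 1 and 2 (the end position is excluded)
middle_labels = df.columns[1:3]
trimmed = df.drop(middle_labels, axis=1)

print(list(middle_labels))    # ['Age', 'City']
print(list(trimmed.columns))  # ['Name', 'Salary']
```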
How to Drop a Pandas Column If It Exists
To drop a Pandas column if it exists, you can first check if the column exists using the in operator with the columns attribute of the DataFrame. Here’s an example code snippet to drop the column ‘Age’ if it exists in the DataFrame created by the sample code:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop the column 'Age' if it exists
if 'Age' in df.columns:
    df.drop('Age', axis=1, inplace=True)
    print("DataFrame after dropping column 'Age':")
    print(df)
else:
    print("Column 'Age' does not exist in the DataFrame.")
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping column 'Age':
# Name City Salary
# 0 John New York 60000
# 1 Alice Paris 80000
# 2 Bob London 40000
# 3 Karen Tokyo 120000
# 4 Mike Sydney 55000
In this code, the if statement checks if the column ‘Age’ exists in the DataFrame df. If it does, the drop() method is used to remove the column ‘Age’ from the DataFrame df. The axis=1 parameter specifies that the operation should be performed on columns, and inplace=True ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the column ‘Age’ has been dropped. If the column ‘Age’ does not exist in the DataFrame, a message is printed indicating that it does not exist.
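If you don’t need the message in the else branch, drop() also accepts errors='ignore', which skips labels that are not present instead of raising a KeyError. In this sketch the ‘Bonus’ column is hypothetical and deliberately absent from the two-row sample DataFrame.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
})

# 'Bonus' is a hypothetical column that does not exist in df;
# errors='ignore' drops the labels that exist and skips the rest
trimmed = df.drop(columns=['Age', 'Bonus'], errors='ignore')

print(list(trimmed.columns))  # ['Name', 'City']
```

The explicit if check remains the better choice when a missing column should be reported rather than silently skipped.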
How to Drop Pandas Columns by Condition
To drop Pandas columns by condition, you can build a list of the column names that satisfy the condition and pass that list to drop(). Here’s an example code snippet to drop all columns where the column name starts with ‘S’:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# build a list of the column names to drop
drop_cols = [col for col in df.columns if col.startswith('S')]
# drop the selected columns in place
df.drop(drop_cols, axis=1, inplace=True)
# print the updated DataFrame
print("DataFrame after dropping columns by condition:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after dropping columns by condition:
# Name Age City
# 0 John 25 New York
# 1 Alice 32 Paris
# 2 Bob 18 London
# 3 Karen 47 Tokyo
# 4 Mike 22 Sydney
In this code, the drop_cols list comprehension collects all column names that start with the letter ‘S’. The drop() method is then used to remove those columns from the DataFrame df. The axis=1 parameter specifies that the operation should be performed on columns, and inplace=True ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns starting with ‘S’ have been dropped.
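The same condition can also be written with a true boolean mask. The sketch below, on a two-row DataFrame (assumed for brevity), builds the mask with df.columns.str.startswith() and keeps the columns where the mask is False; the names starts_with_s and trimmed are illustrative.

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
    'Salary': [60000, 80000],
})

# boolean mask: True for every column name that starts with 'S'
starts_with_s = df.columns.str.startswith('S')

# ~ inverts the mask, so .loc keeps only the columns that do NOT match
trimmed = df.loc[:, ~starts_with_s]

print(list(trimmed.columns))  # ['Name', 'Age', 'City']
```

Note that this selects the columns to keep rather than calling drop(), so it returns a new DataFrame and leaves df untouched.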
How to Drop Pandas Columns Containing Missing Values
To drop Pandas columns containing missing values, you can use the dropna() method on the DataFrame with the axis=1 parameter to indicate that the operation should be performed on columns. Here’s an example code snippet to drop all columns that contain at least one missing value:
import pandas as pd
import numpy as np
# create sample data with missing values
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, np.nan, 18, 47, 22],
'City': ['New York', 'Paris', 'London', np.nan, 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, np.nan]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# drop columns containing missing values in place
df.dropna(axis=1, inplace=True)
# print the updated DataFrame
print("DataFrame after dropping columns with missing values:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25.0 New York 60000.0
# 1 Alice NaN Paris 80000.0
# 2 Bob 18.0 London 40000.0
# 3 Karen 47.0 NaN 120000.0
# 4 Mike 22.0 Sydney NaN
# DataFrame after dropping columns with missing values:
# Name
# 0 John
# 1 Alice
# 2 Bob
# 3 Karen
# 4 Mike
In this code, the dropna() method with the axis=1 parameter is used to remove all columns that contain at least one missing value from the DataFrame df. The inplace=True parameter ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns containing missing values have been dropped.
Note that this code will drop all columns that contain at least one missing value. If you want to drop only specific columns that contain missing values, you can first call isna().any() on those columns to find out which of them contain missing values, and then pass the resulting column labels to the drop() method to remove those specific columns.
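Here is one way that idea might look, as a sketch under a few assumptions: a three-row DataFrame, and a list named candidates holding the only columns we are willing to drop. ‘Salary’ also has a missing value, but it is not a candidate, so it stays.

```python
import numpy as np
import pandas as pd

# small sample DataFrame with missing values, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob'],
    'Age': [25, np.nan, 18],
    'City': ['New York', 'Paris', 'London'],
    'Salary': [60000, 80000, np.nan],
})

# only these columns may be dropped (an illustrative choice)
candidates = ['Age', 'City']

# True for each candidate column that has at least one missing value
has_missing = df[candidates].isna().any()

# keep only the labels where the flag is True and drop those columns
to_drop = has_missing[has_missing].index
trimmed = df.drop(columns=to_drop)

print(list(to_drop))          # ['Age']
print(list(trimmed.columns))  # ['Name', 'City', 'Salary']
```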
Dropping Pandas Columns Where a Number of Records Are Missing
The thresh= argument of dropna() takes an integer: the minimum number of non-missing values a column must contain in order to be kept. This is useful if you want to remove columns that are missing more than a certain number of records.
Take a look at how we may remove columns that have at least two missing values:
import pandas as pd
import numpy as np
# create sample data with missing values
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, np.nan, 18, np.nan, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [np.nan, 80000, 40000, np.nan, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# count the number of missing values per column
missing_values = df.isna().sum()
# print the missing values
print("Missing values per column:")
print(missing_values)
# drop columns containing at least two missing values in place
# (keep only columns with at least len(df) - 1 non-missing values)
df.dropna(axis=1, thresh=len(df) - 1, inplace=True)
# print the updated DataFrame
print("DataFrame after dropping columns with at least two missing values:")
print(df)
Output:
Original DataFrame:
Name Age City Salary
0 John 25.0 New York NaN
1 Alice NaN Paris 80000.0
2 Bob 18.0 London 40000.0
3 Karen NaN Tokyo NaN
4 Mike 22.0 Sydney 55000.0
Missing values per column:
Name 0
Age 2
City 0
Salary 2
dtype: int64
DataFrame after dropping columns with at least two missing values:
Name City
0 John New York
1 Alice Paris
2 Bob London
3 Karen Tokyo
4 Mike Sydney
In this code, the isna() method is used to create a Boolean mask indicating which cells in the DataFrame contain missing values. The sum() method is then used to count the number of missing values per column. The counts are printed to the console to show which columns have missing values.
The dropna() method is then used to remove all columns that do not have enough non-missing values. The axis=1 parameter indicates that the operation should be performed on columns. The thresh parameter is set to len(df) - 1, so with five rows a column is kept only if it has at least four non-missing values, that is, at most one missing value. ‘Age’ and ‘Salary’ each have two missing values, so both are dropped. The inplace=True parameter ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns containing at least two missing values have been dropped.
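Because thresh counts non-missing values rather than missing ones, it is easy to be off by one or two. One way to double-check a thresh value is to compare it against an explicit count of missing values. In this sketch, k is an assumed name for the number of missing values at which a column should be dropped, and the relationship thresh = len(df) - k + 1 follows from the definition of thresh.

```python
import numpy as np
import pandas as pd

# sample data with missing values, mirroring the example above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
    'Age': [25, np.nan, 18, np.nan, 22],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
    'Salary': [np.nan, 80000, 40000, np.nan, 55000],
})

k = 2             # drop columns with at least k missing values
n_rows = len(df)  # 5

# a column with fewer than k missing values has at least n_rows - k + 1 present
via_thresh = df.dropna(axis=1, thresh=n_rows - k + 1)

# cross-check by counting missing values explicitly
missing_counts = df.isna().sum()
via_counts = df.drop(columns=missing_counts[missing_counts >= k].index)

print(via_thresh.equals(via_counts))  # True
print(list(via_thresh.columns))       # ['Name', 'City']
```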
Dropping Pandas Columns Where a Percentage of Records Are Missing
The proportion of missing values could be a preferable metric to use when determining this criterion. If more than 50% of the values in a column were missing, you might choose to remove that column.
To drop Pandas columns where a percentage of records are missing, you can use the isna() method to check which cells contain missing values and the mean() method to calculate the fraction of missing values per column, which becomes a percentage when multiplied by 100.
Here’s an example:
import pandas as pd
import numpy as np
# create sample data with missing values
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, np.nan, 18, np.nan, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [np.nan, 80000, 40000, np.nan, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# calculate the percentage of missing values per column
missing_values_percentage = df.isna().mean() * 100
# print the missing values percentage
print("Missing values percentage per column:")
print(missing_values_percentage)
# drop columns with a missing values percentage greater than 50%
missing_values_threshold = 50.0
columns_to_drop = missing_values_percentage[missing_values_percentage > missing_values_threshold].index
df.drop(columns_to_drop, axis=1, inplace=True)
# print the updated DataFrame
print("DataFrame after dropping columns with a missing values percentage greater than 50%:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25.0 New York NaN
# 1 Alice NaN Paris 80000.0
# 2 Bob 18.0 London 40000.0
# 3 Karen NaN Tokyo NaN
# 4 Mike 22.0 Sydney 55000.0
# Missing values percentage per column:
# Name 0.0
# Age 40.0
# City 0.0
# Salary 40.0
# dtype: float64
# DataFrame after dropping columns with a missing values percentage greater than 50%:
# Name Age City Salary
# 0 John 25.0 New York NaN
# 1 Alice NaN Paris 80000.0
# 2 Bob 18.0 London 40000.0
# 3 Karen NaN Tokyo NaN
# 4 Mike 22.0 Sydney 55000.0
In this code, the isna() method is used to create a Boolean mask indicating which cells in the DataFrame contain missing values. The mean() method is then used to calculate the fraction of missing values per column, and multiplying by 100 turns it into a percentage. The missing values percentage is printed to the console to show which columns have a high percentage of missing values.
The code then drops columns where the percentage of missing values exceeds a certain threshold. The missing_values_threshold variable is set to 50.0 to indicate that a column should be dropped if its percentage of missing values exceeds 50%. The columns_to_drop variable is used to identify the columns that should be dropped, and the drop() method is used to remove them. The axis=1 parameter indicates that the operation should be performed on columns. The inplace=True parameter ensures that the original DataFrame is updated in place. In this sample, ‘Age’ and ‘Salary’ are each 40% missing, which does not exceed the threshold, so no columns are dropped and the printed DataFrame is unchanged.
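To see the threshold actually remove something, the logic can be wrapped in a small helper and called with a lower limit. This is only a sketch: drop_sparse_columns and max_missing_fraction are hypothetical names, not part of Pandas, and unlike the example above the helper works with fractions instead of percentages and treats the limit as inclusive (>=).

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(frame, max_missing_fraction):
    """Return a new DataFrame without columns whose missing fraction
    is greater than or equal to max_missing_fraction (hypothetical helper)."""
    missing_fraction = frame.isna().mean()
    to_drop = missing_fraction[missing_fraction >= max_missing_fraction].index
    return frame.drop(columns=to_drop)


# sample data with missing values, mirroring the example above
df = pd.DataFrame({
    'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
    'Age': [25, np.nan, 18, np.nan, 22],
    'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
    'Salary': [np.nan, 80000, 40000, np.nan, 55000],
})

# 'Age' and 'Salary' are 40% missing: kept at a 50% limit, dropped at 30%
print(list(drop_sparse_columns(df, 0.5).columns))  # ['Name', 'Age', 'City', 'Salary']
print(list(drop_sparse_columns(df, 0.3).columns))  # ['Name', 'City']
```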
How to Pop Pandas Columns
To pop Pandas columns, you can use the pop() method. The pop() method is used to remove a column from the DataFrame and return it as a Series.
Here’s an example:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# pop the 'Age' column from the DataFrame
age_column = df.pop('Age')
# print the updated DataFrame and the popped column
print("DataFrame after popping the 'Age' column:")
print(df)
print("Popped 'Age' column:")
print(age_column)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
# DataFrame after popping the 'Age' column:
# Name City Salary
# 0 John New York 60000
# 1 Alice Paris 80000
# 2 Bob London 40000
# 3 Karen Tokyo 120000
# 4 Mike Sydney 55000
#
# Popped 'Age' column:
# 0 25
# 1 32
# 2 18
# 3 47
# 4 22
# Name: Age, dtype: int64
In this code, the pop() method is used to remove the ‘Age’ column from the DataFrame and return it as a Series. The age_column variable is used to store the popped column for later use. The updated DataFrame is then printed to show that the ‘Age’ column has been removed. Finally, the popped column is printed to the console to show that it has been returned as a Series.
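Since pop() hands the column back, one possible use is moving a column to the end of the DataFrame: pop it and assign it straight back under the same name. A minimal sketch on a two-row DataFrame (assumed for brevity):

```python
import pandas as pd

# small sample DataFrame, assumed for illustration
df = pd.DataFrame({
    'Name': ['John', 'Alice'],
    'Age': [25, 32],
    'City': ['New York', 'Paris'],
})

# pop() removes 'Age' and returns it; assigning it back appends it as the last column
df['Age'] = df.pop('Age')

print(list(df.columns))  # ['Name', 'City', 'Age']
```

Note that pop() always modifies the DataFrame in place and accepts a single column label at a time.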
Wrap up
In this tutorial, you learned several helpful techniques for removing columns from DataFrames using Pandas. Working with DataFrames is a core skill for data scientists and analysts alike, and knowing the options Pandas provides for handling columns can make your workflow much simpler and more efficient!
To learn more, check out the official documentation for the Pandas .drop() method:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.drop.html
Thanks for reading. Happy coding!