Pandas is a powerful open-source data analysis and data manipulation library for Python. It provides robust data structures like Series and DataFrame that make handling large datasets more efficient and intuitive. In this article, we will explore how to create new columns in Pandas, enabling you to better organize, analyze, and manipulate your data.
Why Creating New Columns in Pandas is Essential
Creating new columns in your Pandas DataFrame is a crucial step in data processing, as it allows you to:
- Transform existing data: Apply functions or operations to existing columns to generate new information.
- Aggregate data: Combine information from multiple columns to create a summary column.
- Categorize data: Assign categories or labels to data based on certain conditions or criteria.
- Calculate metrics: Derive new insights or metrics from the data to support decision-making.
Loading a Sample Python Pandas DataFrame
I’ve included a sample Pandas DataFrame below so that you can follow the instruction line-by-line. Simply copy the code and paste it in your preferred code editor. Although your results will undoubtedly differ, feel free to use your own DataFrame if you have one.
import pandas as pd
# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 32, 18, 47],
'gender': ['F', 'M', 'M', 'M'],
'city': ['New York', 'Paris', 'London', 'Tokyo']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# print the DataFrame
print(df)
Output:
# name age gender city
# 0 Alice 25 F New York
# 1 Bob 32 M Paris
# 2 Charlie 18 M London
# 3 David 47 M Tokyo
In this example, we first create a dictionary data
with four keys ‘name’, ‘age’, ‘gender’, and ‘city’, and their corresponding values as lists. Then we create a DataFrame df
by passing the dictionary to the pd.DataFrame
function. Finally, we print the DataFrame df
to see the result.
Assign Multiple Values to a Column in Pandas
Here is an example of how to assign multiple values to a column in Pandas:
import pandas as pd
# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 32, 18, 47],
'gender': ['F', 'M', 'M', 'M'],
'city': ['New York', 'Paris', 'London', 'Tokyo']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# assign multiple values to column 'city'
df['city'] = ['NY', 'Paris', 'LDN', 'TKY']
print(df)
Output:
# name age gender city
# 0 Alice 25 F NY
# 1 Bob 32 M Paris
# 2 Charlie 18 M LDN
# 3 David 47 M TKY
In this example, we create a DataFrame df
from a dictionary of lists with four columns ‘name’, ‘age’, ‘gender’, and ‘city’. We then assign multiple values to column ‘city’ using a list of values and the column name ‘city’. Note that the list of values should have the same length as the number of rows in the DataFrame. Finally, we print the DataFrame df
to see the result.
You can also assign multiple values to a column based on some condition using the loc
method. Here’s an example:
# assign multiple values to column 'city' based on the values in column 'age'
df.loc[df['age'] < 20, 'city'] = ['NA', 'NA']
df.loc[df['age'] >= 20, 'city'] = ['NYC', 'Paris', 'LDN', 'TKY']
print(df)
Output:
# name age gender city
# 0 Alice 25 F NYC
# 1 Bob 32 M Paris
# 2 Charlie 18 M NA
# 3 David 47 M TKY
In this example, we assign multiple values to column ‘city’ based on the values in column ‘age’. If the value in column ‘age’ is less than 20, we assign two values ‘NA’ to column ‘city’, and if it’s greater than or equal to 20, we assign four values ‘NYC’, ‘Paris’, ‘LDN’, and ‘TKY’ to column ‘city’. We use the loc
method to select the rows and columns we want to assign values to.
Calculate a New Column in Pandas
Here’s an example of how to calculate a new column in Pandas based on existing columns:
import pandas as pd
# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 32, 18, 47],
'gender': ['F', 'M', 'M', 'M'],
'city': ['New York', 'Paris', 'London', 'Tokyo']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# calculate a new column 'age_in_months'
df['age_in_months'] = df['age'] * 12
print(df)
Output:
# name age gender city age_in_months
# 0 Alice 25 F New York 300
# 1 Bob 32 M Paris 384
# 2 Charlie 18 M London 216
# 3 David 47 M Tokyo 564
In this example, we first create a DataFrame df
from a dictionary of lists with four columns ‘name’, ‘age’, ‘gender’, and ‘city’. We then calculate a new column ‘age_in_months’ based on the values in column ‘age’ by multiplying it with 12, which represents the number of months in a year. We use the multiplication operator (*
) to perform the calculation and assign the result to the new column ‘age_in_months’. Finally, we print the DataFrame df
to see the result.
You can also calculate a new column based on multiple existing columns or using more complex mathematical operations.
Here’s an example:
# calculate a new column 'body_mass_index' based on 'weight' and 'height'
df['weight_kg'] = df['weight'] / 2.20462
df['height_m'] = df['height'] / 39.3701
df['body_mass_index'] = df['weight_kg'] / (df['height_m'] ** 2)
print(df)
Output:
# Index Height(Inches) Weight(Pounds) weight_kg height_m body_mass_index
# 0 1 65.78 112.99 51.251848 1.669556 18.391795
# 1 2 71.52 136.49 61.909600 1.815727 18.751180
# 2 3 69.40 153.03 69.398766 1.760021 22.307196
# 3 4 68.22 142.34 64.639814 1.732943 21.469178
# 4 5 67.79 144.30 65.454115 1.721801 22.078433
# .. ... ... ... ... ... ...
# 195 196 65.80 120.84 54.812475 1.670832 19.702101
# 196 197 66.11 115.78 52.497796 1.677018 18.731678
# 197 198 68.24 128.30 58.168483 1.734482 19.320306
# 198 199 68.02 127.47 57.809421
Add or Subtract Columns in Pandas
Similar to calculating a new column in Pandas, you can add or subtract (or multiple and divide) columns in Pandas.
Here’s an example of how to add or subtract columns in Pandas:
import pandas as pd
# create a dictionary of lists
data = {'name': ['Alice', 'Bob', 'Charlie', 'David'],
'age': [25, 32, 18, 47],
'gender': ['F', 'M', 'M', 'M'],
'city': ['New York', 'Paris', 'London', 'Tokyo']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# add a new column 'age_next_year' by adding a constant value to column 'age'
df['age_next_year'] = df['age'] + 1
# subtract two columns 'age' and 'age_next_year'
df['age_difference'] = df['age_next_year'] - df['age']
print(df)
Output:
# name age gender city age_next_year age_difference
# 0 Alice 25 F New York 26 1
# 1 Bob 32 M Paris 33 1
# 2 Charlie 18 M London 19 1
# 3 David 47 M Tokyo 48 1
In this example, we create a DataFrame df
from a dictionary of lists with four columns ‘name’, ‘age’, ‘gender’, and ‘city’. We then add a new column ‘age_next_year’ by adding a constant value of 1 to column ‘age’. We use the addition operator (+
) to perform the calculation and assign the result to the new column ‘age_next_year’. Finally, we subtract two columns ‘age’ and ‘age_next_year’ to get the age difference between the current and next year, and assign the result to a new column ‘age_difference’.
You can also add or subtract columns based on some condition or using more complex mathematical operations. Here’s an example:
# add a new column 'BMI_difference' by subtracting two columns 'BMI' and 'reference_BMI'
df['reference_BMI'] = 22
df['BMI_difference'] = df.apply(lambda row: row['body_mass_index'] - row['reference_BMI'] if row['body_mass_index'] > row['reference_BMI'] else 0, axis=1)
print(df)
Output:
# Index Height(Inches) Weight(Pounds) weight_kg height_m body_mass_index reference_BMI BMI_difference
# 0 1 65.78 112.99 51.251848 1.669556 18.391795 22 0.0
# 1 2 71.52 136.49 61.909600 1.815727 18.751180 22 0.0
# 2 3 69.40 153.03 69.398766 1.760021 22.307196 22 0.307196
# 3 4 68.22 142.34 64.639814 1.732943 21.469178 22 0.0
# 4 5 67.79 144.30 65.454115 1.721801 22.078433 22 0.078433
# .. ... ... ... ... ... ... ... ...
# 195 196
Combine String Columns in Pandas
Here is an example of how to combine string columns in Pandas:
import pandas as pd
# create a dictionary of lists
data = {'first_name': ['Alice', 'Bob', 'Charlie', 'David'],
'last_name': ['Smith', 'Johnson', 'Brown', 'Lee']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# combine the two columns 'first_name' and 'last_name' into a new column 'full_name'
df['full_name'] = df['first_name'] + ' ' + df['last_name']
print(df)
Output:
# first_name last_name full_name
# 0 Alice Smith Alice Smith
# 1 Bob Johnson Bob Johnson
# 2 Charlie Brown Charlie Brown
# 3 David Lee David Lee
In this example, we create a DataFrame df
from a dictionary of lists with two columns ‘first_name’ and ‘last_name’. We then combine these two columns into a new column ‘full_name’ by concatenating them with a space character in between using the addition operator (+
). Finally, we print the DataFrame df
to see the result.
You can also combine string columns based on some condition or using more complex string manipulation functions. Here’s an example:
# combine the two columns 'first_name' and 'last_name' into a new column 'full_name' with title case
df['full_name'] = df[['first_name', 'last_name']].apply(lambda row: row['first_name'].title() + ' ' + row['last_name'].title(), axis=1)
print(df)
Output:
# first_name last_name full_name
# 0 Alice Smith Alice Smith
# 1 Bob Johnson Bob Johnson
# 2 Charlie Brown Charlie Brown
# 3 David Lee David Lee
In this example, we combine the two columns ‘first_name’ and ‘last_name’ into a new column ‘full_name’ with title case using the apply
method and a lambda function. We first select the two columns using double square brackets [['first_name', 'last_name']]
, then apply a lambda function that converts each name to title case and concatenates them with a space character. We use the axis=1
parameter to apply the lambda function to each row of the DataFrame.
Split String Columns in Pandas
String columns can also be divided, which is similar to how they can be joined. Using the str.split()
method, we could immediately assign the result to two columns and divide the Name column into two.
Here’s an example of how to split string columns in Pandas:
import pandas as pd
# create a dictionary of lists
data = {'full_name': ['Alice Smith', 'Bob Johnson', 'Charlie Brown', 'David Lee']}
# create a DataFrame from the dictionary
df = pd.DataFrame(data)
# split the column 'full_name' into two columns 'first_name' and 'last_name'
df[['first_name', 'last_name']] = df['full_name'].str.split(' ', expand=True)
print(df)
Output:
# full_name first_name last_name
# 0 Alice Smith Alice Smith
# 1 Bob Johnson Bob Johnson
# 2 Charlie Brown Charlie Brown
# 3 David Lee David Lee
In this example, we create a DataFrame df
from a dictionary of lists with one column ‘full_name’. We then split this column into two columns ‘first_name’ and ‘last_name’ by using the str.split()
method with a space character as the separator. The expand=True
parameter tells Pandas to expand the resulting list of strings into separate columns. Finally, we print the DataFrame df
to see the result.
You can also split string columns into more than two columns or use more complex regular expressions as separators. Here’s an example:
# split the column 'full_name' into three columns 'title', 'first_name', and 'last_name'
df[['title', 'first_name', 'last_name']] = df['full_name'].str.extract('(\w+)\.\s(\w+)\s(\w+)', expand=True)
print(df)
Output:
# full_name title first_name last_name
# 0 Alice Smith None Alice Smith
# 1 Bob Johnson None Bob Johnson
# 2 Charlie Brown None Charlie Brown
# 3 David Lee None David Lee
In this example, we split the column ‘full_name’ into three columns ‘title’, ‘first_name’, and ‘last_name’ using a regular expression. The regular expression \w+\.\s(\w+)\s(\w+)
matches a word followed by a period (the title), followed by a space character, followed by two words (the first and last name). The parentheses (\w+)
capture the matched strings as separate groups. We use the str.extract()
method to extract the three groups and assign them to separate columns. Since there are no titles in this example, the ‘title’ column contains None
values.
In this code, the isna()
method is used to create a Boolean mask indicating which cells in the DataFrame contain missing values. The sum()
method is then used to count the number of missing values per column. The missing values are printed to the console to show which cells are missing.
The dropna()
method is then used to remove all columns where the number of missing values exceeds a certain threshold. The axis=1
parameter indicates that the operation should be performed on columns. The thresh
parameter is set to len(df) - 1 - 2
to indicate that a column should be kept only if it has at least len(df) - 1 - 2
non-missing values. The inplace=True
parameter ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns containing at least two missing values have been dropped.
Dropping Pandas Columns Where a Percentage of Records Are Missing
The proportion of missing values could be a preferable metric to use when determining this criterion. If 50% or more of the values in a column were missing, you might choose to remove those columns.
To drop Pandas columns where a percentage of records are missing, you can use the isna()
method to check which cells contain missing values and mean()
method to calculate the percentage of missing values per column.
Here’s an example:
import pandas as pd
import numpy as np
# create sample data with missing values
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, np.nan, 18, np.nan, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [np.nan, 80000, 40000, np.nan, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# calculate the percentage of missing values per column
missing_values_percentage = df.isna().mean() * 100
# print the missing values percentage
print("Missing values percentage per column:")
print(missing_values_percentage)
# drop columns with a missing values percentage greater than 50%
missing_values_threshold = 50.0
columns_to_drop = missing_values_percentage[missing_values_percentage > missing_values_threshold].index
df.drop(columns_to_drop, axis=1, inplace=True)
# print the updated DataFrame
print("DataFrame after dropping columns with a missing values percentage greater than 50%:")
print(df)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25.0 New York NaN
# 1 Alice NaN Paris 80000.0
# 2 Bob 18.0 London 40000.0
# 3 Karen NaN Tokyo NaN
# 4 Mike 22.0 Sydney 55000.0
# Missing values percentage per column:
# Name 0.0
# Age 40.0
# City 0.0
# Salary 40.0
# dtype: float64
# DataFrame after dropping columns with a missing values percentage greater than 50%:
# Name Age City Salary
# 0 John 25.0 New York NaN
# 1 Alice NaN Paris 80000.0
# 2 Bob 18.0 London 40000.0
# 3 Karen NaN Tokyo NaN
# 4 Mike 22.0 Sydney 55000.0
In this code, the isna()
method is used to create a Boolean mask indicating which cells in the DataFrame contain missing values. The mean()
method is then used to calculate the percentage of missing values per column. The missing values percentage is printed to the console to show which columns have a high percentage of missing values.
The code then drops columns where the percentage of missing values exceeds a certain threshold. The missing_values_threshold
variable is set to 50.0% to indicate that a column should be dropped if its percentage of missing values exceeds 50.0%. The columns_to_drop
variable is used to identify the columns that should be dropped, and the drop()
method is used to remove them. The axis=1
parameter indicates that the operation should be performed on columns. The inplace=True
parameter ensures that the original DataFrame is updated in place. The updated DataFrame is then printed to show that the columns with a missing values percentage greater than 50% have been dropped.
How to Pop Pandas Columns
To pop Pandas columns, you can use the pop()
method. The pop()
method is used to remove a column from the DataFrame and return it as a Series.
Here’s an example:
import pandas as pd
# create sample data
data = {'Name': ['John', 'Alice', 'Bob', 'Karen', 'Mike'],
'Age': [25, 32, 18, 47, 22],
'City': ['New York', 'Paris', 'London', 'Tokyo', 'Sydney'],
'Salary': [60000, 80000, 40000, 120000, 55000]}
# create DataFrame
df = pd.DataFrame(data)
# print DataFrame
print("Original DataFrame:")
print(df)
# pop the 'Age' column from the DataFrame
age_column = df.pop('Age')
# print the updated DataFrame and the popped column
print("DataFrame after popping the 'Age' column:")
print(df)
print("Popped 'Age' column:")
print(age_column)
Output:
# Original DataFrame:
# Name Age City Salary
# 0 John 25 New York 60000
# 1 Alice 32 Paris 80000
# 2 Bob 18 London 40000
# 3 Karen 47 Tokyo 120000
# 4 Mike 22 Sydney 55000
#DataFrame after popping the 'Age' column:
# Name City Salary
# 0 John New York 60000
# 1 Alice Paris 80000
# 2 Bob London 40000
# 3 Karen Tokyo 120000
# 4 Mike Sydney 55000
#
# Popped 'Age' column:
# 0 25
# 1 32
# 2 18
# 3 47
# 4 22
# Name: Age, dtype: int64
In this code, the pop()
method is used to remove the ‘Age’ column from the DataFrame and return it as a Series. The age_column
variable is used to store the popped column for later use. The updated DataFrame is then printed to show that the ‘Age’ column has been removed. Finally, the popped column is printed to the console to show that it has been returned as a Series.
Wrap up
You discovered some helpful techniques in this tutorial for removing columns from DataFrames using Pandas. Both data scientists and analysts should be proficient with working with DataFrames. Your workflow can be made much simpler and more efficient by being aware of the immense versatility that Pandas provides when working with columns!
To learn more about Pandas check out the official documentation here :
https://pandas.pydata.org/pandas-docs/stable/reference/series.html
Thanks for reading. Happy coding!