Pandas is a powerful Python library for data analysis, providing the data structures and functions needed to work with structured data. One of the most common challenges in data analysis is dealing with missing data. In this guide, we will dive into the fillna() method and explore various strategies for tackling missing data in Pandas DataFrames.
Understanding Missing Data and Its Impact
In Pandas, missing data is represented by NaN (Not a Number) or None. These values often arise from errors during data collection, data entry, or data processing. Before exploring the fillna() method, it is essential to know how to detect missing data in a DataFrame. Pandas provides several methods for this: isna() and its alias isnull() flag missing values, while notna() flags the values that are present.
import pandas as pd
# Sample DataFrame with missing data
data = {'A': [1, 2, None, 4],
'B': [None, 2, 3, 4],
'C': [1, None, None, 4]}
df = pd.DataFrame(data)
print(df.isna())
Output:
# A B C
# 0 False True False
# 1 False False True
# 2 True False True
# 3 False False False
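A common next step is to count the gaps per column. The short sketch below reuses the sample data above; summing the boolean mask returned by isna() is a standard idiom, and the variable names missing_per_column and total_missing are only illustrative.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [None, 2, 3, 4],
                   'C': [1, None, None, 4]})

# True counts as 1, so summing the boolean mask counts missing values per column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Total number of missing values in the whole DataFrame
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Counting first makes it easier to decide which columns need filling at all.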
Exploring the fillna Method
The fillna() method in Pandas fills missing values in a DataFrame or Series using a specified value or method. It has several parameters that give you control over the filling process. Some common parameters include:
value: Scalar, dict, Series, or DataFrame used to fill missing values.
method: Method to use for filling holes (pad or ffill, bfill, or None). This keyword is deprecated since pandas 2.1 in favor of the separate ffill() and bfill() methods.
axis: Axis along which to fill missing values (0 or ‘index’, 1 or ‘columns’).
inplace: If True, fill in place; otherwise, return a new object.
limit: Maximum number of consecutive missing values to fill.
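To make the default behavior of inplace concrete, here is a minimal sketch on a throwaway Series (the names s and filled are illustrative, not part of the examples that follow). It shows that fillna() returns a new object unless told otherwise.

```python
import pandas as pd

s = pd.Series([1.0, None, 3.0])

# By default fillna() returns a new object and leaves the original untouched
filled = s.fillna(0)

print(s.isna().sum())       # the original still has its gap
print(filled.isna().sum())  # the returned copy has none
```

For this reason the result usually has to be assigned to a variable or back to a column.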
Loading a Sample Pandas DataFrame
I’ve included a sample Pandas DataFrame below so that you can follow along line by line. Simply copy the code and paste it into your preferred code editor. Feel free to use your own DataFrame if you have one, although your results will of course differ.
import pandas as pd
# create a dictionary of lookup values and results
country_map = {'USA': 'United States', 'Canada': 'Canada', 'Australia': 'Australia', 'UK': 'United Kingdom'}
# create a dataframe
data = {'Name': ['John', 'Emma', 'Peter', 'Hannah'],
'Age': [25, 30, 21, 35],
'Country': ['USA', 'Canada', 'Australia', 'UK']}
df = pd.DataFrame(data)
# use .map() method to replace values in the 'Country' column
df['Country'] = df['Country'].map(country_map)
# print the updated dataframe
print(df)
Output:
# Name Age Country
# 0 John 25 United States
# 1 Emma 30 Canada
# 2 Peter 21 Australia
# 3 Hannah 35 United Kingdom
This creates a Pandas DataFrame with three columns, ‘Name’, ‘Age’, and ‘Country’, and four rows of sample data, with the country codes replaced by full names through .map(). You can modify the data in the dictionary to create your own custom DataFrame.
Using Pandas fillna() To Fill with 0
To replace all the missing values in a Pandas column with 0, we pass that value to the value= argument of the .fillna() method. A numeric column that contains NaN is stored as floating point, so the filled value appears as 0.0.
Here’s an example:
import pandas as pd
# Create a dictionary with sample data (including a missing value)
data = {
'Name': ['John', 'Jane', 'Adam', 'Emily', 'Mark'],
'Age': [25, 30, 21, None, 28], # Add a missing value
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
'Country': ['USA', 'USA', 'USA', 'USA', 'USA']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Fill missing value in 'Age' column with 0 and assign the result back
df['Age'] = df['Age'].fillna(0)
# Print the DataFrame with missing value filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York USA
# 1 Jane 30.0 Los Angeles USA
# 2 Adam 21.0 Chicago USA
# 3 Emily 0.0 Houston USA
# 4 Mark 28.0 Miami USA
In this example, we first create a dictionary data with a missing value in the ‘Age’ column. Then we create a DataFrame df from this dictionary.
Next, we use the fillna() method to fill the missing value in the ‘Age’ column with 0 and assign the result back to the column. Calling fillna() with inplace=True on a selected column such as df['Age'] is unreliable: recent Pandas versions warn about it, and with Copy-on-Write enabled the original DataFrame is left unchanged.
Finally, we print the updated DataFrame with the missing value filled using the print() function.
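Because NaN forces the ‘Age’ column to floating point, the filled column still prints values such as 25.0 rather than 25. If whole numbers are wanted, one option (an addition to the example above, shown on a smaller DataFrame) is to cast after filling.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['John', 'Emily'], 'Age': [25, None]})
print(df['Age'].dtype)  # floating point, because of the missing value

# Fill first, then cast; casting while NaN is still present would raise an error
df['Age'] = df['Age'].fillna(0).astype(int)
print(df['Age'].tolist())
```

Whether 0 is a sensible stand-in for a missing age depends on the analysis; the cast itself only changes how the values are stored.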
Using Pandas fillna() To Fill with a Constant Value
Filling with any other constant works the same way: pass the constant to the value= argument of the .fillna() method. Here we use 99 as the placeholder value.
Here’s an example:
import pandas as pd
# Create a dictionary with sample data (including a missing value)
data = {
'Name': ['John', 'Jane', 'Adam', 'Emily', 'Mark'],
'Age': [25, 30, 21, None, 28], # Add a missing value
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
'Country': ['USA', 'USA', 'USA', 'USA', 'USA']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Fill missing value in 'Age' column with a constant value of 99 and assign the result back
df['Age'] = df['Age'].fillna(99)
# Print the DataFrame with missing value filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York USA
# 1 Jane 30.0 Los Angeles USA
# 2 Adam 21.0 Chicago USA
# 3 Emily 99.0 Houston USA
# 4 Mark 28.0 Miami USA
In this example, we first create a dictionary data with a missing value in the ‘Age’ column. Then we create a DataFrame df from this dictionary.
Next, we use the fillna() method to fill the missing value in the ‘Age’ column with a constant value of 99. The result is assigned back to the column, which is more reliable than calling fillna() with inplace=True on a selected column.
Finally, we print the updated DataFrame with the missing value filled using the print() function.
Using Pandas fillna() To Fill with the Mean
To replace all missing values in a column with the column’s mean, compute the mean with the Pandas .mean() method and pass the result to the .fillna() method.
Here’s an example:
import pandas as pd
# Create a dictionary with sample data (including a missing value)
data = {
'Name': ['John', 'Jane', 'Adam', 'Emily', 'Mark'],
'Age': [25, 30, 21, None, 28], # Add a missing value
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
'Country': ['USA', 'USA', 'USA', 'USA', 'USA']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Fill missing value in 'Age' column with the mean age
mean_age = df['Age'].mean() # Calculate mean age (missing values are skipped)
df['Age'] = df['Age'].fillna(mean_age) # Fill missing value with mean age
# Print the DataFrame with missing value filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York USA
# 1 Jane 30.0 Los Angeles USA
# 2 Adam 21.0 Chicago USA
# 3 Emily 26.0 Houston USA
# 4 Mark 28.0 Miami USA
In this example, we first create a dictionary data with a missing value in the ‘Age’ column. Then we create a DataFrame df from this dictionary.
Next, we calculate the mean age of the non-missing values using the mean() method. We then use the fillna() method to fill the missing value in the ‘Age’ column with the mean age and assign the result back to the column, which is more reliable than calling fillna() with inplace=True on a selected column.
Finally, we print the updated DataFrame with the missing value filled using the print() function.
The advantage of this approach is that it works with any other computed value, such as the median or the mode of the column.
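As a minimal sketch of that idea, the snippet below fills the same kind of gap with the median and with the mode. The Series ages is made-up illustration data, not the DataFrame used above.

```python
import pandas as pd

ages = pd.Series([25, 30, 21, None, 28, 30], name='Age')

# The median is less sensitive to outliers than the mean
filled_median = ages.fillna(ages.median())

# mode() returns a Series (there can be ties), so take the first entry
filled_mode = ages.fillna(ages.mode().iloc[0])

print(filled_median.tolist())
print(filled_mode.tolist())
```

Which statistic is appropriate depends on the distribution of the column; the fillna() call itself is identical in each case.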
Using Pandas fillna() To Fill with a String
Likewise, we can provide a string to replace all missing values with that string. This works in the same manner as a numeric constant. Let’s see how we can use the string 'Unknown' to fill the missing value in the 'Country' column.
Here’s an example:
import pandas as pd
# Create a dictionary with sample data (including a missing value)
data = {
'Name': ['John', 'Jane', 'Adam', 'Emily', 'Mark'],
'Age': [25, 30, 21, None, 28], # Add a missing value
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami'],
'Country': ['USA', 'USA', 'USA', 'USA', None] # Add a missing value
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Fill missing value in 'Country' column with the string 'Unknown' and assign the result back
df['Country'] = df['Country'].fillna('Unknown')
# Print the DataFrame with missing value filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York USA
# 1 Jane 30.0 Los Angeles USA
# 2 Adam 21.0 Chicago USA
# 3 Emily NaN Houston USA
# 4 Mark 28.0 Miami Unknown
In this example, we first create a dictionary data with a missing value in the ‘Age’ column and a missing value in the ‘Country’ column. Then we create a DataFrame df from this dictionary.
Next, we use the fillna() method to fill the missing value in the ‘Country’ column with the string ‘Unknown’ and assign the result back to the column. The ‘Age’ column is not touched, so its missing value remains NaN.
Finally, we print the updated DataFrame with the missing value filled using the print() function.
Using Pandas fillna() to Fill Missing Values in an Entire DataFrame
To fill missing values across an entire Pandas DataFrame, we can pass a fill value to the value= parameter of the .fillna() method called on the DataFrame itself. Each column keeps its data type where possible; note that filling text columns with 0, as in the example, leaves them holding a mix of strings and numbers.
Here’s an example:
import pandas as pd
import numpy as np
# Create a dictionary with sample data (including missing values)
data = {
'Name': ['John', 'Jane', 'Adam', None, 'Mark'],
'Age': [25, None, 21, None, 28],
'City': ['New York', 'Los Angeles', None, 'Houston', 'Miami'],
'Country': [None, 'USA', 'USA', 'USA', None]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Fill all missing values in the DataFrame with 0
df.fillna(0, inplace=True)
# Print the updated DataFrame with missing values filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York 0
# 1 Jane 0.0 Los Angeles USA
# 2 Adam 21.0 0 USA
# 3 0 0.0 Houston USA
# 4 Mark 28.0 Miami 0
In this example, we first create a dictionary data with missing values in multiple columns. Then we create a DataFrame df from this dictionary.
Next, we use the fillna() method to fill all missing values in the DataFrame with 0. Because fillna() is called on df itself, the inplace=True argument modifies the DataFrame in place instead of returning a new one.
Finally, we print the updated DataFrame with all missing values filled using the print() function.
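If mixing 0 into text columns is not wanted, one possible alternative is to restrict the fill to numeric columns. The sketch below is an addition to the example above; the ‘Score’ column and the variable numeric_cols are hypothetical and only serve the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', None, 'Mark'],
    'Age': [25, None, 28],
    'Score': [None, 7.5, 9.0],  # hypothetical extra numeric column
})

# Pick out only the numeric columns
numeric_cols = df.select_dtypes(include='number').columns

# Fill gaps in those columns with 0 and leave text columns untouched
df[numeric_cols] = df[numeric_cols].fillna(0)
print(df)
```

The missing ‘Name’ stays missing here, so it can be handled separately with a more fitting placeholder.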
Using Pandas fillna() to Fill Missing Values in Specific DataFrame Columns
Up to this point, we’ve discussed filling missing data for either a single column or the entire DataFrame. Pandas enables you to input a dictionary of column-value pairs, which can be used to replace missing values in designated columns with specific values.
Here’s an example:
import pandas as pd
import numpy as np
# Create a dictionary with sample data (including missing values)
data = {
'Name': ['John', 'Jane', 'Adam', None, 'Mark'],
'Age': [25, None, 21, None, 28],
'City': ['New York', 'Los Angeles', None, 'Houston', 'Miami'],
'Country': [None, 'USA', 'USA', 'USA', None]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Fill missing values in the 'Age' and 'Country' columns with 0 using a column-to-value dictionary
df = df.fillna({'Age': 0, 'Country': 0})
# Print the updated DataFrame with missing values filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York 0
# 1 Jane 0.0 Los Angeles USA
# 2 Adam 21.0 None USA
# 3 None 0.0 Houston USA
# 4 Mark 28.0 Miami 0
In this example, we first create a dictionary data with missing values in multiple columns. Then we create a DataFrame df from this dictionary.
Next, we use the fillna() method to fill missing values in the ‘Age’ and ‘Country’ columns with a constant value of 0. We do this by passing a dictionary that maps each column name to its fill value; columns that are not listed, here ‘Name’ and ‘City’, keep their missing values.
Finally, we print the updated DataFrame with the specified missing values filled using the print() function.
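The dictionary form is most useful when each column needs a different fill value. The following sketch, which goes slightly beyond the example above, fills ‘Age’ with its mean and ‘Country’ with a placeholder string in a single call; the name fill_values is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['John', 'Jane', 'Adam', None, 'Mark'],
    'Age': [25, None, 21, None, 28],
    'Country': [None, 'USA', 'USA', 'USA', None],
})

# Map each column name to its own fill value; unlisted columns are left alone
fill_values = {'Age': df['Age'].mean(), 'Country': 'Unknown'}
df = df.fillna(value=fill_values)
print(df)
```

Keeping the fill values in one dictionary also documents the choice made for each column in a single place.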
Using Pandas fillna() to Back Fill or Forward Fill Data
Pandas additionally enables you to fill gaps in your data using the previous or subsequent observation, either through the method= argument of .fillna() or through the dedicated .ffill() and .bfill() methods. This technique is referred to as forward-filling or back-filling the data.
Here’s an example:
import pandas as pd
import numpy as np
# Create a dictionary with sample data (including missing values)
data = {
'Name': ['John', 'Jane', 'Adam', None, 'Mark'],
'Age': [25, None, 21, None, 28],
'City': ['New York', 'Los Angeles', None, 'Houston', 'Miami'],
'Country': [None, 'USA', 'USA', 'USA', None]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Forward fill missing values in 'City' column
df['City'] = df['City'].ffill()
# Back fill missing values in 'Name' column
df['Name'] = df['Name'].bfill()
# Print the updated DataFrame with missing values filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York None
# 1 Jane NaN Los Angeles USA
# 2 Adam 21.0 Los Angeles USA
# 3 Mark NaN Houston USA
# 4 Mark 28.0 Miami None
In this example, we first create a dictionary data with missing values in multiple columns. Then we create a DataFrame df from this dictionary.
Next, we forward fill missing values in the ‘City’ column with ffill(), which is equivalent to fillna(method='ffill'). The method= keyword is deprecated since pandas 2.1, so the dedicated method is used here. This fills each missing value with the previous non-missing value in the same column.
Similarly, we back fill missing values in the ‘Name’ column with bfill(), the replacement for fillna(method='bfill'). This fills each missing value with the next non-missing value in the same column.
Finally, we print the updated DataFrame with the forward filled and back filled missing values using the print() function.
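A forward fill cannot repair a gap in the first row, and a back fill cannot repair one in the last row, because there is no neighboring value to copy. A small sketch of one common workaround, chaining both directions, is shown below on a standalone Series modeled on the ‘Country’ column.

```python
import pandas as pd

country = pd.Series([None, 'USA', 'USA', 'USA', None], name='Country')

# Forward fill cannot fill the first row: there is no earlier value to copy
forward_only = country.ffill()
print(forward_only.isna().tolist())

# Chaining a back fill afterwards covers the leading gap as well
both = country.ffill().bfill()
print(both.tolist())
```

Whether copying values in both directions is acceptable depends on the data; for time-ordered data, back-filling uses information from the future.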
Limiting the Number of Consecutive Missing Data Filled with Pandas fillna()
When forward-filling or back-filling, you might not want to fill an entire gap in your data. By using the limit= parameter, you can set the maximum number of consecutive missing values to forward-fill or back-fill.
Let’s explore how we can apply this parameter to constrain the number of values filled in a gap within our data:
import pandas as pd
import numpy as np
# Create a dictionary with sample data (including missing values)
data = {
'Name': ['John', None, None, None, 'Mark'],
'Age': [25, None, None, None, 28],
'City': ['New York', None, None, None, 'Miami'],
'Country': [None, 'USA', 'USA', 'USA', None]
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Forward fill missing values in 'Name', 'Age', and 'City' columns, limiting to 2 consecutive missing values
df[['Name', 'Age', 'City']] = df[['Name', 'Age', 'City']].ffill(limit=2)
# Back fill missing values in 'Country' column, limiting to 1 consecutive missing value
df['Country'] = df['Country'].bfill(limit=1)
# Print the updated DataFrame with missing values filled
print(df)
Output:
# Name Age City Country
# 0 John 25.0 New York USA
# 1 John 25.0 New York USA
# 2 John 25.0 New York USA
# 3 None NaN None USA
# 4 Mark 28.0 Miami None
In this example, we first create a dictionary data with missing values in multiple columns. Then we create a DataFrame df from this dictionary.
Next, we forward fill missing values in the ‘Name’, ‘Age’, and ‘City’ columns with ffill(), the replacement for the deprecated fillna(method='ffill'), and pass limit=2. This fills missing values with the previous non-missing value in the same column, but at most 2 consecutive values per gap. The gap in rows 1 to 3 is three values long, so rows 1 and 2 are filled and row 3 stays missing.
Similarly, we back fill missing values in the ‘Country’ column with bfill() and limit=1. This fills at most 1 missing value per gap with the next non-missing value in the same column. Row 0 is filled from row 1, while row 4 stays missing because no value follows it.
Finally, we print the updated DataFrame with the limited consecutive missing values filled using the print() function.
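The effect of limit= is easiest to see side by side on a single gap. The sketch below uses a throwaway numeric Series (names are illustrative) and compares a limited with an unlimited forward fill.

```python
import pandas as pd

s = pd.Series([1.0, None, None, None, 5.0])

# At most two values per gap are filled; the third stays missing
limited = s.ffill(limit=2)
print(limited.tolist())

# Without a limit the whole gap is filled
unlimited = s.ffill()
print(unlimited.tolist())
```

A limit is a simple guard against carrying a stale value across a long run of missing observations.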
Using Pandas fillna() with groupby and transform
In this section, we’re going to explore using the Pandas .fillna() method to fill data across different categories. You can combine it with groupby() and transform() to fill missing values within groups in a DataFrame.
Here’s an example:
import pandas as pd
# Create a dictionary with sample data (including missing values)
data = {
'Name': ['John', 'Jane', 'Adam', None, 'Mark', 'Emily', None, 'Mike', 'Emma', 'David'],
'Age': [25, 30, 21, None, 28, 32, None, 27, 24, None],
'Gender': ['Male', 'Female', 'Male', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
'City': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Miami', 'Chicago', 'Miami', 'Los Angeles', 'New York', 'Houston']
}
# Create a DataFrame from the dictionary
df = pd.DataFrame(data)
# Use groupby() and transform() to fill missing 'Age' values with the mean age of each group
df['Age'] = df.groupby('Gender')['Age'].transform(lambda x: x.fillna(x.mean()))
# Print the updated DataFrame with missing values filled
print(df)
Output:
# Name Age Gender City
# 0 John 25.000000 Male New York
# 1 Jane 30.000000 Female Los Angeles
# 2 Adam 21.000000 Male Chicago
# 3 None 25.250000 Male Houston
# 4 Mark 28.000000 Male Miami
# 5 Emily 32.000000 Female Chicago
# 6 None 28.666667 Female Miami
# 7 Mike 27.000000 Male Los Angeles
# 8 Emma 24.000000 Female New York
# 9 David 25.250000 Male Houston
In this example, we first create a dictionary data with missing values in the ‘Name’ and ‘Age’ columns. Then we create a DataFrame df from this dictionary.
Next, we use the groupby() method to group the DataFrame by the ‘Gender’ column. Then we use the transform() method with a lambda function that calls fillna() on each group’s ‘Age’ values, filling the missing ones with that group’s mean, x.mean().
Finally, we print the updated DataFrame with the missing ‘Age’ values filled using the print() function.
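A variation on the same idea, sketched here as an alternative rather than as the method used above, is to compute the per-group statistic once with transform() and pass the resulting Series to fillna(). The example uses the group median and a smaller, made-up DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female', 'Female'],
    'Age': [25, 30, 21, None, 32, None],
})

# One median per row, taken from that row's group (missing values are skipped)
group_median = df.groupby('Gender')['Age'].transform('median')

# fillna() aligns the Series on the index, so each gap gets its own group's median
df['Age'] = df['Age'].fillna(group_median)
print(df)
```

Passing the statistic by name avoids the Python-level lambda and keeps the intermediate group values available for inspection.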
Wrap up
We covered how to use the fillna() method in Pandas to fill missing values in a DataFrame. We discussed various scenarios, including filling missing values in a single column, filling missing values in an entire DataFrame, filling with a constant value, the mean, or a string, and forward filling or back filling missing data.
We also covered how to limit the number of consecutive missing values filled and how to use fillna() with groupby() and transform() to fill missing values within groups in a DataFrame.
Overall, the fillna() method is a powerful tool that allows you to handle missing data in a flexible and customizable way using Pandas.
To learn more about the Pandas .fillna() method, check out the official documentation:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.htm
Thanks for reading. Happy coding!