Data cleaning and preparation are essential steps in any data analysis project. They involve identifying and correcting errors in the data, handling missing values, and transforming the data into a format suitable for analysis.

Pandas provides a number of efficient, flexible, and intuitive methods for cleaning and preparing your data. By the end of this tutorial, you’ll have learned all you need to know to get started with:

  • isnull() and notnull(): These methods can be used to check if a value is missing or not. isnull() returns True for missing values, while notnull() returns True for non-missing values.
  • fillna(): This method can be used to fill missing values with a specified value or method. For example, you can fill missing values with the mean or median of the column, or you can use a forward-fill or backward-fill method to fill in missing values based on the values before or after them.
  • dropna(): This method can be used to remove rows or columns that contain missing values. You can specify the axis (rows or columns) and, with the thresh parameter, the minimum number of non-missing values a row or column must have in order to be kept.

Here’s an example of how to use these methods in Pandas:

import pandas as pd

# Create a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Check for missing values
print(df.isnull())

# Fill missing values with the mean of each column
df_filled = df.fillna(df.mean())

# Drop rows with missing values
df_dropped = df.dropna()

In this example, we create a DataFrame with missing values and use isnull() to check for them. We then use fillna() to fill the missing values with the mean of each column, and dropna() to drop the rows that contain missing values.

Understanding the Pandas isnull Method

The isnull() method in Pandas is used to check for missing or null values in a DataFrame or Series. It returns a boolean mask, which is a DataFrame or Series of True and False values, indicating whether each element in the DataFrame or Series is missing or not.

Here’s an example of how to use the isnull() method in Pandas:

import pandas as pd

# Create a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Check for missing values
print(df.isnull())

Output:

       A      B
0  False  False
1  False   True
2   True  False
3  False  False

In this example, we create a DataFrame with missing values in columns ‘A’ and ‘B’. We then use the isnull() method to check for missing values in the DataFrame.

As you can see, the isnull() method returns a boolean mask with True values where the DataFrame has missing values and False values where the DataFrame has non-missing values.

You can also use the notnull() method to check for non-missing values in a DataFrame or Series. This method returns the opposite of the isnull() method, so it will return True where there are non-missing values and False where there are missing values.
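
For instance, here is a quick sketch of notnull() applied to the same DataFrame used above:

import pandas as pd

# Same DataFrame as in the isnull() example
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# notnull() is the inverse of isnull(): True where a value is present
print(df.notnull())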

Counting Missing Values in a Pandas DataFrame

To count the number of missing values in a Pandas DataFrame, you can use the isnull() method in combination with the sum() method. The isnull() method creates a boolean mask indicating where the DataFrame has missing values, and the sum() method adds up the True values in the boolean mask, giving you the total count of missing values.

Here’s an example of how to count missing values in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Count missing values
missing_values_count = df.isnull().sum()

# Print missing values count
print(missing_values_count)

Output:

A    1
B    1
dtype: int64

In this example, we create a DataFrame with missing values in columns ‘A’ and ‘B’. We then use the isnull() method to create a boolean mask indicating where the DataFrame has missing values, and the sum() method to count the number of True values in the boolean mask.

As you can see, the sum() method returns a Series with the count of missing values in each column of the DataFrame.

You can also use the notnull() method in combination with the sum() method to count the number of non-missing values in a Pandas DataFrame.
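
For example, continuing with the same DataFrame, here is a short sketch of counting non-missing values per column:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Count non-missing values in each column
non_missing_count = df.notnull().sum()
print(non_missing_count)
# A    3
# B    3
# dtype: int64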

Dropping Missing Data in a Pandas DataFrame

To drop missing data in a Pandas DataFrame, you can use the dropna() method. By default, the dropna() method removes any row that contains missing data (i.e., NaN values).

Here’s an example of how to drop missing data in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Drop missing data
df_dropped = df.dropna()

# Print original and dropped DataFrames
print('Original DataFrame:')
print(df)
print('Dropped DataFrame:')
print(df_dropped)

Output:

Original DataFrame:
     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0
Dropped DataFrame:
     A    B
0  1.0  5.0
3  4.0  8.0

As you can see, the dropna() method removes any row that contains missing data, so the resulting DataFrame has only the rows that don’t contain missing data.

You can also use the dropna() method to drop columns that contain missing data by specifying axis=1 as an argument.
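
For example, here is a brief sketch of both options, using a hypothetical extra column 'C' with no missing values so the effect is visible:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8],
                   'C': [9, 10, 11, 12]})

# Drop any column that contains at least one missing value (only 'C' remains)
cols_dropped = df.dropna(axis=1)

# Keep only rows with at least 3 non-missing values (rows 0 and 3)
rows_kept = df.dropna(thresh=3)

print(cols_dropped)
print(rows_kept)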

Filling Missing Data in a Pandas DataFrame

To fill missing data in a Pandas DataFrame, you can use the fillna() method. The fillna() method replaces any missing data (i.e., NaN values) with a specified value or method.

Here’s an example of how to fill missing data in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with missing values
df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Fill missing data with a specified value
df_filled = df.fillna(value=0)

# Print original and filled DataFrames
print('Original DataFrame:')
print(df)
print('Filled DataFrame:')
print(df_filled)

Output:

Original DataFrame:
     A    B
0  1.0  5.0
1  2.0  NaN
2  NaN  7.0
3  4.0  8.0
Filled DataFrame:
     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  7.0
3  4.0  8.0

In this example, we create a DataFrame with missing values in columns ‘A’ and ‘B’. We then use the fillna() method to replace any missing data with the value 0. 

As you can see, the fillna() method replaces any missing data with the value 0, so the resulting DataFrame has no missing data.

You can also use the fillna() method to replace missing data with a statistic such as the mean, median, or mode of the non-missing data in the column, or with a fill method such as forward fill or backward fill.
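
Here is a short sketch of both approaches, filling with the column medians and with a forward fill:

import pandas as pd

df = pd.DataFrame({'A': [1, 2, None, 4],
                   'B': [5, None, 7, 8]})

# Fill each column's missing values with that column's median
df_median = df.fillna(df.median())

# Forward fill: propagate the last valid value down the column
df_ffill = df.ffill()

print(df_median)
print(df_ffill)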

Working with Duplicate Data in Pandas

Pandas provides several methods to identify and remove duplicate data in a DataFrame. The main methods are:

  1. duplicated(): This method returns a Boolean Series that indicates whether each row is a duplicate (i.e., has been observed before) or not.
  2. drop_duplicates(): This method removes duplicate rows from a DataFrame.

Here’s an example of how to identify and remove duplicate data in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with duplicate rows
df = pd.DataFrame({'A': [1, 2, 2, 3],
                   'B': [4, 5, 5, 6]})

# Identify duplicate rows
duplicate_rows = df.duplicated()

# Print the duplicate rows
print('Duplicate Rows:')
print(df[duplicate_rows])

# Remove duplicate rows
df_clean = df.drop_duplicates()

# Print the clean DataFrame
print('Clean DataFrame:')
print(df_clean)

Output:

Duplicate Rows:
   A  B
2  2  5
Clean DataFrame:
   A  B
0  1  4
1  2  5
3  3  6

In this example, we create a DataFrame with duplicate rows in columns ‘A’ and ‘B’. We then use the duplicated() method to identify the duplicate rows, and the drop_duplicates() method to remove them.

As you can see, the duplicated() method identifies the row with values [2, 5] as a duplicate, and the drop_duplicates() method removes it. The resulting DataFrame has no duplicate rows.

You can also specify which columns to use for identifying duplicates using the subset parameter of the duplicated() and drop_duplicates() methods.

Identifying Duplicate Records in a Pandas DataFrame

To identify duplicate records in a Pandas DataFrame, you can use the duplicated() method. This method returns a Boolean Series that indicates whether each row is a duplicate (i.e., has been observed before) or not.

Here’s an example of how to identify duplicate records in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with duplicate records
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie', 'Alice'],
                   'Age': [25, 30, 35, 25],
                   'City': ['New York', 'Paris', 'London', 'New York']})

# Identify duplicate records
duplicate_records = df.duplicated()

# Print the duplicate records
print('Duplicate Records:')
print(df[duplicate_records])

Output:

Duplicate Records:
    Name  Age      City
3  Alice   25  New York

As you can see, the duplicated() method identifies the record with values ['Alice', 25, 'New York'] as a duplicate. The resulting Boolean Series has a value of True for this row and False for all other rows.

You can also specify which columns to use for identifying duplicates using the subset parameter of the duplicated() method.

Removing Duplicate Data in a Pandas DataFrame

To remove duplicate data in a Pandas DataFrame, you can use the drop_duplicates() method. This method removes duplicate rows from a DataFrame, comparing all columns by default.

Here’s an example of how to remove duplicate data in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with duplicate rows
df = pd.DataFrame({'A': [1, 2, 2, 3],
                   'B': [4, 5, 5, 6]})

# Remove duplicate rows
df_clean = df.drop_duplicates()

# Print the clean DataFrame
print('Clean DataFrame:')
print(df_clean)

Output:

Clean DataFrame:
   A  B
0  1  4
1  2  5
3  3  6

As you can see, the drop_duplicates() method removes the row with values [2, 5], which is a duplicate of the row above it.

You can also specify which columns to use for identifying duplicates using the subset parameter of the drop_duplicates() method.
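
For instance, here is a small sketch with a hypothetical DataFrame, treating rows as duplicates based only on the 'Name' column and keeping the last occurrence:

import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice'],
                   'City': ['New York', 'Paris', 'Boston']})

# Rows count as duplicates when their 'Name' values match
print(df.duplicated(subset=['Name']))   # row 2 is flagged as a duplicate

# Drop duplicates by 'Name', keeping the last occurrence instead of the first
df_unique = df.drop_duplicates(subset=['Name'], keep='last')
print(df_unique)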

Cleaning Strings in Pandas

To clean strings in Pandas, you can use string methods provided by Pandas or Python’s built-in string methods. Some common string cleaning operations include:

  • Removing leading and trailing whitespace
  • Converting to lowercase or uppercase
  • Removing or replacing specific characters
  • Splitting and joining strings

Here’s an example of how to clean strings in a Pandas DataFrame using some of these methods:

import pandas as pd

# Create a DataFrame with strings to clean
df = pd.DataFrame({'Name': [' Alice ', 'bob', 'CharliE', 'dave '],
                   'Email': ['ALICE@gmail.com', 'Bob@outlook.com', 'Charlie@gmail.com', 'DAVE@GMAIL.COM']})

# Clean the 'Name' column
df['Name'] = df['Name'].str.strip().str.title()

# Clean the 'Email' column
df['Email'] = df['Email'].str.lower()

# Print the cleaned DataFrame
print('Cleaned DataFrame:')
print(df)

Output:

Cleaned DataFrame:
      Name              Email
0    Alice    alice@gmail.com
1      Bob    bob@outlook.com
2  Charlie  charlie@gmail.com
3     Dave     dave@gmail.com

In this example, we create a DataFrame with strings in the ‘Name’ and ‘Email’ columns. We then use the str.strip() method to remove leading and trailing whitespace from the ‘Name’ column, and the str.title() method to convert the strings in the ‘Name’ column to title case. We also use the str.lower() method to convert the strings in the ‘Email’ column to lowercase.

As you can see, the strings in the ‘Name’ and ‘Email’ columns have been cleaned. You can also use regular expressions (regex) to perform more complex string cleaning operations in Pandas.

Trimming White Space in Pandas Strings

To trim white space in Pandas strings, you can use the str.strip() method. This method removes leading and trailing white space from a string.

Here’s an example of how to trim white space in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with strings containing white space
df = pd.DataFrame({'Name': [' Alice ', 'bob', 'CharliE', 'dave ']})

# Trim white space from the 'Name' column
df['Name'] = df['Name'].str.strip()

# Print the cleaned DataFrame
print('Cleaned DataFrame:')
print(df)

Output:

Cleaned DataFrame:
      Name
0    Alice
1      bob
2  CharliE
3     dave

In this example, we create a DataFrame with strings containing leading and trailing white space in the ‘Name’ column. We then use the str.strip() method to trim the white space from the ‘Name’ column.

As you can see, the white space has been removed from the strings in the ‘Name’ column.

You can also use the str.lstrip() method to remove only leading white space, or the str.rstrip() method to remove only trailing white space.
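
For example, a quick sketch with a hypothetical 'Name' column:

import pandas as pd

df = pd.DataFrame({'Name': [' Alice ', ' bob', 'dave ']})

print(df['Name'].str.lstrip())   # removes leading spaces only: 'Alice ', 'bob', 'dave '
print(df['Name'].str.rstrip())   # removes trailing spaces only: ' Alice', ' bob', 'dave'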

Splitting Strings into Columns in Pandas

To split strings into columns in Pandas, you can use the str.split() method. This method splits a string into a list of substrings based on a specified delimiter. You can then use the expand parameter to split the list into separate columns.

Here’s an example of how to split strings into columns in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with a column of strings to split
df = pd.DataFrame({'Name': ['Alice, Smith', 'Bob, Johnson', 'Charlie, Brown']})

# Split the 'Name' column into 'First' and 'Last' columns
df[['First', 'Last']] = df['Name'].str.split(', ', expand=True)

# Print the DataFrame with the split columns
print(df)

Output:

             Name    First     Last
0    Alice, Smith    Alice    Smith
1    Bob, Johnson      Bob  Johnson
2  Charlie, Brown  Charlie    Brown

In this example, we create a DataFrame with a column of strings in the ‘Name’ column. We then use the str.split() method to split the strings into ‘First’ and ‘Last’ names based on the comma delimiter. We use the expand parameter to split the list of names into separate columns. 

As you can see, the ‘Name’ column has been split into separate ‘First’ and ‘Last’ columns. You can also use other delimiters, such as whitespace or a hyphen, to split strings into columns. If the split produces a different number of substrings across rows, you can use the n parameter to cap the number of splits, as in the sketch below.
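
Here is a brief sketch using a hypothetical 'Path' column, splitting on a different delimiter and capping the number of splits with n:

import pandas as pd

df = pd.DataFrame({'Path': ['usr/local/bin', 'home/alice', 'var/log/nginx/access']})

# Split on '/' but make at most one split, so expand=True always yields two columns
df[['Top', 'Rest']] = df['Path'].str.split('/', n=1, expand=True)

print(df)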

Replacing Text in Strings in Pandas

To replace text in strings in Pandas, you can use the str.replace() method. This method replaces occurrences of a specified substring with a replacement string.

Here’s an example of how to replace text in strings in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with a column of strings to replace text in
df = pd.DataFrame({'Text': ['apples and oranges', 'bananas and grapes', 'pears and apples']})

# Replace 'apples' with 'kiwis' in the 'Text' column
df['Text'] = df['Text'].str.replace('apples', 'kiwis')

# Print the DataFrame with the replaced text
print(df)

Output:

                 Text
0   kiwis and oranges
1  bananas and grapes
2     pears and kiwis

In this example, we create a DataFrame with a column of strings in the ‘Text’ column. We then use the str.replace() method to replace occurrences of ‘apples’ with ‘kiwis’ in the ‘Text’ column.

As you can see, the occurrences of ‘apples’ in the ‘Text’ column have been replaced with ‘kiwis’. You can also use regular expressions with the str.replace() method to replace more complex patterns in strings.
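
For example, a short sketch using a regular expression on a hypothetical 'Text' column:

import pandas as pd

df = pd.DataFrame({'Text': ['Order #123', 'Order #456', 'No order yet']})

# Replace any run of digits with 'XXX' using a regex pattern
df['Text'] = df['Text'].str.replace(r'\d+', 'XXX', regex=True)

print(df)
# Rows become: 'Order #XXX', 'Order #XXX', 'No order yet'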

Changing String Case in Pandas

To change string case in Pandas, you can use the str.lower(), str.upper(), str.title(), or str.capitalize() methods. These methods convert all characters in a string to lowercase, uppercase, title case, or capitalize the first character of the string, respectively.

Here’s an example of how to change string case in a Pandas DataFrame:

import pandas as pd

# Create a DataFrame with a column of strings to change case
df = pd.DataFrame({'Text': ['apples and oranges', 'BANANAS and grapes', 'pears and APPLES']})

# Change the case of the 'Text' column
df['Lower'] = df['Text'].str.lower()
df['Upper'] = df['Text'].str.upper()
df['Title'] = df['Text'].str.title()
df['Capitalize'] = df['Text'].str.capitalize()

# Print the DataFrame with the changed case
print(df)

Output:

                 Text               Lower               Upper               Title          Capitalize
0  apples and oranges  apples and oranges  APPLES AND ORANGES  Apples And Oranges  Apples and oranges
1  BANANAS and grapes  bananas and grapes  BANANAS AND GRAPES  Bananas And Grapes  Bananas and grapes
2    pears and APPLES    pears and apples    PEARS AND APPLES    Pears And Apples    Pears and apples

In this example, we create a DataFrame with a column of strings in the ‘Text’ column. We then use the str.lower(), str.upper(), str.title(), and str.capitalize() methods to change the case of the ‘Text’ column and create new columns for each case change.

Wrap up

In this session, we covered various data cleaning techniques in Pandas. We discussed how to handle missing data, including identifying missing data with the isnull() method, counting missing data with the isnull().sum() method, and dropping or filling missing data using the dropna() and fillna() methods, respectively.

We also covered working with duplicate data, including identifying duplicates with the duplicated() method and removing duplicates with the drop_duplicates() method.

Finally, we discussed cleaning strings in Pandas, including trimming white space with the str.strip() method, splitting strings into columns with the str.split() method, and replacing text in strings with the str.replace() method, as well as changing string case with the str.lower(), str.upper(), str.title(), and str.capitalize() methods.

By using these techniques, you can clean and manipulate your data to prepare it for analysis and modeling.

You can find the official Pandas documentation here:
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pop.html


Thanks for reading. Happy coding!