In this article, we will build a precision-recall curve in Python. A precision-recall curve is a tool for evaluating the performance of a binary classification model, and it is especially useful when the classes are imbalanced. We will walk through the steps of creating the curve and then discuss how to interpret the results.

What is a Precision-Recall Curve?

A precision-recall curve is a graphical representation of the trade-off between precision and recall across the threshold values used in binary classification. Precision is the number of true positives (correctly classified positive instances) divided by the total number of positive predictions, TP / (TP + FP). Recall is the number of true positives divided by the total number of actual positives, TP / (TP + FN). Here FP and FN are the false positives and false negatives.
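
As a small worked example of these two definitions, the sketch below counts true positives, false positives, and false negatives by hand and checks the result against scikit-learn's precision_score and recall_score. The label arrays are made up purely for illustration.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels, invented for illustration only
y_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0])

# Count the outcomes that enter the two formulas
tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # predicted positive, actually positive
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # predicted positive, actually negative
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # predicted negative, actually positive

manual_precision = tp / (tp + fp)
manual_recall = tp / (tp + fn)

print("manual :", manual_precision, manual_recall)
print("sklearn:", precision_score(y_true, y_pred), recall_score(y_true, y_pred))
```

Both approaches should agree, since the scikit-learn functions implement the same ratios.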

Step 1: Import Packages

The first step in building a precision-recall curve is to import the necessary libraries. We will be using numpy, pandas, scikit-learn, and matplotlib.

				
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
				
			

Step 2: Fit the Logistic Regression Model

To fit a logistic regression model with four predictor variables, we first need a dataset that contains the predictor variables and the target variable.

Here’s an example that creates such a dataset and fits the model to it:

				
# Define the predictor variables
var1 = np.random.normal(0, 1, 1000)
var2 = np.random.normal(0, 1, 1000)
var3 = np.random.normal(0, 1, 1000)
var4 = np.random.normal(0, 1, 1000)

# Combine the predictor variables into a DataFrame
df = pd.DataFrame({'var1': var1, 'var2': var2, 'var3': var3, 'var4': var4})

# Define the target variable
target = np.random.binomial(1, 0.5, 1000)

# Fit the logistic regression model
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(df, target)
				
			

In this example, four predictor variables (var1, var2, var3, and var4) are generated using numpy’s random.normal function, which creates a random sample from a normal distribution with mean 0 and standard deviation 1. These predictor variables are then combined into a pandas DataFrame (df).

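A target variable is also generated using numpy’s random.binomial function. With a single trial and a success probability of 0.5, each draw is a 0 or a 1 with equal chance, so the two classes are roughly balanced. The logistic regression model is then fitted to the predictor variables and target variable using scikit-learn’s LogisticRegression class. Note that the target is generated independently of the predictors, so there is no real relationship for the model to learn; the data only serves to demonstrate the mechanics. Because no random seed is set, the numbers will also differ from run to run.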
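
As a variation on the data above, the sketch below makes the target depend on the predictors, so that the fitted model has some signal to pick up and the curve becomes more informative. The coefficients, the fixed seed, and the names rng, X, y, and signal_model are hypothetical choices for illustration, not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Four standard-normal predictors, mirroring the example above
X = pd.DataFrame(
    rng.normal(0, 1, size=(1000, 4)),
    columns=['var1', 'var2', 'var3', 'var4'],
)

# Hypothetical coefficients: var1 and var2 carry signal, var3 and var4 do not
logits = 1.5 * X['var1'].to_numpy() - 1.0 * X['var2'].to_numpy()
prob = 1 / (1 + np.exp(-logits))

# Draw each label as a single trial with its own success probability
y = rng.binomial(1, prob)

signal_model = LogisticRegression()
signal_model.fit(X, y)
print(signal_model.coef_)
```

With this setup, the fitted coefficients for var1 and var2 should come out clearly positive and negative, respectively, while those for var3 and var4 should be close to zero.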

Step 3: Create the Precision-Recall Curve

To create the Precision-Recall curve for the logistic regression model with four predictor variables, you can follow these steps:

				
# Make predictions on the dataset
predictions = model.predict_proba(df)[:, 1]

# Compute precision and recall values
precision, recall, thresholds = precision_recall_curve(target, predictions)

# Plot the Precision-Recall curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
				
			

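In this code, the model’s predict_proba method is called on the df DataFrame, and the column at index 1 is kept to obtain the predicted probabilities of the positive class. The precision_recall_curve function from scikit-learn then computes the precision and recall values at different probability thresholds. Note that the predictions here are made on the same data the model was fitted on, which keeps the example short; to estimate how a model performs on new data, you would normally score a held-out test set instead.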
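
To see what precision_recall_curve actually returns, it can help to inspect the three arrays directly. The following is a minimal sketch on a smaller, seeded copy of the random data (the seed and the sample size of 200 are assumptions made so the block runs on its own).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
X = rng.normal(0, 1, size=(200, 4))
y = rng.binomial(1, 0.5, 200)
scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

precision, recall, thresholds = precision_recall_curve(y, scores)

# precision and recall each have one more element than thresholds
print(len(precision), len(recall), len(thresholds))

# The final point is fixed at precision 1 and recall 0 and has no threshold
print(precision[-1], recall[-1])

# Pair each threshold with the precision and recall obtained when every
# score greater than or equal to that threshold is predicted as positive
for t, p, r in list(zip(thresholds, precision[:-1], recall[:-1]))[:5]:
    print(f"threshold={t:.3f}  precision={p:.3f}  recall={r:.3f}")
```

The thresholds are returned in increasing order, so recall starts at 1 for the lowest threshold and falls as the threshold rises.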

Finally, the plot function from matplotlib is used to draw the Precision-Recall curve, with recall on the x-axis and precision on the y-axis. The resulting plot shows the trade-off between precision and recall across probability thresholds and can be used to evaluate the logistic regression model. Because the target in this example is random and unrelated to the predictors, expect the curve to stay close to a precision of 0.5, the share of positives in the data, across most of the recall range.
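
One common way to summarize the curve in a single number is average precision, which scikit-learn exposes as average_precision_score. The sketch below compares it with the share of positives in the data, which is the precision a classifier with no skill would be expected to reach. The data is regenerated here with a fixed seed (an assumption; the original example is unseeded) so the block runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
X = rng.normal(0, 1, size=(1000, 4))
y = rng.binomial(1, 0.5, 1000)  # random labels, as in the example above

scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]

# Single-number summary of the precision-recall curve
ap = average_precision_score(y, scores)

# Share of positives: the reference level for a model with no skill
baseline = y.mean()

print(f"average precision: {ap:.3f}")
print(f"positive rate    : {baseline:.3f}")
```

An average precision close to the positive rate suggests the model is doing little better than guessing, which is what we would expect with random labels.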

All code in one:

				
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

# Define the predictor variables
var1 = np.random.normal(0, 1, 1000)
var2 = np.random.normal(0, 1, 1000)
var3 = np.random.normal(0, 1, 1000)
var4 = np.random.normal(0, 1, 1000)

# Combine the predictor variables into a DataFrame
df = pd.DataFrame({'var1': var1, 'var2': var2, 'var3': var3, 'var4': var4})

# Define the target variable
target = np.random.binomial(1, 0.5, 1000)

# Fit the logistic regression model
model = LogisticRegression()
model.fit(df, target)

# Make predictions on the dataset
predictions = model.predict_proba(df)[:, 1]

# Compute precision and recall values
precision, recall, thresholds = precision_recall_curve(target, predictions)

# Plot the Precision-Recall curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
				
			

Wrap up

In this tutorial, we learned how to create a Precision-Recall curve in Python for a logistic regression model with four predictor variables. We first generated a dataset with four predictor variables and a target variable, and fitted a logistic regression model using scikit-learn’s LogisticRegression class.

We then used the predict_proba method to obtain predicted probabilities for the positive class, and the precision_recall_curve function to compute the precision and recall values for different probability thresholds. Finally, we used matplotlib to visualize the Precision-Recall curve.

The resulting curve shows the trade-off between precision and recall for different probability thresholds, and can be used to evaluate the performance of the logistic regression model. By examining the curve, we can choose a threshold that balances precision and recall according to our needs.
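
Choosing a threshold from the curve can be done in several ways. The sketch below shows two possibilities: picking the threshold with the highest F1 score, and picking the lowest threshold that reaches a required precision. The data uses a hypothetical signal (made-up coefficients and seed) so that the thresholds differ meaningfully, and the target_precision value of 0.8 is an assumed requirement, not a recommendation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(0)  # fixed seed, chosen only for repeatability
X = rng.normal(0, 1, size=(1000, 4))

# Hypothetical signal: the label depends on the first two predictors
prob = 1 / (1 + np.exp(-(1.5 * X[:, 0] - 1.0 * X[:, 1])))
y = rng.binomial(1, prob)

scores = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
precision, recall, thresholds = precision_recall_curve(y, scores)

# Drop the final (precision=1, recall=0) point, which has no threshold
p, r = precision[:-1], recall[:-1]

# Option 1: the threshold with the highest F1 score
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero
best_f1_idx = int(np.argmax(f1))
best_f1_threshold = thresholds[best_f1_idx]
print(f"best F1 threshold: {best_f1_threshold:.3f}")

# Option 2: the lowest threshold that reaches a required precision.
# Thresholds are in increasing order and recall never increases with the
# threshold, so the first match keeps recall as high as possible.
target_precision = 0.8  # hypothetical requirement
meets_target = np.where(p >= target_precision)[0]
chosen_threshold = None
if meets_target.size > 0:
    idx = int(meets_target[0])
    chosen_threshold = thresholds[idx]
    print(f"threshold for precision >= {target_precision}: {chosen_threshold:.3f}")
```

Which rule is appropriate depends on the relative cost of false positives and false negatives in your application.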

To learn more about precision-recall curves, check out the scikit-learn example:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html


Thanks for reading. Happy coding!