In this article, we discuss the precision-recall curve in Python. A precision-recall curve is used to evaluate the performance of a binary classification model. It is especially useful when the classes are imbalanced, because neither precision nor recall depends on the number of true negatives. We will walk through the steps of creating a precision-recall curve in Python, and we will also discuss how to interpret the results.
What is a Precision-Recall Curve
A precision-recall curve is a graphical representation of the trade-off between precision and recall for different threshold values used in binary classification. Precision represents the number of true positives (correctly classified positive instances) divided by the total number of positive predictions. Recall represents the number of true positives divided by the total number of actual positives.
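To make the two definitions concrete, here is a minimal sketch that computes precision and recall by hand for a single fixed set of predictions and checks the result against scikit-learn's precision_score and recall_score. The labels in y_true and y_pred are made up for illustration and are not part of the dataset used later in this article.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

# Count the outcomes that enter the two formulas
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

precision_manual = tp / (tp + fp)  # true positives / all positive predictions
recall_manual = tp / (tp + fn)     # true positives / all actual positives

print(precision_manual, precision_score(y_true, y_pred))
print(recall_manual, recall_score(y_true, y_pred))
```

Here there are 2 true positives, 1 false positive, and 2 false negatives, so precision is 2/3 and recall is 2/4. A precision-recall curve repeats this calculation for every candidate threshold.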
Step 1: Import Packages
The first step in building a precision-recall curve is to import the necessary libraries. We will be using scikit-learn, matplotlib, numpy, and pandas.
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
Step 2: Fit the Logistic Regression Model
To fit a logistic regression model, we first need a dataset that contains the predictor variables and the target variable.
Here’s an example that creates a dataset with four predictor variables and a binary target, and then fits the model:
# Define the predictor variables
var1 = np.random.normal(0, 1, 1000)
var2 = np.random.normal(0, 1, 1000)
var3 = np.random.normal(0, 1, 1000)
var4 = np.random.normal(0, 1, 1000)
# Combine the predictor variables into a DataFrame
df = pd.DataFrame({'var1': var1, 'var2': var2, 'var3': var3, 'var4': var4})
# Define the target variable
target = np.random.binomial(1, 0.5, 1000)
# Fit the logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(df, target)
In this example, four predictor variables (var1, var2, var3, and var4) are generated using numpy’s random.normal function, which creates a random sample from a normal distribution with mean 0 and standard deviation 1. These predictor variables are then combined into a pandas DataFrame (df).
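A target variable is also generated using numpy’s random.binomial function, which here draws one trial per row with a success probability of 0.5, so each label is effectively a fair coin flip. The logistic regression model is then fitted to the predictor variables and target variable using scikit-learn’s LogisticRegression class. Note that this target is drawn independently of the predictors, so the model has no real signal to learn and its precision will stay close to 0.5, the share of positives in the data.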
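A target that actually depends on the predictors gives the model something to learn and produces a curve with more shape. The sketch below is one hypothetical way to build such a dataset. The coefficients (1.5, -1.0, 0.5), the seed, the 70/30 split, and the names X, y, clf, and test_scores are all assumptions made for illustration and are not part of the original example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Seeded generator so the sketch is reproducible (the seed is arbitrary)
rng = np.random.default_rng(0)

# Four standard-normal predictors, as in the example above
X = pd.DataFrame(rng.normal(0, 1, size=(1000, 4)),
                 columns=['var1', 'var2', 'var3', 'var4'])

# Hypothetical coefficients: the probability of a positive label
# now depends on var1, var2, and var3
logits = 1.5 * X['var1'] - 1.0 * X['var2'] + 0.5 * X['var3']
prob = 1 / (1 + np.exp(-logits.to_numpy()))
y = rng.binomial(1, prob)

# Hold out 30% of the rows so the model is scored on data it has not seen
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# Predicted probability of the positive class on the held-out rows
test_scores = clf.predict_proba(X_test)[:, 1]
```

Passing y_test and test_scores to precision_recall_curve would then describe how the model behaves on unseen data rather than on the rows it was trained on.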
Step 3: Create the Precision-Recall Curve
To create the Precision-Recall curve for the logistic regression model with four predictor variables, you can follow these steps:
# Make predictions on the dataset
predictions = model.predict_proba(df)[:, 1]
# Compute precision and recall values
precision, recall, thresholds = precision_recall_curve(target, predictions)
# Plot the Precision-Recall curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
In this code, the model’s predict_proba method is called with the df DataFrame to obtain the predicted probabilities of the positive class (the second column of its output). The precision_recall_curve function from scikit-learn is then used to compute the precision and recall values for different probability thresholds.
Finally, the plot function from matplotlib is used to visualize the Precision-Recall curve, with recall on the x-axis and precision on the y-axis. The resulting plot shows the trade-off between precision and recall for different probability thresholds, and can be used to evaluate the performance of the logistic regression model. Keep in mind that the predictions here are made on the same data the model was trained on; for a realistic evaluation, compute the curve on held-out data.
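It can help to look at what precision_recall_curve returns before plotting it. The sketch below uses a tiny made-up set of labels and scores (not the dataset from this article) to show that the precision and recall arrays are one element longer than thresholds, and it also computes average_precision_score, which summarizes the curve as a single number.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Tiny made-up example: true labels and predicted scores
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# precision and recall have one more entry than thresholds:
# the last point (precision=1, recall=0) has no corresponding threshold
print(len(precision), len(recall), len(thresholds))
print(precision[-1], recall[-1])

# Average precision: a weighted mean of the precision at each threshold,
# weighted by the increase in recall from the previous threshold
ap = average_precision_score(y_true, y_scores)
print(round(ap, 4))
```

When pairing a threshold with its precision and recall, drop the final element of both arrays (precision[:-1], recall[:-1]) so the three arrays line up.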
All the code in one place:
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt
# Define the predictor variables
var1 = np.random.normal(0, 1, 1000)
var2 = np.random.normal(0, 1, 1000)
var3 = np.random.normal(0, 1, 1000)
var4 = np.random.normal(0, 1, 1000)
# Combine the predictor variables into a DataFrame
df = pd.DataFrame({'var1': var1, 'var2': var2, 'var3': var3, 'var4': var4})
# Define the target variable
target = np.random.binomial(1, 0.5, 1000)
# Fit the logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(df, target)
# Make predictions on the dataset
predictions = model.predict_proba(df)[:, 1]
# Compute precision and recall values
precision, recall, thresholds = precision_recall_curve(target, predictions)
# Plot the Precision-Recall curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()
Wrap up
In this tutorial, we learned how to create a Precision-Recall curve in Python for a logistic regression model with four predictor variables. We first generated a dataset with four predictor variables and a target variable, and fitted a logistic regression model using scikit-learn’s LogisticRegression class.
We then used the predict_proba method to obtain predicted probabilities for the positive class, and the precision_recall_curve function to compute the precision and recall values for different probability thresholds. Finally, we used matplotlib to visualize the Precision-Recall curve.
The resulting curve shows the trade-off between precision and recall for different probability thresholds, and can be used to evaluate the performance of the logistic regression model. By examining the curve, we can choose a threshold that balances precision and recall according to our needs.
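One possible way to choose such a threshold is to pick the one that maximizes the F1 score, the harmonic mean of precision and recall. This is only one criterion among many, and it assumes precision and recall matter equally. The labels and scores below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Made-up labels and predicted scores for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.5])

precision, recall, thresholds = precision_recall_curve(y_true, y_scores)

# Drop the final (precision=1, recall=0) point, which has no threshold
p, r = precision[:-1], recall[:-1]

# F1 = 2 * P * R / (P + R), guarding against division by zero
denom = p + r
f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best = np.argmax(f1)
best_threshold = thresholds[best]
print(best_threshold, p[best], r[best], f1[best])
```

In this toy data, a threshold of 0.55 gives precision 1.0 and recall 0.75, which is the best F1 available. If false positives and false negatives carry different costs in your application, a different criterion, such as the highest recall at a minimum acceptable precision, may be a better fit.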
To learn more about precision-recall curves, check out the scikit-learn example:
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
Thanks for reading. Happy coding!