Logistic regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It is used to predict outcomes when the dependent variable is binary, that is, when it takes only two values, such as yes or no, or true or false.

In this article, we will fit a logistic regression model with statsmodels, read the results it reports, and deal with a common error you may run into along the way.

## Step 1: Create the Data

First, let’s create a pandas DataFrame that contains three variables:

```python
import pandas as pd
# Create a dictionary of data
data = {'Age': [25, 36, 28, 41, 29, 33],
        'Salary': [50000, 72000, 61000, 90000, 65000, 80000],
        'Purchased': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes']}
# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)
# View the DataFrame
print(df)
```

Output:

```
   Age  Salary Purchased
0   25   50000        No
1   36   72000       Yes
2   28   61000        No
3   41   90000       Yes
4   29   65000        No
5   33   80000       Yes
```

The DataFrame contains three variables: Age, Salary, and Purchased. The Age and Salary variables are numeric, while the Purchased variable is categorical with two levels: Yes and No.

## Step 2: Fit the Logistic Regression Model

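Next, we’ll use the `logit()` function to fit the logistic regression model. `logit()` needs a numeric 0/1 outcome, so we first encode `Purchased` as `Purchased_binary`:

```python
import statsmodels.formula.api as smf
# Encode the outcome as 1 for 'Yes' and 0 for 'No'
df['Purchased_binary'] = df['Purchased'].apply(lambda x: 1 if x == 'Yes' else 0)
# Fit the logistic regression model
model = smf.logit(formula='Purchased_binary ~ Age + Salary', data=df).fit()
# Print the summary of the model
print(model.summary())
```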

Output:

```
Optimization terminated successfully.
         Current function value: 0.471146
         Iterations 6
                           Logit Regression Results
==============================================================================
Dep. Variable:       Purchased_binary   No. Observations:                    6
Model:                          Logit   Df Residuals:                        3
Method:                           MLE   Df Model:                            2
Date:                Tue, 15 Mar 2023   Pseudo R-squ.:                  0.2904
Time:                        00:00:00   Log-Likelihood:                -2.8269
converged:                       True   LL-Null:                       -3.9810
Covariance Type:            nonrobust   LLR p-value:                    0.1835
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -12.1405     10.522     -1.155      0.248     -32.704       8.423
Age            0.2712      0.312      0.870      0.384      -0.339       0.882
Salary         0.0005      0.000      1.703      0.089      -0.000       0.001
==============================================================================
```

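In this example, we used the `logit()` function from `statsmodels.formula.api` to fit the logistic regression model. The formula specifies the dependent variable (`Purchased_binary`) and the independent variables (`Age` and `Salary`). We then printed the summary of the model using the `.summary()` method. The summary reports the coefficients, standard errors, z-values, p-values, and 95% confidence intervals. Each coefficient is the change in the log-odds of a purchase for a one-unit increase in that variable, with the other variable held fixed.

Note that this six-row sample is perfectly separated: every customer with a salary above 65,000 purchased and no one at or below it did. Treat the numbers above as illustrative, because running the code yourself may produce the error discussed below instead.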
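Coefficients on the log-odds scale are hard to read directly, so they are often exponentiated into odds ratios. The following is a minimal sketch in plain Python that does this for the values printed in the summary table above. The variable names are only illustrative, and the `Salary` coefficient is rounded to four decimals in the table, so its odds ratio is approximate.

```python
import math

# Coefficients and the Age confidence bounds, copied from the summary table above.
coef_age = 0.2712
coef_salary = 0.0005          # rounded in the table, so treat as approximate
age_ci_low, age_ci_high = -0.339, 0.882

# exp(coefficient) is the multiplicative change in the odds of a purchase
# for a one-unit increase in the variable.
odds_ratio_age = math.exp(coef_age)
# A one-unit change in Salary is tiny, so rescale to a change of 1,000.
odds_ratio_salary_1k = math.exp(coef_salary * 1000)

# Exponentiating the interval bounds gives an interval for the odds ratio.
or_ci_low = math.exp(age_ci_low)
or_ci_high = math.exp(age_ci_high)

print(f"Odds ratio per extra year of age: {odds_ratio_age:.2f}")
print(f"Odds ratio per extra 1,000 of salary: {odds_ratio_salary_1k:.2f}")
print(f"95% interval for the age odds ratio: {or_ci_low:.2f} to {or_ci_high:.2f}")
```

An odds ratio of roughly 1.31 for `Age` means each extra year multiplies the odds of a purchase by about 1.31. The interval for that odds ratio contains 1, which matches the large p-value for `Age` in the table.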

## All the Code in One Place

```python
import pandas as pd
import statsmodels.formula.api as smf
# Create a pandas DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000],
    'Purchased': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']
})
# Create a binary variable indicating whether Purchased is 'Yes' or 'No'
df['Purchased_binary'] = df['Purchased'].apply(lambda x: 1 if x == 'Yes' else 0)
# Fit the logistic regression model
model = smf.logit(formula='Purchased_binary ~ Age + Salary', data=df).fit()
# Print the summary of the model
print(model.summary())
```

## If You See an Error Like This

`statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available`

The `PerfectSeparationError` is raised when you fit a logistic regression model to data that is perfectly separated. Perfect separation occurs when an independent variable, or a linear combination of independent variables, predicts the outcome without error, so that a single threshold splits the observations cleanly into the two outcome groups. In the data above, every customer aged 40 or older purchased and every younger customer did not.

When perfect separation occurs, the maximum likelihood estimates do not exist. The maximum likelihood estimation (MLE) algorithm used to estimate the coefficients can always improve the likelihood by making the coefficients larger, so they drift toward infinity instead of settling on a finite solution, and statsmodels stops and reports that results are not available.
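To make the failure concrete, here is a minimal sketch in plain Python (no statsmodels) that evaluates the log-likelihood of a one-predictor model on the `Age` column for increasingly steep slopes. The `log_likelihood` helper and the 37.5 cut point (the midpoint between ages 35 and 40) are assumptions made for illustration, not anything statsmodels computes.

```python
import math

# Ages and outcomes from the example above (1 = purchased, 0 = did not).
ages = [25, 30, 35, 40, 45, 50]
purchased = [0, 0, 0, 1, 1, 1]

def log_likelihood(slope, threshold=37.5):
    """Log-likelihood of a logit model on Age whose decision boundary
    sits at `threshold` (an assumed midpoint between 35 and 40)."""
    total = 0.0
    for age, y in zip(ages, purchased):
        z = slope * (age - threshold)        # linear predictor (log-odds)
        signed = z if y == 1 else -z         # positive when the point is on the correct side
        # log of the fitted probability of the observed outcome
        total += -math.log1p(math.exp(-signed))
    return total

slopes = [0.1, 1.0, 10.0]
values = [log_likelihood(s) for s in slopes]
for s, v in zip(slopes, values):
    print(f"slope={s:>5}: log-likelihood={v:.6f}")
```

Each steeper slope gives a log-likelihood closer to zero, and zero is never reached, so there is no finite slope at which the optimizer can stop.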

To handle this error, there are several approaches that can be used:

- Remove the variable(s) causing perfect separation from the model.
- Combine categories or collapse levels of the variable(s) causing perfect separation.
- Use a penalized maximum likelihood estimation (e.g., Firth’s method) that can handle separation.

It’s important to note that perfect separation is not always a real phenomenon and can sometimes be an artifact of small sample sizes or extreme values. Therefore, it’s important to carefully examine the data and understand the underlying causes of perfect separation before taking any action.
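One simple way to examine the data is to check, one predictor at a time, whether the two outcome groups overlap. The sketch below uses a hypothetical helper, `separating_threshold`, written in plain Python. It only detects separation by a single variable, so a clean result does not rule out separation by a combination of variables.

```python
def separating_threshold(values, outcomes):
    """Return a cut point that splits the two outcome groups cleanly,
    or None if the groups overlap on this predictor."""
    zeros = [v for v, y in zip(values, outcomes) if y == 0]
    ones = [v for v, y in zip(values, outcomes) if y == 1]
    if not zeros or not ones:
        return None  # only one outcome present, nothing to separate
    if max(zeros) < min(ones):
        return (max(zeros) + min(ones)) / 2
    if max(ones) < min(zeros):
        return (max(ones) + min(zeros)) / 2
    return None

ages = [25, 30, 35, 40, 45, 50]
purchased = [0, 0, 0, 1, 1, 1]
print(separating_threshold(ages, purchased))   # 37.5

mixed = [0, 1, 0, 1, 0, 1]
print(separating_threshold(ages, mixed))       # None
```

Because the function only iterates over its inputs, pandas columns such as `df['Age']` and `df['Purchased_binary']` can be passed in directly.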

Here’s an example of how to handle the **PerfectSeparationError** when fitting a logistic regression model using statsmodels:

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tools.sm_exceptions import PerfectSeparationError
# Create a pandas DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000],
    'Purchased': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']
})
# Create a binary variable indicating whether Purchased is 'Yes' or 'No'
df['Purchased_binary'] = df['Purchased'].apply(lambda x: 1 if x == 'Yes' else 0)
try:
    # Fit the logistic regression model
    model = smf.logit(formula='Purchased_binary ~ Age + Salary', data=df).fit()
    # Print the summary of the model
    print(model.summary())
except PerfectSeparationError as e:
    # Handle the PerfectSeparationError
    print("Perfect separation detected, results not available.")
    print(e)
```

In this example, we wrapped the `logit()` call and `.fit()` in a try-except block to catch the `PerfectSeparationError`. If the error occurs, we print a message saying that perfect separation was detected and no results are available, followed by the error message itself for debugging.

Simply catching the `PerfectSeparationError` is not always the best approach. Depending on your statsmodels version, the fit may emit a warning and return unreliable estimates instead of raising, in which case the `except` branch never runs. Either way, it is usually better to address the separation itself (for example, by removing the variable(s) causing it) or to use a different modeling approach altogether.
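As one illustration of an alternative approach, the sketch below fits a penalized model by hand on the `Age` column. It is not Firth’s method and not a statsmodels API: `fit_ridge_logit` is a hypothetical helper that runs gradient ascent on the log-likelihood with a ridge (L2) penalty on the slope, and the `penalty`, `learning_rate`, and `steps` defaults are arbitrary choices for this small example.

```python
import math

def fit_ridge_logit(x, y, penalty=1.0, learning_rate=0.1, steps=5000):
    """Fit the intercept and slope of a one-predictor logit model by gradient
    ascent on log-likelihood minus (penalty / 2) * slope**2.
    Hypothetical helper for illustration; not part of statsmodels."""
    # Standardize the predictor so a fixed step size stays stable.
    mean = sum(x) / len(x)
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / len(x))
    z = [(v - mean) / std for v in x]

    intercept, slope = 0.0, 0.0
    for _ in range(steps):
        probs = [1.0 / (1.0 + math.exp(-(intercept + slope * zi))) for zi in z]
        # Gradient of the penalized log-likelihood (the intercept is not penalized).
        grad_intercept = sum(yi - pi for yi, pi in zip(y, probs))
        grad_slope = sum((yi - pi) * zi for yi, pi, zi in zip(y, probs, z)) - penalty * slope
        intercept += learning_rate * grad_intercept
        slope += learning_rate * grad_slope
    return intercept, slope

ages = [25, 30, 35, 40, 45, 50]
purchased = [0, 0, 0, 1, 1, 1]
intercept, slope = fit_ridge_logit(ages, purchased)
print(f"intercept={intercept:.3f}, slope per standard deviation of age={slope:.3f}")
```

The penalty pulls the slope back toward zero, so the estimate stays finite even though the data is perfectly separated. The price is that the result depends on the chosen penalty strength, which is why the choice should be justified rather than left at a default.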

## Wrap up

To learn more about logistic regression, see the Wikipedia article:

https://en.wikipedia.org/wiki/Logistic_regression

Thanks for reading. Happy coding!