Logistic regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables. It is used to predict outcomes when the dependent variable is binary, that is, when it takes only two values, such as yes or no, or true or false.

In this article, we will fit a logistic regression model using statsmodels, look at how to read the results, and see how to deal with the PerfectSeparationError that small datasets can trigger.
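As a quick illustration of the idea, logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function to get a probability between 0 and 1. The sketch below uses only the standard library; the coefficient values are made up for illustration and are not estimates from any dataset.

```python
import math

def sigmoid(z):
    """Map a linear predictor z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients, chosen only to illustrate the shape of the model
intercept, slope = -7.0, 0.2

for age in (25, 35, 45):
    p = sigmoid(intercept + slope * age)
    print(f"Age {age}: P(purchase) = {p:.3f}")
```

With these made-up values, the predicted probability rises with age and crosses 0.5 around age 35.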

Step 1: Create the Data

First, let’s create a pandas DataFrame that contains three variables:

import pandas as pd

# Create a dictionary of data
data = {'Age': [25, 36, 28, 41, 29, 33],
        'Salary': [50000, 72000, 61000, 90000, 65000, 80000],
        'Purchased': ['No', 'Yes', 'No', 'Yes', 'No', 'Yes']}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)

# View the DataFrame
print(df)

Output:

#    Age  Salary Purchased
# 0   25   50000        No
# 1   36   72000       Yes
# 2   28   61000        No
# 3   41   90000       Yes
# 4   29   65000        No
# 5   33   80000       Yes

The DataFrame contains three variables: Age, Salary, and Purchased. The Age and Salary variables are numeric, while the Purchased variable is categorical with two levels: Yes and No.

Step 2: Fit the Logistic Regression Model

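Next, we’ll use the logit() function to fit the logistic regression model. Because logit() needs a numeric dependent variable, we first encode Purchased as 0/1 in a new column, Purchased_binary:

import statsmodels.formula.api as smf

# Create a binary variable: 1 if Purchased is 'Yes', otherwise 0
df['Purchased_binary'] = df['Purchased'].apply(lambda x: 1 if x == 'Yes' else 0)

# Fit the logistic regression model
model = smf.logit(formula='Purchased_binary ~ Age + Salary', data=df).fit()

# Print the summary of the model
print(model.summary())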

Output:

Optimization terminated successfully.
         Current function value: 0.471146
         Iterations 6
                           Logit Regression Results                           
==============================================================================
Dep. Variable:       Purchased_binary   No. Observations:                    6
Model:                          Logit   Df Residuals:                        3
Method:                           MLE   Df Model:                            2
Date:                Tue, 15 Mar 2023   Pseudo R-squ.:                  0.2904
Time:                        00:00:00   Log-Likelihood:                -2.8269
converged:                       True   LL-Null:                       -3.9810
Covariance Type:            nonrobust   LLR p-value:                    0.1835
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept    -12.1405     10.522     -1.155      0.248     -32.704       8.423
Age            0.2712      0.312      0.870      0.384      -0.339       0.882
Salary         0.0005      0.000      1.703      0.089      -0.000       0.001
==============================================================================

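In this example, we used the logit() function from statsmodels.formula.api to fit the logistic regression model. The formula specifies the dependent variable (Purchased_binary) and the independent variables (Age and Salary). We then printed the model summary with the .summary() method, which reports the coefficients, standard errors, z-values, p-values, and 95% confidence intervals.

Treat the summary above as an illustration of the layout rather than a result to reproduce: in the six rows from Step 1, every customer aged 33 or older purchased and every younger one did not, so fitting on exactly that data is likely to run into the perfect separation problem described later in this article.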
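Logit coefficients are on the log-odds scale, so a common way to read them is to exponentiate them into odds ratios. Here is a minimal sketch that takes the Age coefficient and its confidence bounds from the summary table above at face value; with a fitted model you could exponentiate model.params instead.

```python
import math

# Values copied from the Age row of the summary table above
coef, ci_low, ci_high = 0.2712, -0.339, 0.882

odds_ratio = math.exp(coef)
or_low, or_high = math.exp(ci_low), math.exp(ci_high)

print(f"Odds ratio for Age: {odds_ratio:.2f}")
print(f"95% interval: {or_low:.2f} to {or_high:.2f}")

# An interval that contains 1 means "no effect" cannot be ruled out
print("Interval contains 1:", or_low < 1 < or_high)
```

Read this way, each additional year of age multiplies the estimated odds of a purchase by about 1.31, but the interval spans 1, which is consistent with the large p-value shown for Age.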

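All the code in one place

The full script below combines the encoding and fitting steps, this time on a different six-row dataset:

import pandas as pd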
import statsmodels.formula.api as smf

# Create a pandas DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000],
    'Purchased': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']
})

# Create a binary variable indicating whether Purchased is 'Yes' or 'No'
df['Purchased_binary'] = df['Purchased'].apply(lambda x: 1 if x == 'Yes' else 0)

# Fit the logistic regression model
model = smf.logit(formula='Purchased_binary ~ Age + Salary', data=df).fit()

# Print the summary of the model
print(model.summary())

Running this script can produce an error like this:

statsmodels.tools.sm_exceptions.PerfectSeparationError: Perfect separation detected, results not available

PerfectSeparationError is raised when the dataset passed to a logistic regression model has perfect separation. Perfect separation occurs when one independent variable, or a combination of them, completely predicts the outcome variable, so that the data points split cleanly into the two outcome groups. In the dataset above, for example, every row with an Age of 35 or below is 'No' and every row with an Age of 40 or above is 'Yes'.
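One way to spot this before fitting is to check, for each numeric predictor, whether a single cut-off splits the two outcome groups. The helper below uses a hypothetical name (separates); it is not a statsmodels function, and it only detects separation by one variable at a time, not by a combination of variables.

```python
def separates(values, labels):
    """Return True if one threshold on `values` splits the 0s from the 1s."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    if not zeros or not ones:
        return True  # only one class present: trivially separated
    return max(zeros) < min(ones) or max(ones) < min(zeros)

# The dataset that triggers the error above
age = [25, 30, 35, 40, 45, 50]
salary = [50000, 60000, 70000, 80000, 90000, 100000]
purchased = [0, 0, 0, 1, 1, 1]  # 'No' -> 0, 'Yes' -> 1

print("Age separates outcome:", separates(age, purchased))
print("Salary separates outcome:", separates(salary, purchased))
```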

When perfect separation occurs, the maximum likelihood estimation (MLE) algorithm used to estimate the coefficients cannot converge to a finite solution: the likelihood keeps improving as the coefficient on the separating variable grows without bound, so no finite maximum exists and results are not available.
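To see why no finite estimate exists, the sketch below evaluates the log-likelihood of a one-variable model on the separated Age data for increasingly steep slopes. The cut-off of 37.5 (midway between the oldest 'No' and the youngest 'Yes') and the slope values are illustrative choices.

```python
import math

age = [25, 30, 35, 40, 45, 50]
purchased = [0, 0, 0, 1, 1, 1]

def log_likelihood(slope, cutoff=37.5):
    """Log-likelihood of P(y=1) = sigmoid(slope * (age - cutoff))."""
    total = 0.0
    for x, y in zip(age, purchased):
        z = slope * (x - cutoff)
        # log(sigmoid(z)) if y == 1, log(1 - sigmoid(z)) if y == 0
        signed = z if y == 1 else -z
        total += -math.log1p(math.exp(-signed))
    return total

for slope in (0.1, 1.0, 10.0):
    print(f"slope={slope:>4}: log-likelihood = {log_likelihood(slope):.6f}")
```

The log-likelihood keeps climbing toward 0 as the slope grows, so the optimizer never reaches a maximum at a finite coefficient.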

To handle this error, there are several approaches that can be used:

  1. Remove the variable(s) causing perfect separation from the model.
  2. Combine categories or collapse levels of the variable(s) causing perfect separation.
  3. Use a penalized maximum likelihood estimation (e.g., Firth’s method) that can handle separation.
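As a rough illustration of the third option, the sketch below fits a one-variable logistic model with a ridge (L2) penalty by plain gradient descent. This is a simpler stand-in: it is not Firth's method and not a statsmodels API. The function name fit_ridge_logit, the penalty strength lam, the step size, and the iteration count are all arbitrary illustrative choices.

```python
import math

age = [25, 30, 35, 40, 45, 50]
purchased = [0, 0, 0, 1, 1, 1]

# Standardize Age so the penalty and step size act on a sensible scale
mean = sum(age) / len(age)
std = math.sqrt(sum((a - mean) ** 2 for a in age) / len(age))
x = [(a - mean) / std for a in age]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_ridge_logit(x, y, lam=1.0, step=0.1, n_iter=5000):
    """Minimize the negative log-likelihood plus (lam / 2) * slope**2."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        errors = [sigmoid(b0 + b1 * xi) - yi for xi, yi in zip(x, y)]
        grad_b0 = sum(errors)  # intercept is left unpenalized
        grad_b1 = sum(e * xi for e, xi in zip(errors, x)) + lam * b1
        b0 -= step * grad_b0
        b1 -= step * grad_b1
    return b0, b1

b0, b1 = fit_ridge_logit(x, purchased)
print(f"intercept={b0:.3f}, slope={b1:.3f} (per standard deviation of Age)")
```

Because the penalty grows with the size of the slope, the estimate settles at a finite value even though the data are perfectly separated. For real analyses, a dedicated implementation of Firth's method or a regularized fit from an established library would be the safer choice.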

It’s important to note that perfect separation is not always a real phenomenon and can sometimes be an artifact of small sample sizes or extreme values. Therefore, it’s important to carefully examine the data and understand the underlying causes of perfect separation before taking any action.

Here’s an example of how to handle the PerfectSeparationError when fitting a logistic regression model using statsmodels:

import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.tools.sm_exceptions import PerfectSeparationError

# Create a pandas DataFrame
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50],
    'Salary': [50000, 60000, 70000, 80000, 90000, 100000],
    'Purchased': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes']
})

# Create a binary variable indicating whether Purchased is 'Yes' or 'No'
df['Purchased_binary'] = df['Purchased'].apply(lambda x: 1 if x == 'Yes' else 0)

try:
    # Fit the logistic regression model
    model = smf.logit(formula='Purchased_binary ~ Age + Salary', data=df).fit()

    # Print the summary of the model
    print(model.summary())

except PerfectSeparationError as e:
    # Handle the PerfectSeparationError
    print("Perfect separation detected, results not available.")
    print(e)

In this example, we’ve added a try-except block around the logit() function call to catch the PerfectSeparationError. If this error occurs, we print a message indicating that perfect separation has been detected and the results are not available. We also print the error message for debugging purposes.

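It’s worth noting that simply catching the PerfectSeparationError may not always be the best approach. Depending on the situation, it may be more appropriate to address the perfect separation itself (for example, by removing the variable(s) causing it) or to use an alternative modeling approach altogether. Also, depending on the installed statsmodels version, perfect separation may be reported as a warning rather than raised as an exception, in which case the except branch above is never reached.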

Wrap up

To learn more about logistic regression, see:
https://en.wikipedia.org/wiki/Logistic_regression


Thanks for reading. Happy coding!