# Logistic Regression

By Chi Kit Yeung in Statistics Python Machine Learning

August 17, 2024

# Introduction

Logistic regression is a form of supervised machine learning used to predict a categorical dependent variable from one or more independent variables. In its most common form, it is a model that predicts the probability of a binary outcome (1 or 0, True or False).

Brainstorming some problems that may be answered using logistic regression:

- Will the customer buy anything?
- Will it rain?
- Will this student get admitted?
- Will this person land a data scientist job?

I realized that these example problems all expect a ‘Yes’ or ‘No’ answer, which the model represents as a ‘True’ or ‘False’ (1 or 0) category.
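To make the "probability of a binary outcome" idea concrete, here is a minimal sketch of the logistic (sigmoid) function that gives the model its name. The coefficients `b0` and `b1` are made-up values for illustration only; they do not come from any dataset in this post.

```python
import math

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: intercept b0 and slope b1 (illustrative values only)
b0, b1 = -1.5, 0.8

def predict_probability(x):
    """Probability that the outcome is 1 for a single input value x."""
    return sigmoid(b0 + b1 * x)

for x in (-2, 0, 2, 5):
    p = predict_probability(x)
    label = 1 if p >= 0.5 else 0   # turn the probability into a Yes/No answer
    print(f"x={x:>2}  P(y=1)={p:.3f}  predicted class={label}")
```

The model itself only outputs the probability; the Yes/No answer comes from comparing that probability against a cutoff, which is 0.5 in this sketch.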

## Assumptions

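Logistic regression shares most of its assumptions with linear regression, such as independent observations and no strong multicollinearity among the independent variables. The main exception is the first one, ‘linearity’: rather than a linear relationship between the independent variables and the outcome itself, logistic regression assumes a linear relationship between the independent variables and the log-odds of the outcome.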

## Maximum Likelihood Estimation (MLE)

MLE is the method used to estimate the parameter values of the logistic regression model. In layman's terms, it is the method used to find the logistic function that best explains the data. It works iteratively: at each step it adjusts the parameters and re-evaluates a likelihood function, stopping once it finds the parameter values that maximize the likelihood of the observed data.
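The iterative idea can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data is synthetic, the "true" parameters are invented, and the optimizer is plain gradient ascent, which is not necessarily what statsmodels does internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data generated from known (hypothetical) parameters
n = 1000
x = rng.normal(size=n)
true_b0, true_b1 = -0.5, 1.5
p_true = 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * x)))
y = (rng.random(n) < p_true).astype(float)

X = np.column_stack([np.ones(n), x])  # constant column plus the feature

def log_likelihood(beta):
    """Sum of log P(observed y | x, beta) over all rows."""
    z = X @ beta
    # y*z - log(1 + e^z), written with logaddexp for numerical stability
    return np.sum(y * z - np.logaddexp(0.0, z))

beta = np.zeros(2)              # start with all parameters at zero
ll_start = log_likelihood(beta)
learning_rate = 1.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    gradient = X.T @ (y - p) / n          # average gradient of the log-likelihood
    beta = beta + learning_rate * gradient  # step uphill

ll_end = log_likelihood(beta)
print("estimated parameters:", beta)
print("log-likelihood at start:", ll_start, "after fitting:", ll_end)
```

Both log-likelihood values are negative, and the fitted one is the higher (less negative) of the two.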

### Log-likelihood

- The value of Log-likelihood is almost always negative
- The higher the value, the better.

### LL-Null

This is the Log-likelihood value of a model with no independent variables. This value is useful as a comparison tool against the calculated Log-likelihood value to see if the model has any explanatory power.

### LLR p-value (Log Likelihood Ratio)

The value is based on the Log-likelihood of the model and the LL-Null. It tests whether the model is statistically different from the LL-Null model, that is, from a model with no independent variables. The lower the p-value, the better.

### Pseudo R-squared

Unlike linear regression, logistic regression has no statistic that is directly equivalent to R-squared. Instead, there is the Pseudo R-squared; the version reported by statsmodels is McFadden's. As a rule of thumb, a good Pseudo R-squared value is between 0.2 and 0.4.
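As a rough check on how these three statistics relate, the sketch below recomputes them by hand from the Log-Likelihood and LL-Null figures reported later in this post for the single-variable model. The closed-form p-value used here is only valid for one degree of freedom; that shortcut is my choice for keeping the example dependency-free.

```python
import math

# Figures taken from the single-variable summary table later in this post
log_likelihood = -282.89   # "Log-Likelihood"
ll_null = -359.05          # "LL-Null"
df_model = 1               # "Df Model": number of independent variables

# Likelihood ratio statistic: twice the gap between the two log-likelihoods
lr_statistic = 2 * (log_likelihood - ll_null)

# For 1 degree of freedom the chi-squared tail probability has a closed form.
# With more variables, scipy.stats.chi2.sf(lr_statistic, df_model) is the usual route.
llr_p_value = math.erfc(math.sqrt(lr_statistic / 2))

# McFadden's pseudo R-squared
pseudo_r2 = 1 - log_likelihood / ll_null

print(f"LR statistic:  {lr_statistic:.2f}")
print(f"LLR p-value:   {llr_p_value:.3e}")
print(f"Pseudo R-squ.: {pseudo_r2:.4f}")
```

The results should land close to the values in the summary table, with small differences because the inputs are rounded to two decimals.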

# Performing the Logistic Regression

## Using the statsmodels Package

Importing the packages

```
import pandas as pd
import statsmodels.api as sm
import numpy as np
```

Load the data

```
data = pd.read_csv('so_and_so.csv')
```

Declare the dependent and independent variables

```
y = data['y']
x1 = data['indep_A']
```

### The Regression Itself

```
x = sm.add_constant(x1)
model = sm.Logit(y, x)
results_log = model.fit()
```

Interpret and evaluate the model by generating the summary table.

```
>>> results_log.summary()
"""
                           Logit Regression Results
==============================================================================
Dep. Variable:                       y   No. Observations:                 518
Model:                           Logit   Df Residuals:                     516
Method:                            MLE   Df Model:                           1
Date:                 Sat, 17 Aug 2024   Pseudo R-squ.:                 0.2121
Time:                         18:44:43   Log-Likelihood:               -282.89
converged:                        True   LL-Null:                      -359.05
Covariance Type:             nonrobust   LLR p-value:                5.387e-35
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -1.7001      0.192     -8.863      0.000      -2.076      -1.324
indep_A        0.0051      0.001      9.159      0.000       0.004       0.006
==============================================================================
"""
```
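One way to read the coefficients in the table above is to convert them into odds ratios and predicted probabilities. The sketch below does this by hand with the two reported coefficients; the `indep_A` input values are hypothetical, picked only to show how the probability moves.

```python
import math

# Coefficients copied from the single-variable summary table above
const = -1.7001
coef_indep_A = 0.0051

# exp(coef) is the multiplicative change in the odds for a one-unit increase
odds_ratio = math.exp(coef_indep_A)

def predicted_probability(indep_a):
    """Plug a value of indep_A into the fitted logistic function."""
    log_odds = const + coef_indep_A * indep_a
    return 1 / (1 + math.exp(-log_odds))

print(f"odds ratio per unit of indep_A: {odds_ratio:.4f}")
for value in (0, 200, 400):   # hypothetical inputs, chosen only for illustration
    print(value, round(predicted_probability(value), 3))
```

A positive coefficient gives an odds ratio above 1, so the predicted probability rises as `indep_A` increases.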

Create a model with multiple independent variables.

```
x1 = data[['indep_A', 'indep_B', 'indep_C', 'indep_D', 'indep_E']]
x = sm.add_constant(x1)
multi_model = sm.Logit(y, x)
multi_results = multi_model.fit()
```

```
>>> multi_results.summary()
"""
                           Logit Regression Results
==============================================================================
Dep. Variable:                       y   No. Observations:                 518
Model:                           Logit   Df Residuals:                     512
Method:                            MLE   Df Model:                           5
Date:                 Sat, 17 Aug 2024   Pseudo R-squ.:                 0.5143
Time:                         18:46:53   Log-Likelihood:               -174.39
converged:                        True   LL-Null:                      -359.05
Covariance Type:             nonrobust   LLR p-value:                1.211e-77
=================================================================================
                    coef    std err          z      P>|z|      [0.025      0.975]
---------------------------------------------------------------------------------
const            -0.0211      0.311     -0.068      0.946      -0.631       0.589
indep_A           0.0070      0.001      9.381      0.000       0.006       0.008
indep_B          -0.8001      0.089     -8.943      0.000      -0.975      -0.625
indep_C          -1.8322      0.330     -5.556      0.000      -2.478      -1.186
indep_D           2.3585      1.088      2.169      0.030       0.227       4.490
indep_E           1.5363      0.501      3.067      0.002       0.554       2.518
=================================================================================
"""
```

### Generating Predictions

Predictions can be generated using the `predict` method.

```
multi_results.predict(x)
```

`x` is a DataFrame with the same features used to train the model, in the same order. Be sure to also add a constant column to the beginning of the DataFrame using statsmodels’ `add_constant` function. The predictions are returned as a Series of float values that represent probabilities (e.g. something like 0.787412). The results can be rounded to obtain the predictions as 0 or 1 values.

```
multi_results.predict(x).round()
```
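Rounding is the same as using a cutoff of 0.5, with one small wrinkle at exactly 0.5. The sketch below uses a hypothetical array of probabilities in place of the real `predict` output and shows an explicit threshold as an alternative; the `classify` helper is my own illustration, not part of statsmodels.

```python
import numpy as np

# Hypothetical predicted probabilities, standing in for multi_results.predict(x)
probabilities = np.array([0.787412, 0.5, 0.12, 0.93])

rounded = np.round(probabilities)                 # rounds 0.5 to the nearest even value
thresholded = (probabilities >= 0.5).astype(int)  # explicit cutoff at 0.5

def classify(probs, threshold=0.5):
    """Turn probabilities into 0/1 labels using a chosen cutoff."""
    return (np.asarray(probs) >= threshold).astype(int)

print(rounded)                       # [1. 0. 0. 1.]
print(thresholded)                   # [1 1 0 1]
print(classify(probabilities, 0.8))  # [0 0 0 1]
```

An explicit threshold also makes it easy to trade false positives against false negatives by moving the cutoff away from 0.5.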

## Using Scikit Learn

This was actually not included in the online course so I searched for the method independently.

```
# Importing the packages
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
```

```
# Define the dependent and independent variables
y = data['y']
x = data[['indep_A', 'indep_B', 'indep_C', 'indep_D', 'indep_E']]
```

```
# Scale and Perform the Regression
log = LogisticRegression()
scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)
log.fit(x_scaled, y)
```

```
# Generate Predictions
log.predict(x_scaled)
```
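Unlike the statsmodels `predict` method, scikit-learn's `predict` returns class labels; the probabilities come from `predict_proba`. Here is a minimal sketch that also holds out a test set and wraps the scaler and model in a pipeline. The data is synthetic and the split settings are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (hypothetical values)
x = rng.normal(size=(400, 3))
log_odds = 0.5 + 1.2 * x[:, 0] - 0.8 * x[:, 1]
y = (rng.random(400) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Hold out a test set so the scaler and model never see it during fitting
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=0
)

# The pipeline fits the scaler on the training data only, then the classifier
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(x_train, y_train)

probabilities = pipeline.predict_proba(x_test)[:, 1]  # P(y = 1) for each row
labels = pipeline.predict(x_test)                      # 0/1 class labels

print("first five probabilities:", probabilities[:5])
print("test accuracy:", pipeline.score(x_test, y_test))
```

Keep in mind that `LogisticRegression` applies L2 regularization by default, so its coefficients will generally differ a little from the unpenalized statsmodels fit.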

# Checking the Accuracy

## Confusion Matrix

```
def confusion_matrix(data, actual_values, model):
    """
    Confusion matrix

    Parameters
    ----------
    data: data frame or array
        data is a data frame formatted in the same way as your input data (without the actual values)
        e.g. const, var1, var2, etc. Order is very important!
    actual_values: data frame or array
        These are the actual values from the test_data
        In the case of a logistic regression, it should be a single column with 0s and 1s
    model: a LogitResults object
        this is the variable where you have the fitted model
        e.g. results_log in this course
    ----------
    """
    # Predict the values using the Logit model
    pred_values = model.predict(data)
    # Specify the bins
    bins = np.array([0, 0.5, 1])
    # Create a histogram, where values between 0 and 0.5 will be considered 0
    # and values between 0.5 and 1 will be considered 1
    cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
    # Calculate the accuracy
    accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()
    # Return the confusion matrix and the accuracy
    return cm, accuracy
```
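To see what the `np.histogram2d` trick is doing, here is a small worked example that skips the model and uses hand-written values. The labels and probabilities are hypothetical, chosen so the counts are easy to verify by eye.

```python
import numpy as np

# Hypothetical actual labels and predicted probabilities (illustrative only)
actual_values = np.array([0, 0, 1, 1, 1])
pred_values = np.array([0.2, 0.7, 0.9, 0.4, 0.6])

bins = np.array([0, 0.5, 1])
# Rows are actual classes (0, 1); columns are predicted classes (0, 1)
cm = np.histogram2d(actual_values, pred_values, bins=bins)[0]
accuracy = (cm[0, 0] + cm[1, 1]) / cm.sum()

true_negatives, false_positives = cm[0]
false_negatives, true_positives = cm[1]

print(cm)
print("accuracy:", accuracy)   # 3 of 5 predictions are correct -> 0.6
```

The diagonal of the matrix holds the correct predictions, which is why the accuracy is the sum of `cm[0, 0]` and `cm[1, 1]` divided by the total count.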

[TBA]