Analyzing Output and Results: -2 Log-Likelihood and Omnibus Test
Logistic regression analysis typically follows a stepwise approach. Here’s a more detailed explanation of the process:
Starting Point: Baseline Model
We begin with a simple model that includes only the constant term. This serves as our starting point for comparison.
Variable Inclusion Methods
There are several ways to add variables to the model:
- ‘Enter’ method: All variables are added at once.
- Stepwise methods: Variables are added or removed one at a time based on statistical criteria.
Model Evaluation: -2 Log-Likelihood (-2LL) Test
At each step, we calculate the -2LL statistic. This measures how well the model’s predictions match the observed data: a decrease in -2LL indicates improved model fit.
Omnibus Test
This test compares the -2LL of the initial (constant-only) model with that of the final model: the difference between the two follows a chi-square distribution and tells us whether adding variables has significantly improved the model’s predictive power. A p-value below 0.05 suggests a meaningful improvement.
Iterative Process
We continue this process of adding variables and evaluating the model until we find the right balance between model complexity and accuracy. This approach allows us to carefully examine each variable’s impact and build a robust predictive model.
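To make this iterative procedure concrete, here is a minimal sketch of a forward selection routine built on statsmodels. It adds, at each step, the candidate variable that produces the largest significant drop in -2LL (a likelihood-ratio test). The function name, the candidate list, and the 0.05 threshold are illustrative assumptions; statsmodels does not ship a stepwise procedure of its own.
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def forward_selection(df, outcome, candidates, alpha=0.05):
    """Forward stepwise selection for a logistic model, driven by the drop in -2LL."""
    candidates = list(candidates)
    selected = []
    # Baseline model: constant term only
    current = sm.Logit(df[outcome], np.ones(len(df))).fit(disp=False)
    while candidates:
        best = None
        for var in candidates:
            X = sm.add_constant(df[selected + [var]])
            trial = sm.Logit(df[outcome], X).fit(disp=False)
            lr_stat = -2 * (current.llf - trial.llf)  # drop in -2LL from adding var
            p_value = chi2.sf(lr_stat, 1)             # one extra parameter
            if p_value < alpha and (best is None or lr_stat > best[1]):
                best = (var, lr_stat, trial)
        if best is None:
            break  # no remaining variable significantly improves the fit
        var, _, current = best
        selected.append(var)
        candidates.remove(var)
    return selected, current

# Example usage (assuming a DataFrame df with these columns):
# selected, final_model = forward_selection(df, 'Heart_Disease', ['Age', 'Blood_Pressure', 'Cholesterol'])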
Classification Table
In addition to the Omnibus test (which assesses the fit of the model with independent variables against the null model), another crucial evaluation concerns the accuracy of the model’s predictions.
The classification table displays the following for the model:
- True positive cases: Observations coded as 1 in the dataset and classified as 1 by the model
- True negative cases: Observations coded as 0 in the dataset and classified as 0 by the model
- False positive cases: Observations coded as 0 in the dataset but classified as 1 by the model
- False negative cases: Observations coded as 1 in the dataset but classified as 0 by the model
Using these data, we can calculate the model’s prediction accuracy (percentage of correctly classified observations), sensitivity, and specificity.
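As a minimal sketch of these calculations (the observed outcomes and predicted probabilities below are made-up illustrative values, and the usual 0.5 threshold is assumed):
import numpy as np

# Hypothetical observed outcomes and model-predicted probabilities
y_observed = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p_predicted = np.array([0.84, 0.12, 0.40, 0.67, 0.55, 0.08, 0.91, 0.30])

# Assign each observation to a predicted class using the 0.5 threshold
y_predicted = (p_predicted >= 0.5).astype(int)

tp = np.sum((y_observed == 1) & (y_predicted == 1))  # true positives
tn = np.sum((y_observed == 0) & (y_predicted == 0))  # true negatives
fp = np.sum((y_observed == 0) & (y_predicted == 1))  # false positives
fn = np.sum((y_observed == 1) & (y_predicted == 0))  # false negatives

accuracy = (tp + tn) / len(y_observed)   # proportion of correctly classified observations
sensitivity = tp / (tp + fn)             # proportion of actual 1s correctly identified
specificity = tn / (tn + fp)             # proportion of actual 0s correctly identified
print(f"Accuracy: {accuracy:.2f}  Sensitivity: {sensitivity:.2f}  Specificity: {specificity:.2f}")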
Visualizing Output and Results: Graphical Representations
To visually represent our regression model’s performance, we can use several graphical methods.
The Probability Plot places each observation along the X-axis according to its predicted probability, from 0 to 1, with each observation marked as belonging to class 0 or class 1. In an ideal model, observations cluster at the extremes of the axis—those labeled 0 near the 0 end, and those labeled 1 near the 1 end—producing a characteristic U-shaped distribution.
The ROC (Receiver Operating Characteristic) curve, meanwhile, plots the true positive rate (sensitivity) against the false positive rate (1 − specificity) at various threshold settings, illustrating the trade-off between sensitivity and specificity. The area under this curve (AUC) provides a comprehensive measure of the model’s performance.
Analyzing Output and Results: Probability and Group Prediction
In logistic regression, two crucial variables can be generated beyond the coefficient summary: predicted probability and group membership. These aren’t mere model outputs; they can be added as new columns to the original dataset, offering case-by-case insights.
- Predicted probability: This variable represents the model’s estimated probability for each observation to belong to the category of interest (typically coded as 1). Values range from 0 to 1.
- Group membership: This variable categorizes each observation based on its predicted probability. A threshold (often 0.5) is used to assign observations to one of two groups. Observations with predicted probabilities above the threshold are assigned to group 1; otherwise, they’re assigned to group 0.
These variables are particularly valuable for:
- Assessing model performance at the individual observation level
- Identifying noteworthy or anomalous cases
- Creating informative visualizations, such as classification plots
By incorporating these columns into the original dataset, researchers can conduct more nuanced analyses and better interpret logistic model results.
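A minimal sketch of how these two columns can be appended to a dataset, using a small made-up example fitted with statsmodels (the column names are illustrative):
import pandas as pd
import statsmodels.api as sm

# Hypothetical mini-dataset: one predictor and a binary outcome
df_example = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6, 7, 8],
                           'outcome': [0, 0, 0, 1, 0, 1, 1, 1]})
X = sm.add_constant(df_example[['x']])
model = sm.Logit(df_example['outcome'], X).fit(disp=False)

# Predicted probability for each observation (a value between 0 and 1)
df_example['predicted_probability'] = model.predict(X)
# Group membership: class 1 if the predicted probability exceeds the 0.5 threshold
df_example['predicted_group'] = (df_example['predicted_probability'] >= 0.5).astype(int)
print(df_example)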
Analyzing Output and Results: Influences
Influence measures estimate the impact of each individual observation on the fitted model. Essentially, they show how the regression coefficients would change if a specific observation (e.g., a particular patient) were excluded from the analysis.
Influences help identify outliers (influential observations or leverage points) and assess their effect on the analysis. Several methods exist to evaluate these influences:
- DFBETAS: Measures the change in each regression coefficient if a specific observation were removed
- DFFITS: Assesses the change in the model’s prediction if a specific observation were removed
- Leverage: Indicates how far an observation’s values on the independent variables lie from those of the other observations
- Cook’s distance: Combines the size of the residual and the leverage into a single overall measure of influence
Measuring influences isn’t always necessary in logistic regression. It’s particularly useful when Exploratory Data Analysis graphs reveal extreme values, in models with numerous independent variables, or in cases with few observations where a single extreme value could significantly impact the analysis.
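A minimal sketch of the leave-one-out idea behind these measures, on made-up data: the model is refitted once per observation, and the change in the coefficients is recorded (DFBETAS is the standardized version of this change, obtained by dividing it by the coefficient’s standard error). statsmodels also offers built-in influence diagnostics, but the brute-force version below makes the definition explicit.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: fit the full model once
rng = np.random.default_rng(1)
x = rng.normal(size=50)
y = rng.binomial(1, 1 / (1 + np.exp(-x)))
X = sm.add_constant(pd.DataFrame({'x': x}))
full_fit = sm.Logit(y, X).fit(disp=False)

# Change in each coefficient when observation i is excluded (a DFBETA-style measure)
dfbeta = []
for i in range(len(y)):
    fit_i = sm.Logit(np.delete(y, i), X.drop(X.index[i])).fit(disp=False)
    dfbeta.append(full_fit.params - fit_i.params)
dfbeta = pd.DataFrame(dfbeta)

# Large absolute values flag observations with a strong influence on the coefficients
print(dfbeta.abs().max())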
Analyzing Output and Results: Residuals
In logistic regression, evaluating residuals is one method to assess the model’s goodness of fit. Residuals represent the difference between observed values and those predicted by the model.
There are several ways to calculate residuals:
- Unstandardized residuals: The raw difference between observed and predicted values, without further processing.
- Standardized residuals: Unstandardized residuals normalized by dividing them by the standard deviation of the residuals.
- Deviance residuals: The signed square root of each observation’s contribution to the model deviance, reflecting differences on the log-likelihood scale.
- Studentized residuals: Another standardized form, in which each residual is scaled by a variance estimate computed with that observation excluded.
Residuals help us better understand the model. Though not essential, they’re useful for evaluating the model’s fit and detecting anomalies.
Typically, it’s sufficient to start by examining unstandardized residuals to identify potential issues before considering other types.
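A minimal sketch of these calculations on made-up values, computing the raw residuals together with the Pearson form of the standardized residuals and the deviance residuals:
import numpy as np

# Hypothetical observed outcomes and model-predicted probabilities
y = np.array([1, 0, 1, 1, 0, 0, 1, 0])
p = np.array([0.84, 0.12, 0.40, 0.67, 0.55, 0.08, 0.91, 0.30])

# Unstandardized (raw) residuals: observed value minus predicted probability
raw_resid = y - p

# Pearson residuals: raw residuals scaled by the binomial standard deviation
pearson_resid = raw_resid / np.sqrt(p * (1 - p))

# Deviance residuals: signed square root of each observation's contribution to the deviance
deviance_resid = np.sign(raw_resid) * np.sqrt(-2 * (y * np.log(p) + (1 - y) * np.log(1 - p)))

print(np.round(raw_resid, 2))
print(np.round(pearson_resid, 2))
print(np.round(deviance_resid, 2))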
Analyzing Output and Results: Hosmer and Lemeshow Test
The Hosmer-Lemeshow test compares the model’s predicted probabilities with the actual frequencies observed in the dataset. Typically, the dataset is divided into groups, and for each group, the ratio of observed to predicted events is examined.
This analysis produces a chi-square test. If the test is not significant, it indicates that the predicted numbers closely match the observed numbers, suggesting the model has good predictive capacity. Conversely, a significant test result implies a discrepancy between observations and predictions, indicating that the model doesn’t perform well.
This test is thus used to assess the model’s goodness of fit.
Logistic Regression: Comparing Traditional Statistical and Machine Learning Approaches
When performing logistic regression using a traditional statistical approach, as commonly seen in econometrics or medical research, the entire dataset is analyzed as a whole. The primary goal here isn’t prediction, but rather inference—understanding the relationship between independent and dependent variables. Coefficient analysis is key to interpreting these relationships.
In contrast, Machine Learning focuses on creating models that predict dependent variables based on unseen independent variables. This approach typically involves splitting the dataset into training and test sets. The model is developed using the training set and then evaluated on the unseen test set. The goal is to build a model that can generalize knowledge from the dataset and apply it to new, unseen data. Common model metrics include accuracy, ROC curves, and precision-recall.
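As a minimal sketch of this train/test workflow, using scikit-learn and synthetic data purely for illustration:
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical data: three predictors and a binary outcome
X, y = make_classification(n_samples=500, n_features=3, n_informative=3, n_redundant=0, random_state=0)

# Split into training and test sets; the model only ever sees the training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Evaluate on the unseen test set
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_prob))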
Logistic Regression in Python: The Traditional Statistical Approach
In the following section, we’ll use Python to construct a comprehensive logistic regression analysis using a traditional statistical approach. We’ll explore the alternative approach in the website’s Machine Learning section.
Import the libraries:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy.stats import chi2
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score
Create a synthetic dataset using age, blood pressure, and cholesterol as independent variables, with heart disease as the dependent variable.
# Generate a synthetic dataset
np.random.seed(0)
n_samples = 500
age = np.random.randint(20, 80, size=n_samples)
blood_pressure = np.random.normal(120, 15, n_samples).astype(int)
cholesterol = np.random.normal(200, 50, n_samples).astype(int)
# Dependent variable (binary outcome)
log_odds = -5 + 0.05 * age + 0.03 * blood_pressure + 0.01 * cholesterol  # linear predictor (log-odds)
probability = 1 / (1 + np.exp(-log_odds))
heart_disease = np.random.binomial(1, probability)
# Creating the DataFrame
df = pd.DataFrame({
    'Age': age,
    'Blood_Pressure': blood_pressure,
    'Cholesterol': cholesterol,
    'Heart_Disease': heart_disease
})
df
Fit the logistic regression using all data
# Fit the logistic regression model using all data
X = sm.add_constant(df[['Age', 'Blood_Pressure', 'Cholesterol']])
y = df['Heart_Disease']
logit_model = sm.Logit(y, X).fit(disp=False)
Display the model summary with coefficients, p-values, and the -2 Log-Likelihood
print(logit_model.summary())
# Extract the log-likelihood and calculate -2 Log-Likelihood
log_likelihood = logit_model.llf
neg2_log_likelihood = -2 * log_likelihood
print(f"\nLog-Likelihood: {log_likelihood}")
print(f"-2 Log-Likelihood: {neg2_log_likelihood}")
Hosmer-Lemeshow test to assess the goodness of fit
# Predicted probabilities and decile groups for the Hosmer-Lemeshow test
predicted_probabilities = logit_model.predict(X)
df['predicted'] = predicted_probabilities
df['group'] = pd.qcut(df['predicted'], 10, labels=False)
# Calculating observed and expected frequencies in each decile
observed = df.groupby('group')['Heart_Disease'].sum()
expected = df.groupby('group')['predicted'].sum()
group_size = df.groupby('group').size()
# Calculating Hosmer-Lemeshow statistic
hl_statistic = ((observed - expected) ** 2 / expected +
                (group_size - observed - (group_size - expected)) ** 2 / (group_size - expected)).sum()
df_hl = 10 - 2  # degrees of freedom: number of groups minus 2
p_value = 1 - chi2.cdf(hl_statistic, df_hl)
print(f"\nHosmer-Lemeshow Test:\nChi-square: {hl_statistic}\np-value: {p_value}")
Classification Plot
plt.figure(figsize=(10, 6))
plt.hist(df['predicted'][df['Heart_Disease'] == 0], bins=10, alpha=0.5, label='Class 0 (No Disease)', color='blue')
plt.hist(df['predicted'][df['Heart_Disease'] == 1], bins=10, alpha=0.5, label='Class 1 (Disease Present)', color='red')
plt.xlabel("Predicted Probability")
plt.ylabel("Frequency")
plt.title("Classification Plot - Predicted Probability Distribution by Class")
plt.legend(loc="upper center")
plt.grid(True)
plt.show()
ROC curve
# Predict probabilities for the ROC curve
predicted_probabilities = logit_model.predict(X)
# Calculate ROC curve and AUC
fpr, tpr, _ = roc_curve(y, predicted_probabilities)
roc_auc = roc_auc_score(y, predicted_probabilities)
# Plot the ROC curve
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='blue', label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='grey', linestyle='--') # Diagonal line for reference
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Logistic Regression Model")
plt.legend(loc="lower right")
plt.grid(True)
plt.show()