ANOVA (Analysis of Variance) is a statistical technique used to compare the means of three or more groups and determine if the differences are statistically significant.
When comparing more than two groups, running multiple t-tests would require numerous pairwise comparisons, inflating the risk of Type I errors (false positives). ANOVA avoids this problem by applying a single test to all groups simultaneously.
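The inflation can be quantified: if m independent tests are each run at alpha = 0.05, the probability of at least one false positive is 1 - (1 - alpha)^m. A minimal sketch (illustrative only, and assuming the comparisons are independent, which pairwise t-tests are not exactly):

```python
# Familywise Type I error rate when running all pairwise t-tests
# among k groups, each at alpha = 0.05 (assumes independent tests).
alpha = 0.05
for k in [3, 5, 10]:
    m = k * (k - 1) // 2             # number of pairwise comparisons
    fwer = 1 - (1 - alpha) ** m      # P(at least one false positive)
    print(f"{k} groups -> {m} comparisons, familywise error rate ~ {fwer:.2f}")
```

Even with five groups (10 comparisons), the chance of at least one false positive already approaches 40%.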
Ronald A. Fisher, an English statistician working at an agricultural research center, developed ANOVA to conduct complex experiments. His technique allowed for the partitioning of observed variability by attributing it to different sources.
There are several types of ANOVA:
One-way ANOVA: This is used when there is a single continuous dependent variable and one categorical independent variable with three or more levels. For example, comparing the effects of three or more dosages of an antihypertensive drug on mean arterial pressure. The groups (in this case, individuals taking different drug dosages) must be independent of each other.
Two-way ANOVA: This applies when two categorical independent variables (factors) influence the continuous dependent variable. An example would be evaluating the impact of both diet and physical exercise on a health outcome. As with one-way ANOVA, the groups must be independent.
Repeated measures ANOVA: This is used when the groups are not independent. For instance, measuring blood glucose levels in the same individuals at different intervals after starting various antidiabetic treatments. This allows for the evaluation of treatment effects over time within subjects.
One-way ANOVA
To apply the ANOVA test, certain requirements must be met:
- Normality: The values of the dependent variable must be normally distributed within the groups.
- Homoscedasticity: The variances within the groups must be equal.
- Independence: The observations must be independent of one another.
These requirements can be verified through appropriate statistical tests. For example, normality can be checked both graphically (Q-Q plot) and computationally (Kolmogorov-Smirnov test or Shapiro-Wilk test). Homoscedasticity can be verified with Levene’s test. Independence must be assessed based on the study design and its execution.
The fundamental principle underlying ANOVA is the comparison of the total variability of the dependent variable both between and within groups. This analysis relies on a ratio known as the F statistic, computed by dividing the variance observed between groups by the variance found within groups. The interpretation of this F value is key to understanding the results of ANOVA. When the F value is high, it suggests that the differences observed between the groups are more substantial than those found within the groups themselves. This scenario typically indicates a statistically significant difference among the groups being studied.
To elaborate further, the between-group variance represents the variability of group means around the overall mean, while the within-group variance reflects the average variability of individual observations within each group. By comparing these two sources of variance, ANOVA effectively assesses whether the differences among group means are greater than what would be expected by chance alone. A larger F value, therefore, implies that the independent variable (the factor defining the groups) has a more pronounced effect on the dependent variable, as it accounts for a greater proportion of the total variability in the data.
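The decomposition described above can be sketched numerically. The following hand computation of the F statistic uses three small hypothetical groups (the values are made up for illustration), partitioning the total variability into between-group and within-group sums of squares:

```python
# Hand computation of the one-way ANOVA F statistic for three small
# hypothetical groups (values are made up for illustration).
groups = {
    "low":  [5.1, 4.9, 5.4, 5.0],
    "mid":  [6.2, 6.0, 6.5, 6.1],
    "high": [7.1, 6.8, 7.4, 7.0],
}

all_values = [v for g in groups.values() for v in g]
grand_mean = sum(all_values) / len(all_values)

# Between-group sum of squares: group means around the grand mean
ss_between = sum(
    len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
)
# Within-group sum of squares: observations around their own group mean
ss_within = sum(
    (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
)

k = len(groups)                      # number of groups
n = len(all_values)                  # total number of observations
ms_between = ss_between / (k - 1)    # between-group variance
ms_within = ss_within / (n - k)      # within-group variance
F = ms_between / ms_within
print(f"F = {F:.2f}")                # large F -> between-group differences dominate
```

Here the group means are far apart relative to the spread within each group, so the F value comes out very large, which is exactly the scenario that leads to rejecting the null hypothesis.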
The null hypothesis (H0) of the ANOVA test states that there are no significant differences between the groups. The alternative hypothesis (H1) suggests that significant differences exist between the groups, with at least one group differing from the others. However, the initial test does not identify which specific group or groups are different.
Post-hoc tests
To address one of the ANOVA test’s limitations—its inability to indicate which group differs significantly from others—researchers can perform post-hoc tests. These tests compare pairs of groups, identifying specific differences.
The choice of post-hoc test depends on the homogeneity of variance between groups, which is determined using Levene’s test.
For homogeneous variance (non-significant Levene’s test), options include:
- Tukey’s Honestly Significant Difference (Tukey HSD): Widely used and effective
- Scheffé’s test: Less powerful than Tukey’s HSD but more conservative, reducing false positives
- Duncan’s test: Less conservative than Tukey’s HSD, offering greater power at the cost of a higher false-positive risk
For non-homogeneous variance (significant Levene’s test), researchers can apply:
- Games-Howell test: Robust for unequal variances and different group sizes
- Dunnett’s T3 test
- Tamhane’s T2 test
Typically, Tukey’s HSD is preferred for homogeneous variance, while the Games-Howell test is favored when variance is not homogeneous.
Contrast analysis
Contrast analysis is another technique used when comparing multiple groups. After an ANOVA test reveals significant differences between groups, researchers may want to explore the relationships among specific groups. This analysis, known as contrast, offers more precision than generic post-hoc tests.
To illustrate, consider a study testing various drugs (groups, independent variable) for treating arterial hypertension (target). An ANOVA test might show that the drugs are effective overall, but it won’t indicate which drug is most effective. Post-hoc tests can compare drugs in pairs to identify the most effective ones. However, these tests compare all groups, increasing the number of comparisons and the risk of Type I errors.
Contrast analysis allows researchers to propose a specific hypothesis in advance—for instance, that one drug is more effective than the others. By reducing the number of comparisons, this approach yields greater statistical power than post-hoc tests.
Contrast analysis involves assigning numerical coefficients to the groups, weighting the groups being compared against one another. The sum of these coefficients must equal 0.
Consider a study testing drugs A, B, C, and D for blood pressure. To assess drug A’s effectiveness against the others, we might assign it a coefficient of 3 and the others -1 each (3, -1, -1, -1, summing to zero). Alternatively, to compare the efficacy of drugs A and B against C and D, we could assign coefficients of 1 to A and B, and -1 to C and D (1, 1, -1, -1, also summing to zero).
This analysis yields an F statistic, similar to the one previously described, based on the contrast’s variance.
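The contrast computation can be sketched from summary statistics alone. In the snippet below, the group means, the group size, and the within-group variance are all hypothetical values (not from a real study); the F statistic for a single contrast with one numerator degree of freedom is L² divided by its variance:

```python
# Minimal sketch of a single-contrast F test using summary statistics.
# Group means, group size, and within-group variance below are all
# hypothetical values, not taken from a real study.
means = {"A": 5.0, "B": 6.1, "C": 7.0, "D": 6.5}    # mean outcome per group
n_per_group = 30                                     # observations per group
ms_within = 1.6                                      # assumed within-group variance

# Contrast: drug A against the average of B, C, and D
coeffs = {"A": 3, "B": -1, "C": -1, "D": -1}
assert sum(coeffs.values()) == 0                     # coefficients must sum to zero

L = sum(coeffs[g] * means[g] for g in means)         # contrast estimate
se2 = ms_within * sum(c ** 2 for c in coeffs.values()) / n_per_group
F_contrast = L ** 2 / se2                            # F statistic, 1 numerator df
print(f"Contrast estimate L = {L:.2f}, F = {F_contrast:.2f}")
```

Swapping in the coefficients (1, 1, -1, -1) would instead test drugs A and B against C and D with the same machinery.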
Bonferroni test
The Bonferroni test, often mentioned among post-hoc tests, is more accurately described as a correction. It adjusts p-value thresholds to determine significance in multiple comparisons.
Consider a study testing five drugs for hypertension control. While an ANOVA test might indicate overall effectiveness, it doesn’t specify which drug is effective. This necessitates a post-hoc test with multiple pairwise comparisons—in this case, 10 comparisons.
If we’ve set a p-value threshold of 0.05, the Bonferroni correction would only deem significant those comparisons with a p-value below 0.05/10 = 0.005.
Essentially, the Bonferroni correction divides the significance level (alpha) by the number of comparisons: adjusted alpha = alpha / m, where m is the number of comparisons.
The Bonferroni correction is useful for multiple comparisons to prevent “Type I error inflation.” However, it comes with a trade-off: by lowering the significance level, it reduces statistical power, increasing the risk of Type II errors. This issue becomes more pronounced as the number of comparisons grows, resulting in an extremely small Bonferroni alpha value.
Consequently, researchers often prefer alternative post-hoc tests that achieve a better balance between statistical power and error control.
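As a quick numerical sketch, here is the correction applied to hypothetical p-values from the 10 pairwise comparisons in the five-drug example (the p-values are invented for illustration):

```python
# Bonferroni correction applied to hypothetical p-values from the
# 10 pairwise comparisons among five drugs (values are made up).
p_values = [0.001, 0.004, 0.012, 0.030, 0.048,
            0.060, 0.150, 0.300, 0.450, 0.800]

alpha = 0.05
m = len(p_values)
alpha_bonferroni = alpha / m        # 0.05 / 10 = 0.005

significant = [p for p in p_values if p < alpha_bonferroni]
print(f"Adjusted threshold: {alpha_bonferroni:.4f}")
print(f"Comparisons still significant: {len(significant)} of {m}")
```

Note how comparisons that would pass the uncorrected 0.05 threshold (e.g. 0.012 or 0.030) no longer count as significant, which illustrates the loss of power described above.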
Below is a Python program that performs one-way ANOVA. The code generates a synthetic dataset with four treatment groups (A, B, C, D) to evaluate their effectiveness on a specific outcome. After visualizing the data, the program conducts the ANOVA test using SciPy, performs a post-hoc analysis using Tukey’s HSD (via statsmodels), and carries out a contrast analysis. This analysis compares treatment A against B, C, and D, and then compares A and B versus C and D. Finally, the code includes statistical techniques to verify the assumptions of normality and homoscedasticity.
import pandas as pd
import numpy as np
import scipy.stats as stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as mc
import matplotlib.pyplot as plt
import seaborn as sns
# Generating a synthetic medical dataset
np.random.seed(42)
data = {
    'Treatment': np.repeat(['A', 'B', 'C', 'D'], 30),
    'Outcome': np.concatenate([
        np.random.normal(5, 1, 30),     # Group A
        np.random.normal(6, 1.2, 30),   # Group B
        np.random.normal(7, 1.5, 30),   # Group C
        np.random.normal(6.5, 1.3, 30)  # Group D
    ])
}
df = pd.DataFrame(data)
# Visualization of Data with Boxplot
plt.figure(figsize=(10, 6))
sns.boxplot(x='Treatment', y='Outcome', data=df)
plt.title('Outcome by Treatment Group')
plt.xlabel('Treatment Group')
plt.ylabel('Outcome')
plt.show()
# One-Way ANOVA
groups = [df[df['Treatment'] == group]['Outcome'] for group in df['Treatment'].unique()]
F_stat, p_value = stats.f_oneway(*groups)
print(f'ANOVA F-statistic: {F_stat:.2f}, p-value: {p_value:.4f}')
# Interpreting ANOVA results
if p_value < 0.05:
    print("There is a significant difference between treatment groups.")
else:
    print("No significant difference found between treatment groups.")
# Post-hoc Analysis (Tukey's HSD)
comp = mc.MultiComparison(df['Outcome'], df['Treatment'])
tukey_result = comp.tukeyhsd()
print(tukey_result)
# Plotting Tukey HSD results
tukey_result.plot_simultaneous(figsize=(10, 6))
plt.title("Tukey's HSD - Confidence Intervals for Mean Differences")
plt.xlabel('Mean Difference')
plt.show()
# Contrast Analysis
# Define contrasts on the group means: Group A vs. the others, and Groups A & B vs. Groups C & D
contrast_matrix = {
    'A_vs_Others': [3, -1, -1, -1],
    'A_B_vs_C_D': [1, 1, -1, -1]
}
# Fit a cell-means model (no intercept) so each coefficient equals a group mean;
# the contrast weights then apply directly to the group means
ols_means = smf.ols('Outcome ~ C(Treatment) - 1', data=df).fit()
for contrast_name, contrast_weights in contrast_matrix.items():
    print(f"\nContrast: {contrast_name}")
    contrast_result = ols_means.t_test(contrast_weights)
    print(contrast_result.summary())
# Residual Analysis to check for ANOVA assumptions
# Residual Plot
ols_model = smf.ols('Outcome ~ C(Treatment)', data=df).fit()
residuals = ols_model.resid
fitted_values = ols_model.fittedvalues
plt.figure(figsize=(10, 6))
sns.residplot(x=fitted_values, y=residuals, lowess=True, line_kws={'color': 'red', 'lw': 2})
plt.axhline(0, linestyle='--', color='black')
plt.title('Residual Plot')
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.show()
# Checking normality of residuals with histogram
plt.figure(figsize=(10, 6))
sns.histplot(residuals, kde=True)
plt.title('Histogram of Residuals')
plt.xlabel('Residuals')
plt.ylabel('Density')
plt.show()
# Shapiro-Wilk test for normality of residuals
shapiro_test = stats.shapiro(residuals)
print(f'Shapiro-Wilk test statistic: {shapiro_test.statistic:.4f}, p-value: {shapiro_test.pvalue:.4f}')
if shapiro_test.pvalue < 0.05:
    print("Residuals are not normally distributed.")
else:
    print("Residuals are normally distributed.")
# Test for Homogeneity of Variances (Levene's Test)
levene_test = stats.levene(*groups)
print(f'Levene test statistic: {levene_test.statistic:.4f}, p-value: {levene_test.pvalue:.4f}')
if levene_test.pvalue < 0.05:
    print("The variances are significantly different between groups (no homogeneity). Consider using alternative tests.")
else:
    print("The variances are homogeneous across the groups.")