In the real world, having a perfect database is nearly impossible. During data cleaning and preparation for analysis and modeling, addressing missing data is almost always necessary.
Python makes handling missing data remarkably easy, largely thanks to Pandas: the popular data manipulation library offers a comprehensive suite of tools designed for efficiently detecting and processing missing values.
Searching for Missing Values
In a Pandas DataFrame, missing values are represented as NaN (Not a Number); Python's None is also converted to NaN in numeric columns.
To check for missing values in a DataFrame, use the isna() or isnull() methods. These return a boolean DataFrame where True indicates missing data.
Conversely, the notna() and notnull() methods mark non-missing values as True.
The info() method also provides a summary of the DataFrame, including the count of non-null values in each column, from which missing values can be inferred.
import pandas as pd
# Creating a sample DataFrame
data = {'Col1': [1, 2, None, 4], 'Col2': [None, 2, 3, 4]}
df = pd.DataFrame(data)
# Identifying Missing Values
print(df.isna())
print(df.isna().sum()) # Counts missing values for each column
df.info() # Prints a summary, including non-null counts per column
Visual techniques can also be used to identify missing values, including heatmaps from libraries like matplotlib and seaborn, as well as matrices generated by the missingno library.
import matplotlib.pyplot as plt
import seaborn as sns
import missingno as msno
# heatmap of missing values
sns.heatmap(df.isnull(), cbar=False)
# matrix of missing values
msno.matrix(df)
plt.show()
How should we handle missing values?
To address missing values in a dataset for analysis, we have two main approaches: eliminating rows or columns containing missing data, or replacing missing values with estimated ones (imputation).
Removing Missing Values
Missing values can be removed from the dataset (or, more accurately, the corresponding rows or columns) if they are few in number and their elimination is unlikely to significantly impact the data analysis.
The Python commands for removing missing values are as follows:
• dropna(): removes rows containing missing values (rows are the default target)
• df.dropna(axis=0): removes rows with missing values (equivalent to the default)
• df.dropna(axis=1): removes columns with missing values
• thresh parameter: keeps only rows or columns with at least that many non-missing values (see the sketch after the code below)
# Removes rows with missing values
df_dropped_rows = df.dropna()
# Removes columns with missing values
df_dropped_cols = df.dropna(axis=1)
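Note that thresh is easy to misread: it specifies the minimum number of non-missing values a row or column must contain to be kept, not a count of missing values. A quick sketch on the sample df defined above:
# Keeps only rows with at least 2 non-missing values
df_thresh = df.dropna(thresh=2)
print(df_thresh)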
Generally, if less than 5% of values are missing, we can remove the corresponding rows or columns. However, with a higher percentage of missing values, we must be more cautious, as removing substantial data could skew results. For instances with up to 20% missing data, it’s better to use imputation techniques. Yet, when over 40-50% of the data is missing, even advanced imputation techniques can’t ensure data integrity—in such cases, it’s advisable to eliminate the variable entirely.
However, the specific approach may vary depending on the nature of the data, the research question, and the context of the analysis.
Analyzing patterns of missing values (discussed below) is crucial for determining the best strategy: if values are not missing completely at random (i.e., MAR or NMAR), removing them may introduce significant bias.
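To apply the rules of thumb above, first measure how much data is actually missing; a minimal check of the per-column percentage:
# Percentage of missing values per column
missing_pct = df.isna().mean() * 100
print(missing_pct)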
Imputing Missing Values
In Python, missing values are replaced with estimated values using the fillna() method. Simple imputation techniques include replacing missing data with means, medians, or modes, using fixed values, or applying interpolation methods (such as linear or polynomial).
# Each line below shows an alternative strategy; in practice, pick one
# Replacement with the mean
df['Col1'] = df['Col1'].fillna(df['Col1'].mean())
# Replacement with the median
df['Col1'] = df['Col1'].fillna(df['Col1'].median())
# Replacement with the mode
df['Col1'] = df['Col1'].fillna(df['Col1'].mode()[0])
# Replacement with a fixed value
df['Col1'] = df['Col1'].fillna(0)
# Replacement with interpolation
df = df.interpolate(method='linear')
For improved accuracy, imputation techniques can be more sophisticated. Advanced predictive models—such as linear regression, K-Nearest Neighbors (KNN), or decision trees—are frequently employed to estimate missing values based on other known variables in the dataset.
For instance, you can employ scikit-learn's IterativeImputer class for sophisticated, model-based imputation.
# The experimental import is required to enable IterativeImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
imp = IterativeImputer(max_iter=10, random_state=0)
df_imputed = pd.DataFrame(imp.fit_transform(df), columns=df.columns)
Consider using KNN imputation to replace missing values with estimates based on similar observations.
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=2)
df_imputed_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
Missing Data Patterns
Missing data patterns typically fall into three categories: Missing Completely at Random (MCAR), where data is randomly absent; Missing at Random (MAR), where absence is influenced by other variables; and Not Missing at Random (NMAR), where the absence is non-random and potentially related to the missing value itself.
MCAR (Missing Completely at Random) data are entirely independent, both from other variables and from the values of the variable itself. To assess the presence of this type of missing data, researchers often use Little’s test. This statistical test operates under the null hypothesis that the missing data are MCAR. If Little’s statistic—which follows a Chi-square distribution—is significant, it suggests that the missing data are not MCAR.
Note that Little's test is not built into pandas, scikit-learn, or statsmodels. One option is the third-party pyampute package, which provides an MCARTest class; the sketch below assumes its API (names may differ between versions).
import pandas as pd
import numpy as np
# pyampute is a third-party package: pip install pyampute
from pyampute.exploration.mcar_statistical_tests import MCARTest  # assumed import path
# Creation of a sample dataset
data = {'A': [1, 2, np.nan, 4, 5], 'B': [5, np.nan, 1, 2, 4], 'C': [np.nan, 2, 3, 4, 5]}
df = pd.DataFrame(data)
# Performs Little's test; the null hypothesis is that the data are MCAR
mt = MCARTest(method='little')
p_value = mt.little_mcar_test(df)  # assumed method name
print(p_value)  # p < 0.05 suggests the data are not MCAR
MAR (Missing At Random) data depend on the values of other variables in the dataset, but not on the values within the column where the data is missing.
To determine if data are MAR, we can analyze relationships between variables. A significant association between missing values in one variable and other variables may indicate MAR data.
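As a rough, illustrative check (not a formal test), you can compare the distribution of an observed variable between rows where another variable is missing and rows where it is not. The sketch below simulates a MAR mechanism on synthetic data and applies a t-test from scipy:
import numpy as np
import pandas as pd
from scipy import stats
rng = np.random.default_rng(0)
df_sim = pd.DataFrame({'A': rng.normal(size=200), 'B': rng.normal(size=200)})
# Simulate a MAR mechanism: 'A' tends to be missing when 'B' is large
df_sim.loc[df_sim['B'] > 0.5, 'A'] = np.nan
# Compare 'B' between rows where 'A' is missing and rows where it is observed
missing = df_sim['A'].isna()
t_stat, p_value = stats.ttest_ind(df_sim.loc[missing, 'B'], df_sim.loc[~missing, 'B'])
print(p_value)  # A very small p-value suggests missingness in 'A' depends on 'B'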
With NMAR (Not Missing At Random) data, the missingness depends on the value of the variable itself. For example, higher incomes might go unreported because wealthy individuals are often reluctant to disclose their earnings.
Unlike MCAR data, which can be flagged with Little's test, there is no simple statistical test for MAR or NMAR patterns. Instead, analysts must rely on domain expertise, careful examination of relationships between variables, and visualization to characterize the missingness mechanism. Understanding why data are missing is essential for choosing an appropriate imputation strategy and preserving the validity of subsequent analyses.
MCAR data can be handled by removing rows or columns with missing data or through simple imputation (using mean, median, or mode, for example). As these data are completely random, it’s unlikely that such methods will introduce bias.
MAR data, however, depend on other variables in the dataset. To estimate these values, we must leverage the information contained in these related variables. Thus, it’s preferable to use predictive models such as Multiple Imputation by Chained Equations (MICE), KNN imputation, or regression models.
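As a concrete example, statsmodels ships a MICE implementation; here is a minimal sketch, assuming the MICEData API (update_all runs the chained-equation cycles):
import numpy as np
import pandas as pd
from statsmodels.imputation.mice import MICEData
rng = np.random.default_rng(0)
df_mar = pd.DataFrame(rng.normal(size=(100, 3)), columns=['A', 'B', 'C'])
# Knock out 20% of column 'A' to create missing values
df_mar.loc[df_mar.sample(frac=0.2, random_state=0).index, 'A'] = np.nan
mice_data = MICEData(df_mar)  # builds one imputation model per column with missing values
mice_data.update_all(10)      # runs 10 chained-equation imputation cycles
print(mice_data.data)         # the DataFrame with imputed values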
NMAR data present the most complex challenge, as the probability of a value being missing depends on the value itself. Attempts to estimate NMAR can involve sensitivity models, Bayesian methods, or direct modeling of the data.
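NMAR handling is hard to demonstrate generically, but one simple sensitivity analysis is delta adjustment: impute, then shift the imputed values by a range of offsets and check how the results move. A minimal sketch (the delta values are arbitrary illustrations):
import numpy as np
import pandas as pd
s = pd.Series([1.0, 2.0, np.nan, 4.0, np.nan, 6.0])
for delta in [-1.0, 0.0, 1.0]:
    # Impute with the observed mean, shifted by delta for the missing entries
    imputed = s.fillna(s.mean() + delta)
    print(delta, imputed.mean())  # How sensitive is the mean to the NMAR assumption?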
In the Data Cleaning process, the management of Missing Data plays a crucial role in ensuring the integrity and reliability of datasets. Some gaps can be identified and fixed with straightforward methods, but many situations demand more nuanced approaches, because missingness ranges from completely random absences to patterns tied to other variables or to the missing values themselves. Data scientists and analysts therefore need a diverse toolbox, from simple imputation to advanced statistical models. The goal is not only to fill gaps but to understand the mechanism behind them, which is essential for maintaining the validity of subsequent analyses and drawing accurate conclusions from the dataset.