Introduction to Encoding
Most machine learning algorithms operate on numerical data and cannot directly interpret categorical data, such as strings and alphanumeric characters. They perform mathematical calculations (addition, subtraction, distance calculation) that are inapplicable to categories like “red,” “big,” or “male.” To incorporate variables using these categories into calculations, we must convert them into numbers: this process is called encoding of categorical variables.
If categorical variables remain untransformed, the model may reject them with an error, ignore them, or use them incorrectly, leading to distorted results. For instance, in a linear regression model, an untransformed categorical variable prevents the model from determining that variable’s impact on the target.
To assist with these operations, the Python library scikit-learn provides several encoders. This article covers two of them: LabelEncoder and OneHotEncoder.
LabelEncoder
The LabelEncoder class transforms categorical variables by assigning a unique integer to each category.
For example, let’s consider a “color” variable that we want to convert into a numeric format.
from sklearn.preprocessing import LabelEncoder
data = ['red', 'green', 'blue', 'green', 'red', 'blue']
le = LabelEncoder()
encoded_data = le.fit_transform(data)
print(encoded_data)
Let’s examine the code in detail.
The first instruction:
le = LabelEncoder()
creates an instance of the LabelEncoder class. From this moment on, all the features (methods and attributes) of the LabelEncoder class will be available to the instance (le) and can be used independently.
The next instruction passes the variable to be encoded to our LabelEncoder object, which we named “le”. The object learns the distinct categories in the data variable, assigns an integer to each category, and returns the encoded values, which we store in the encoded_data variable.
encoded_data = le.fit_transform(data)
Here, we’ve combined two instructions: fit and transform. These could be used separately for clarity.
The fit method trains the “le” object to transform the categorical variable into numerical data without actually performing the transformation. It learns and stores the mapping from each category to an integer.
Once fit is executed, “le” can transform any variable containing the categories it was trained on.
The transform method then performs the actual conversion, turning categories into numbers.
If the variable used for training is the same one you want to transform, you can use fit_transform() to complete this process in a single step.
To clarify these concepts, let’s imagine we have three variables: color1, color2, and color3. We initialize an object from LabelEncoder and train it with color1. At this point, we can use it to transform color2 but not color3, which generates an error because it contains a category (“brown”) not included in the variable used for training.
color1 = ['red', 'green', 'blue', 'green', 'red', 'blue']
color2 = ['green', 'red', 'red', 'blue']
color3 = ['blue', 'brown', 'green', 'red', 'blue']
# Create an instance of LabelEncoder named color
color = LabelEncoder()
# Fit the model with color1
color.fit(color1)
# Apply the transformation to color2
color2_transformed = color.transform(color2)
print(color2_transformed)
# Output will be [1 2 2 0]
# Apply the transformation to color3
color3_transformed = color.transform(color3)
# Error message: ValueError: y contains previously unseen labels: 'brown'
LabelEncoder assigns unique numerical values to each category. For instance, if a variable contains three categories (such as three different colors), each category will be assigned a distinct numerical value (typically 0, 1, and 2).
However, numerical formats inherently imply an order (2 > 1 > 0). This transformation introduces a potential error if the categories don’t have a natural order—green isn’t “greater than” red or blue. The model might mistakenly interpret them as ordered.
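Therefore, LabelEncoder is best applied to variables with only two categories or to ordinal variables whose sorted labels happen to match their natural order. Keep in mind that LabelEncoder numbers categories in sorted (alphabetical) order, not in an order you specify, and that scikit-learn documents it as a tool for encoding target labels. For ordinal input features, OrdinalEncoder lets you state the category order explicitly.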
Despite this limitation, the encoder remains extremely easy to use.
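To make the ordering issue concrete, here is a minimal sketch comparing LabelEncoder with scikit-learn's OrdinalEncoder on a hypothetical “size” variable. The variable names (sizes, le_sizes, oe) are illustrative and not part of the examples above.

```python
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ['small', 'large', 'medium', 'small']

# LabelEncoder sorts the categories alphabetically: large=0, medium=1, small=2
le_sizes = LabelEncoder()
label_codes = le_sizes.fit_transform(sizes)
print(label_codes)  # [2 0 1 2]

# OrdinalEncoder accepts an explicit category order and expects 2D input
oe = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordinal_codes = oe.fit_transform([[s] for s in sizes])
print(ordinal_codes.ravel())  # [0. 2. 1. 0.]
```

With the explicit order, small < medium < large is preserved in the codes, which is not the case with the alphabetical numbering.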
LabelEncoder also offers a method for performing the inverse operation, transforming numerically encoded data back into its original categories. Here’s an example of how it’s applied:
data = ['red', 'green', 'blue', 'green', 'red', 'blue']
le = LabelEncoder()
data_encoded = le.fit_transform(data)
print(data_encoded) # output [2 1 0 1 2 0]
data_inverse = le.inverse_transform(data_encoded)
print(data_inverse)# output ['red' 'green' 'blue' 'green' 'red' 'blue']
After training (fit) the LabelEncoder, a useful attribute called classes_ is generated. This attribute displays the categories that were learned during training, arranged in alphabetical order. To illustrate this, let’s continue with our previous example:
print(le.classes_)
# output ['blue' 'green' 'red']
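The classes_ attribute can also help avoid the “unseen labels” error shown earlier. The following is one possible sketch, not the only approach: it compares new data against classes_ before calling transform. The names encoder, unseen, and known_values are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

color1 = ['red', 'green', 'blue', 'green', 'red', 'blue']
color3 = ['blue', 'brown', 'green', 'red', 'blue']

encoder = LabelEncoder()
encoder.fit(color1)

# Categories present in the new data but absent from the fitted encoder
known = set(encoder.classes_)
unseen = sorted(set(color3) - known)
print(unseen)  # ['brown']

# Transform only the values the encoder knows about
known_values = [c for c in color3 if c in known]
print(encoder.transform(known_values))  # [0 1 2 0]
```

Whether to drop, relabel, or refit on unseen categories depends on the project; the check simply makes the problem visible before transform raises an error.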
OneHotEncoder
OneHotEncoder transforms a categorical variable into a series of binary variables. The number of new variables equals the number of categories in the original variable, with each taking on a value of either 0 or 1.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
data = np.array(['red', 'green', 'blue', 'green', 'red', 'blue']).reshape(-1,1)
ohe = OneHotEncoder(sparse_output=False)
encoded_data = ohe.fit_transform(data)
print(encoded_data)
In this case, reshaping the variable into a two-dimensional array (one row per sample, one column per feature) is necessary, because OneHotEncoder expects 2D input. The result of the transformation will be:
[[0. 0. 1.] # red
[0. 1. 0.] # green
[1. 0. 0.] # blue
[0. 1. 0.] # green
[0. 0. 1.] # red
[1. 0. 0.]] # blue
Given the three colors in the original variable, the transformation produces a two-dimensional array with three columns. In this array, the first column corresponds to blue (1 for blue, 0 for other colors), the second to green, and the third to red.
OneHotEncoder allows for independent encoding of each category in the original variable. This prevents introducing a potential “order” in the variables that could compromise the Machine Learning model.
However, it also introduces as many columns as there are categories in the initial variable. This “dimensionality problem” can become significant: if there are 100 categories in the initial variable, 100 new columns will be added to the dataset, potentially making the model inefficient.
Multicollinearity
Another issue with OneHotEncoder is the potential creation of multicollinearity. Each category generates a variable represented by a column that takes a value of 1 if the category is present and 0 if absent. Consequently, knowing the values of all generated columns except one allows prediction of the last column’s value.
This phenomenon, where one column depends on the others, is known as “perfect multicollinearity.”
Multicollinearity mainly affects linear models: it makes coefficient estimates unstable and difficult to interpret, and it can degrade accuracy and contribute to overfitting.
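The dependence between columns can be verified numerically. This sketch uses only NumPy and the one-hot rows from the color example; the intercept column is an assumption that mirrors what a linear regression with an intercept would add.

```python
import numpy as np

# One-hot columns for blue, green, red (rows from the color example)
onehot = np.array([
    [0, 0, 1],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
], dtype=float)

# Every row sums to 1, so any column equals 1 minus the sum of the others
print(onehot.sum(axis=1))  # [1. 1. 1. 1. 1. 1.]

# With an intercept column, the matrix has 4 columns but only rank 3
design = np.column_stack([np.ones(len(onehot)), onehot])
print(design.shape[1], np.linalg.matrix_rank(design))  # 4 3
```

A rank lower than the number of columns means one column carries no new information.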
To mitigate this issue, one approach is to remove one of the columns using the parameter drop='first', which drops the column of the first category.
ohe = OneHotEncoder(drop='first', sparse_output=False)
Dense or Sparse Output
OneHotEncoder offers another parameter that determines the output type: Dense or Sparse. A Dense output produces a complete matrix that stores all values, including zeros. In contrast, a Sparse matrix stores only the non-zero values and their positions.
By default, OneHotEncoder generates a sparse matrix (sparse_output=True). This approach is more memory-efficient but can be trickier to manipulate, particularly when converting results to a dataframe. For a dense output, you’ll need to set sparse_output=False. (In scikit-learn versions before 1.2, this parameter was named sparse.)
OneHotEncoder also offers the inverse_transform() method, similar to LabelEncoder, which is useful for decoding data after model use.
To identify the learned categories, use the categories_ attribute. This produces a list of arrays, one per encoded column, in contrast to the classes_ attribute in LabelEncoder, which is a single array.
data = np.array(['red', 'green', 'blue', 'green', 'red', 'blue']).reshape(-1,1)
ohe = OneHotEncoder(sparse_output=False)
ohe.fit(data)
print(ohe.categories_)
#output [array(['blue', 'green', 'red'], dtype='<U5')]
OneHotEncoder’s get_feature_names_out() method is another valuable tool. It returns the names of the newly generated columns, which is particularly useful when restructuring a dataframe to include these new columns.
new_columns = ohe.get_feature_names_out(["data"])
print(new_columns)
#output = ['data_blue' 'data_green' 'data_red']
Column Transformer
In preparing datasets for Machine Learning models, we often encounter multiple categorical variables that need numerical transformation using OneHotEncoder. To streamline this process and avoid repetitive transformations, we can employ ColumnTransformer. This tool allows us to apply different encoders to specific columns efficiently. By using ColumnTransformer, we create a pipeline that enables us to apply the desired encoder type to each specified column, simplifying our data preparation workflow.
from sklearn.compose import ColumnTransformer
# Definition of transformations to apply
transformer = ColumnTransformer(transformers=[
('onehot_color', OneHotEncoder(sparse_output=False), ['Color']),
('onehot_size', OneHotEncoder(sparse_output=False), ['Size'])
])
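ColumnTransformer doesn’t directly support LabelEncoder, because LabelEncoder is designed for a single one-dimensional array of labels rather than for feature columns, so you’ll need to apply it separately when required.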
Encoders and Medical Datasets
Encoder techniques play a vital role in medical dataset analysis for several reasons.
Medical diagnostic coding relies heavily on hierarchical classification structures (ICD-10, SNOMED-CT). Since algorithms cannot process these structures directly, they must be transformed into a suitable format.
Medical databases frequently contain fields with nominal values lacking inherent order—such as blood types, genetic phenotypes, and surgical procedure codes. Encoding techniques prevent the unintended introduction of ordinal relationships between categories in these instances.
In medicine, ordinal rating scales (such as the Glasgow Coma Scale and APACHE II) are commonly used and can be effectively handled with label encoding, provided the assigned integers follow the scale’s inherent order.
However, applying encoding techniques to medical datasets presents specific challenges.
In particular, when medical datasets contain features with numerous distinct values, encoding can produce large sparse matrices that require dimensionality reduction.
Encoders must enable tracing back to the original data organization for proper model interpretation.
Encoding techniques should also respect the hierarchical orders typical of certain forms of coding.
When encoding medical datasets, comprehensive traceability systems and careful preservation of hierarchical relationships are essential.
Traceability requires several key mechanisms: bidirectional mapping dictionaries for forward and reverse transformations, detailed metadata documentation of transformation parameters, step-by-step transformation logs, and strategic retention of original columns for validation.
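As a rough sketch of the bidirectional mapping idea, the dictionaries below are built from a fitted LabelEncoder and stored as JSON metadata. The column name blood_type, the metadata layout, and the variable names are assumptions made for this example.

```python
import json
from sklearn.preprocessing import LabelEncoder

blood_types = ['A', 'O', 'B', 'AB', 'O', 'A']

bt_encoder = LabelEncoder()
codes = bt_encoder.fit_transform(blood_types)

# Forward and reverse dictionaries built from the fitted classes_
forward_map = {str(label): int(code) for code, label in enumerate(bt_encoder.classes_)}
reverse_map = {code: label for label, code in forward_map.items()}

# The mapping can be stored next to the dataset as transformation metadata
metadata = json.dumps({'column': 'blood_type', 'encoder': 'LabelEncoder', 'mapping': forward_map})

print(forward_map)  # {'A': 0, 'AB': 1, 'B': 2, 'O': 3}
print([reverse_map[int(c)] for c in codes] == blood_types)  # True
```

Saving the mapping alongside the data lets a later reader decode model inputs without access to the original encoder object.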
Hierarchical relationships must be maintained through specialized approaches: multilevel encoding schemes that respect medical classification structures, entity embedding techniques for complex category relationships, and carefully selected combinations of encoding methods that address specific data characteristics.
These practices ensure data integrity, reproducibility, and accurate interpretation of medical model results.
A Complete Guide to Encoder Implementation
Library imports
import pandas as pd # for managing datasets
from sklearn.preprocessing import LabelEncoder, OneHotEncoder # for encodings
from sklearn.compose import ColumnTransformer # for handling multiple columns with different encodings
Creating an example dataset
data = {
'Color': ['red', 'green', 'blue', 'yellow', 'green', 'red'],
'Size': ['small', 'large', 'medium', 'small', 'large', 'medium'],
'Category': ['A', 'B', 'A', 'C', 'B', 'A']
}
# We transform the data into a DataFrame
df = pd.DataFrame(data)
print(df)
Color Size Category
0 red small A
1 green large B
2 blue medium A
3 yellow small C
4 green large B
5 red medium A
Handling missing data (missing values can break or complicate encoding)
# Check for missing values
print(df.isnull().sum())
# Replace missing values (if necessary)
df.fillna('unknown', inplace=True)
Applying LabelEncoder for Simple Encoding
# Initializing the LabelEncoder
le = LabelEncoder()
# Encoding the 'Color' column
df['Color_encoded'] = le.fit_transform(df['Color'])
print(df[['Color', 'Color_encoded']])
Color Color_encoded
0 red 2
1 green 1
2 blue 0
3 yellow 3
4 green 1
5 red 2
Applying OneHotEncoder for Simple Encoding
# Initializing the OneHotEncoder
ohe = OneHotEncoder(sparse_output=False)
# Encoding the 'Size' column
encoded = ohe.fit_transform(df[['Size']])
# Converting the result to a DataFrame
encoded_df = pd.DataFrame(encoded, columns=ohe.get_feature_names_out(['Size']))
print(encoded_df)
Size_large Size_medium Size_small
0 0.0 0.0 1.0
1 1.0 0.0 0.0
2 0.0 1.0 0.0
3 0.0 0.0 1.0
4 1.0 0.0 0.0
5 0.0 1.0 0.0
Demonstrating the Use of ColumnTransformer
from sklearn.compose import ColumnTransformer
# Definition of transformations to apply
transformer = ColumnTransformer(
    transformers=[
        ('onehot_color', OneHotEncoder(sparse_output=False), ['Color']),
        ('onehot_size', OneHotEncoder(sparse_output=False), ['Size']),
    ],
    verbose_feature_names_out=False,  # keep names like 'Color_blue' without a transformer prefix
)
# ColumnTransformer doesn't directly support LabelEncoder, so we encode 'Category' manually.
df['Category_encoded'] = le.fit_transform(df['Category'])
# We apply OneHotEncoder to the rest
encoded_features = transformer.fit_transform(df[['Color', 'Size']])
# We convert the features into a DataFrame and merge everything.
# Column names come from the transformer and follow the sorted category order.
encoded_df = pd.DataFrame(encoded_features, columns=transformer.get_feature_names_out())
final_df = pd.concat([df[['Category_encoded']], encoded_df], axis=1)
print(final_df)
Summary of encoders
Aspect | Label Encoding | One-Hot Encoding |
---|---|---|
Definition | Converts categories into integer labels | Converts categories into binary columns |
Output Format | Single integer column | Multiple binary columns, one for each category |
Best for | Ordinal data (where order matters) | Nominal data (where there is no intrinsic order) |
Memory Usage | Low, as it only creates one column | High, as it creates one column per category |
Risk of Multicollinearity | No, single column without redundant information | Yes, without dropping one category (use drop='first' option) |
Model Interpretation | Can mislead some models into thinking categories are ordered | Avoids ordinal assumptions in models |
Transformability | Works on a single column | Can work on multiple columns simultaneously |
Inverse Transformation | Possible with inverse_transform | Possible with inverse_transform |
Use Case Examples | Ordinal features like “small”, “medium”, “large” (after checking that the assigned integers follow the intended order) | Nominal features like “red”, “green”, “blue” |
Further Reading
TowardsDataScience: How to Encode Medical Records for Deep Learning
Diva-portal: Encoding Temporal Healthcare Data for Machine Learning
Conclusion
Encoders are essential tools in data preprocessing for machine learning. LabelEncoder excels at handling ordinal or binary variables, providing simple and memory-efficient encoding. OneHotEncoder, by contrast, is perfect for nominal variables that lack inherent order—though it does require more storage space.
The choice between these encoders depends on four key factors:
The nature of the data (ordinal vs. nominal)
Project memory constraints
The risk of multicollinearity
The need for model interpretability
When working with complex datasets containing multiple categorical variables, ColumnTransformer provides an elegant solution by enabling simultaneous application of different encoding types.