Character encoding is the process of assigning a unique numeric code to each character, enabling computers to store and exchange text in a standardized and unambiguous manner.
Various encoding systems have been developed over time and across different regions. These systems often lack compatibility with one another, have space limitations (and consequently character limitations), and may assign the same code to different characters.
To address these issues, the Unicode system was developed in the late 1980s. It aimed to create a universal encoding covering the characters used in all of the world's languages, as well as symbols and emojis.
The first Unicode standard, published in 1991, included around 7,000 characters. Today it encompasses well over 140,000 characters and is adopted by virtually all operating systems, programming languages, and communication systems.
Despite Unicode’s introduction, many older encodings, as well as some created later, remain in use. This persistence is due to legacy systems that rely on pre-Unicode encodings, the higher memory requirements of some Unicode encodings, and the preference for optimized encodings in specific contexts (such as for Chinese or Japanese text).
As a result, pre-Unicode encodings and Unicode-implementing systems currently coexist. The most widely used ones are listed below.
ASCII (American Standard Code for Information Interchange)
ASCII, one of the earliest encodings, comprises 128 characters. It’s primarily suited for the English language, as it doesn’t include accented or special characters.
ISO-8859-1 (Latin-1)
ISO-8859-1 can be considered an extension of ASCII encoding that includes accented letters. It’s suitable for many European languages and encompasses 256 characters. However, it still lacks some accented characters used in certain languages.
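A quick way to see the difference between ASCII and Latin-1 is to encode an accented character with both; the snippet below is purely illustrative:
# 'A' is part of ASCII, so encoding it works
print('A'.encode('ascii'))       # b'A'
# 'é' is not part of ASCII, so this raises an error
try:
    'é'.encode('ascii')
except UnicodeEncodeError as e:
    print(f"ASCII cannot represent 'é': {e}")
# ISO-8859-1 (Latin-1) does include 'é'
print('é'.encode('latin-1'))     # b'\xe9'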
UTF-8 (Unicode Transformation Format 8-bit)
UTF-8 is the most widespread encoding. It uses 1 to 4 bytes per character, covering not only ASCII characters but also characters from other scripts such as Chinese and Arabic, as well as symbols. It is fully compatible with the Unicode standard.
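As a rough illustration, we can check how many bytes UTF-8 uses for different kinds of characters (the sample characters are arbitrary):
# UTF-8 uses a variable number of bytes per character
for ch in ['A', 'é', '中', '😀']:
    print(ch, len(ch.encode('utf-8')), 'byte(s)')
# 'A' -> 1, 'é' -> 2, '中' -> 3, '😀' -> 4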
UTF-16
UTF-16 uses 2 or 4 bytes to represent each character. It’s less efficient than UTF-8 for English and some Latin texts and is used internally by Windows, among other systems.
UTF-32
UTF-32 has a fixed length of 4 bytes per character. It’s simpler than the previous encodings but less space-efficient.
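To compare the space efficiency of the three UTF encodings, we can encode the same short string with each of them (the sample text is arbitrary; note that Python's utf-16 and utf-32 codecs also prepend a byte-order mark):
text = "Hello, world"
for enc in ('utf-8', 'utf-16', 'utf-32'):
    print(enc, len(text.encode(enc)), 'bytes')
# utf-8: 12 bytes, utf-16: 26 bytes (2 per character plus a 2-byte BOM),
# utf-32: 52 bytes (4 per character plus a 4-byte BOM)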
The existence of multiple encoding systems can cause significant problems. If you try to open a text file using a different encoding from the original, you may see strange or illegible characters.
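This effect (often called mojibake) is easy to reproduce in Python with a purely illustrative string:
# Text encoded as UTF-8 but decoded as Latin-1 becomes garbled
original = "café"
garbled = original.encode('utf-8').decode('latin-1')
print(garbled)   # cafÃ©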
To detect the encoding of a text file in Python, you can use two libraries: chardet and charset-normalizer. Both need to be installed with pip (pip install chardet charset-normalizer).
Opening a file in binary mode allows chardet to analyze its content and determine the encoding:
import chardet
# Open the file in binary mode to read its content
with open('example_text.txt', 'rb') as f:
    content = f.read()
# Use chardet to detect the encoding
result = chardet.detect(content)
# Print the detected encoding
print(f"Detected encoding: {result['encoding']}")
When using charset-normalizer, the process differs slightly, since you don’t need to open the file yourself:
from charset_normalizer import from_path
# Detect the encoding of the file
result = from_path('example_text.txt').best()
# Show the detected encoding
print(f"Detected encoding: {result.encoding}")
Once chardet has detected the original encoding, we can also convert the text to another encoding, such as UTF-8:
import chardet
# Step 1: Read the file content in binary mode
with open('example_text.txt', 'rb') as f:
    content = f.read()
# Step 2: Detect the original encoding using chardet
result = chardet.detect(content)
original_encoding = result['encoding']
# Step 3: Decode the content using the detected encoding
decoded_text = content.decode(original_encoding)
# Step 4: Re-encode the text as UTF-8 and save it to a new file
with open('example_text_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(decoded_text)
print("File has been successfully converted to UTF-8.")
The chardet library can detect the character encoding of plain-text files (.txt) as well as HTML, XML, and CSV files. However, its detection is less effective for text embedded in binary files. For databases, it may be necessary to extract the textual data first (for instance, by using SQL queries to dump the text to files) before applying chardet to the extracted content.
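Since chardet.detect works on any bytes object, not just file contents, the same approach applies to text extracted from other sources. A minimal sketch (the sample bytes simply stand in for data pulled from a database):
import chardet
# raw_bytes represents text obtained from some other source,
# e.g. a column dumped from a database
raw_bytes = 'año de prueba'.encode('latin-1')
result = chardet.detect(raw_bytes)
print(result['encoding'], result['confidence'])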