
Data representation (II): Character encoding

16 Dec 2014   |   iker hurtado  
With this entry I intend, once and for all, to understand the main character sets (especially Unicode) in detail and to write it down so I can come back to it later. For very detailed and complete information, see the links provided below.

In computer memory, characters are "encoded" (or "represented") using a character encoding scheme (also called a character set, charset or character map), which must be known before a binary pattern can be interpreted.

I'm interested in the most commonly used character encoding schemes in Western cultures: 7-bit ASCII, the 8-bit Latin-x (ISO/IEC 8859-x) series for Western European characters, and Unicode for internationalization (i18n).

7-bit/US ASCII Code

The ASCII code is one of the earliest character encoding schemes. It was originally a 7-bit code, but it has since been extended to 8 bits to better fit computer memory organization. This scheme represents the basic English characters and some additional symbols.
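
As a quick illustration, a character literal in C and its ASCII code number are the same value, so the code of a character can be printed directly (a minimal sketch):

#include <stdio.h>

int main(void)
{
    char c = 'A';
    /* 'A' is ASCII code 65 decimal, 41 hex */
    printf("'%c' = %d (decimal) = %X (hex)\n", c, c, (unsigned int)c);
    return 0;
}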

ISO/IEC 8859 series and Latin-1 (ISO-8859-1)

ISO/IEC 8859 is a collection of 8-bit character encoding standards for Western languages.

ISO/IEC 8859-1, or Latin-1 for short, is one of them. It's the most commonly used encoding scheme for Western European languages, supporting English, German, Italian, Portuguese and Spanish, among others.

Latin-1 is backward compatible with the 7-bit US-ASCII code: the first 128 characters in Latin-1 are the same as in US-ASCII. The upper code numbers are allotted to less common symbols like £, ó or ±. (Note that the euro sign € is not part of Latin-1; it arrived later, in ISO/IEC 8859-15 and in Windows-1252.)

Windows-1252

Windows-1252 (or CP-1252) is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows for English and some other Western languages.

This character encoding is a superset of ISO 8859-1: it differs from ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.
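
As an illustration, here is a minimal C sketch with a small hand-picked sample of that range; the mapped code points come from the published Windows-1252 table:

#include <stdio.h>

/* A few bytes in the 80-9F range that are printable characters in
   Windows-1252 but C1 control codes in ISO-8859-1 */
static const struct {
    unsigned char byte;
    unsigned int codepoint;
    const char *name;
} sample[] = {
    { 0x80, 0x20AC, "euro sign" },
    { 0x85, 0x2026, "horizontal ellipsis" },
    { 0x93, 0x201C, "left double quotation mark" },
    { 0x94, 0x201D, "right double quotation mark" },
};

int main(void)
{
    for (int i = 0; i < 4; i++)
        printf("CP-1252 byte %02X -> U+%04X (%s)\n",
               sample[i].byte, sample[i].codepoint, sample[i].name);
    return 0;
}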

The Unicode standard

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. Unicode aims to provide a standard character encoding scheme, which is universal, efficient, uniform and unambiguous.

Unicode is backward compatible with the 7-bit US-ASCII and 8-bit Latin-1 (ISO-8859-1). So, the first 128 characters are the same as US-ASCII; and the first 256 characters are the same as Latin-1.

Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 and UTF-16.

The original fixed-length Unicode character encodings, UCS-2 and UCS-4 (Universal Character Set, 2-byte and 4-byte), are grossly inefficient if a document contains mainly ASCII characters. So variable-length encoding schemes, such as UTF-8, were created.

UTF-8

UTF-8 uses one byte for any ASCII character and up to four bytes for other characters.

UTF-8 codes are obtained from Unicode code points by a simple transformation:

Bits   Unicode code point            UTF-8 byte sequence                   Bytes
  7    00000000 0xxxxxxx             0xxxxxxx                              1
 11    00000yyy yyxxxxxx             110yyyyy 10xxxxxx                     2
 16    zzzzyyyy yyxxxxxx             1110zzzz 10yyyyyy 10xxxxxx            3
 21    000uuuuu zzzzyyyy yyxxxxxx    11110uuu 10uuzzzz 10yyyyyy 10xxxxxx   4
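
The transformation in the table is mechanical, so it is easy to sketch in code. Below is a minimal C implementation of the encoding direction; the function name and the absence of error checking for invalid code points are my own choices, not part of any standard API.

#include <stdio.h>
#include <stdint.h>

/* Encode one Unicode code point (up to U+10FFFF) into a UTF-8 byte
   sequence, following the table above. Returns the number of bytes
   written to out; out must have room for 4 bytes. */
int utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 7 bits  -> 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {              /* 11 bits -> 110yyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {            /* 16 bits -> 1110zzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                              /* 21 bits -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void)
{
    unsigned char buf[4];
    int n = utf8_encode(0x20AC, buf);     /* U+20AC, the euro sign */
    for (int i = 0; i < n; i++)
        printf("%02X ", buf[i]);          /* prints: E2 82 AC */
    printf("\n");
    return 0;
}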

UTF-16

UTF-16 is, like UTF-8, a variable-length Unicode character encoding scheme; it uses either two or four bytes per character. UTF-16 extends UCS-2, using one 16-bit unit for the characters that were representable in UCS-2 and two 16-bit units (a surrogate pair, 4 bytes) for each of the additional characters. UTF-16 is less commonly used than UTF-8 for files and on the web, although it is the internal string representation of platforms such as Windows and Java.
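
The two-unit case works by subtracting 10000H from the code point and splitting the remaining 20 bits between a high and a low surrogate; this rule is defined by the Unicode standard, though the function name below is my own. A minimal C sketch:

#include <stdio.h>
#include <stdint.h>

/* Split a supplementary code point (above U+FFFF) into a UTF-16
   surrogate pair. */
void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                  /* a 20-bit value */
    *high = (uint16_t)(0xD800 | (v >> 10));     /* high (lead) surrogate */
    *low  = (uint16_t)(0xDC00 | (v & 0x3FF));   /* low (trail) surrogate */
}

int main(void)
{
    uint16_t hi, lo;
    utf16_surrogates(0x1F600, &hi, &lo);  /* U+1F600, an emoji */
    printf("%04X %04X\n", hi, lo);        /* prints: D83D DE00 */
    return 0;
}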

Character organization in text files

For multi-byte character schemes, the order of the bytes in storage must be taken into account. In big-endian order, the most significant byte is stored at the memory location with the lowest address (big byte first); in little-endian order, the opposite (little byte first). Big-endian, which produces a more readable hex dump, is the more common convention for data interchange. Note that UTF-8 is a byte-oriented encoding, so byte order is not an issue for it; endianness only matters for UTF-16 and UTF-32, and a byte-order mark (BOM, the character U+FEFF) is often placed at the start of a file to signal which order was used.
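
A minimal C sketch to check which byte order the machine it runs on uses, by inspecting the first byte of a known 16-bit value in memory:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t value = 0x00FF;
    unsigned char *bytes = (unsigned char *)&value;

    /* The byte at the lowest address reveals the host byte order */
    if (bytes[0] == 0x00)
        printf("big endian: most significant byte first\n");
    else
        printf("little endian: least significant byte first\n");
    return 0;
}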

One more typical issue in file storage is that different operating platforms use different characters as the so-called line delimiter (EOL). Two non-printable control characters are involved: 0AH (line feed, LF) and 0DH (carriage return, CR). By platform:

Unix/Linux family      0AH     "\n"
Windows/DOS            0D0AH   "\r\n"
Mac (classic)          0DH     "\r"

In programming languages such as C/C++/Java, line-feed (0AH) is denoted as '\n', carriage-return (0DH) as '\r' and tab (09H) as '\t'.
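 
As a small worked example of these delimiters and escapes, here is a minimal C sketch (a filter of my own, not a standard tool) that normalizes CR and CRLF line endings to LF while copying standard input to standard output:

#include <stdio.h>

int main(void)
{
    int c;
    while ((c = getchar()) != EOF) {
        if (c == '\r') {                  /* CR: Mac delimiter or first half of CRLF */
            int next = getchar();
            putchar('\n');                /* emit a single LF in either case */
            if (next != '\n' && next != EOF)
                ungetc(next, stdin);      /* lone CR: push the next character back */
        } else {
            putchar(c);
        }
    }
    return 0;
}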


In addition to the Wikipedia articles on each charset mentioned above, I recommend this resource: A Tutorial on Data Representation - Integers, Floating-point numbers, and characters
