Next Previous Contents

4. What is Unicode

Traditionally, character encodings use 8 bits, and are thus limited to 256 characters. This causes problems because:

  1. 256 characters are not enough for some languages;
  2. people whose languages use different encodings have to choose one of them, and have to switch the system's state when changing language, which makes it difficult to mix several languages in the same file;
  3. and so on.

META: The following stuff about what is done by whom is a little fuzzy; I have to investigate that further.

Thus the 16-bit UCS-2 (Universal Character Set, 2 bytes) and the 32-bit UCS-4 (yes, 4 bytes) were created to handle and mix all of the world's scripts. For convenience, the UTF-8 encoding was designed as a variable-length encoding (at most 6 bytes per character) that is compatible with ASCII; every character that has a UCS-4 encoding can be expressed as a UTF-8 sequence, and vice versa.

Note that there is also a standardization effort at ISO (ISO/IEC 10646); the unicode(7) manpage says that it defines the UCS character sets.

The Unicode consortium defines its own standard, named Unicode, which is, I believe, compatible with the ISO 10646 character sets.

See: unicode(7), utf-8(7).
