Unicode Details
The Unicode Character Set, Bits, Bytes, and UTF's
A Character Set is a numbered collection of characters used to represent text. Many of the non-Unicode character sets (e.g. Latin-1) used for individual scripts have 256 elements or codepoints.
Unicode was designed to include the characters needed to represent all scripts and languages in a single character set, and has room for over 1.1 million codepoints (1,114,112, running from U+0000 to U+10FFFF).
When actually used in a computer, each element of a character set must be given a value in terms of 8-bit bytes, and this is called the character encoding (or in email and web page headers, the "charset").
While the older 256-element character sets have a single way (or encoding/charset) for assigning byte values to characters, Unicode has 3 basic options, called Unicode Transformation Formats, or UTF's. UTF-32 represents each character as 4 bytes. UTF-16 represents each one as 2 bytes or 4 bytes depending on its place in the code space. And UTF-8 represents each character as one, two, three, or four bytes.
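As a rough illustration of the byte counts described above, the following Python sketch encodes a few sample characters in each UTF and reports the lengths. The sample characters and the helper name utf_lengths are illustrative choices, not part of any standard.

```python
# Sample characters: U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign),
# U+1D11E (musical G clef, outside the first 65,536 codepoints).
samples = ["A", "\u00e9", "\u20ac", "\U0001d11e"]

def utf_lengths(ch):
    """Return the byte length of ch in UTF-8, UTF-16, and UTF-32."""
    # The "-be" (big-endian) codecs are used so that Python does not
    # prepend a byte order mark, which would inflate the counts.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-be", "utf-32-be"))

for ch in samples:
    print(f"U+{ord(ch):04X}", utf_lengths(ch))
```

UTF-32 is always 4 bytes, UTF-16 is 2 bytes until the character lies beyond U+FFFF, and UTF-8 grows from 1 to 4 bytes as the codepoint value rises.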
UTF-16 and UTF-32 are not normally used to represent Unicode over the internet. This is because various internet processes depend on the recognition of 7-bit ASCII strings to function properly. When such ASCII strings are encoded in the UTF-32 and UTF-16 formats, they become interspersed with bytes of the form 00, which represent the NULL control character. This can make correct recognition difficult. There is also the general possibility of UTF-16/32 bytes being interpreted as 7-bit ASCII when this was not the intention, which could cause major problems. In HTML documents, for example, almost all of the 33 control characters present in 7-bit ASCII are not allowed.
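Both problems can be seen in a short Python sketch. The string "GET" and the character U+3C3E are arbitrary examples chosen for illustration, and hex_bytes is a hypothetical helper, not a library function.

```python
def hex_bytes(data):
    """Format a bytes object as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

text = "GET"  # a plain 7-bit ASCII string

# Big-endian codecs are used so that no byte order mark is added.
print("UTF-8: ", hex_bytes(text.encode("utf-8")))      # 47 45 54
print("UTF-16:", hex_bytes(text.encode("utf-16-be")))  # 00 47 00 45 00 54
print("UTF-32:", hex_bytes(text.encode("utf-32-be")))  # 00 00 00 47 00 00 00 45 00 00 00 54

# The reverse problem: one non-ASCII character whose UTF-16 bytes are
# 3C 3E, which a byte-oriented ASCII reader would see as "<>".
disguised = "\u3c3e".encode("utf-16-be")
print(disguised)  # b'<>'
```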
UTF-8 is a somewhat complex format designed to avoid these problems and make the use of Unicode possible over the internet. Each 7-bit ASCII character is represented by one byte only, and the ASCII byte values (00 to 7F) never appear in the multibyte sequences used to represent all other characters. Thus 7-bit ASCII strings have no 00 bytes in between and are interpreted correctly, and parts of multibyte characters can never be misinterpreted as 7-bit ASCII. The first byte of a multibyte sequence is encoded in a way that indicates how many bytes are being used for the character.
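A minimal Python sketch of these two properties follows. The function name utf8_sequence_length is hypothetical; the bit patterns it tests (0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx) are the standard UTF-8 lead-byte forms, and bytes of the form 10xxxxxx are continuation bytes.

```python
def utf8_sequence_length(first_byte):
    """Return how many bytes a UTF-8 sequence has, judged from its first byte."""
    if first_byte < 0x80:            # 0xxxxxxx: single-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:     # 110xxxxx: start of a 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:    # 1110xxxx: start of a 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:   # 11110xxx: start of a 4-byte sequence
        return 4
    raise ValueError("continuation byte or invalid lead byte")

for ch in ["A", "\u00e9", "\u20ac", "\U0001d11e"]:
    encoded = ch.encode("utf-8")
    length = utf8_sequence_length(encoded[0])
    # In a multibyte sequence every byte is 0x80 or above, so none of
    # them can be mistaken for a 7-bit ASCII character.
    no_ascii_inside = length == 1 or all(b >= 0x80 for b in encoded)
    print(f"U+{ord(ch):04X}", length, no_ascii_inside)
```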
Early versions of Unicode had a smaller codespace, which required 16 bits or 2 bytes for every character. Since UTF-16 hex was identical to the hex codepoint value, a habit arose of using the terms Unicode, two-byte, and UTF-16 as synonyms, or calling UTF-16 "raw Unicode." This is no longer a correct way to employ these words, since Unicode is a character set and not an encoding, its codespace no longer fits in 2 bytes, and all three UTF's represent Unicode equally. But remnants of the old usage are still found in some OS X encoding menus, which list Unicode (meaning UTF-16) and UTF-8 as separate items. These should properly be stated as Unicode (UTF-16) and Unicode (UTF-8).
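The origin of the old habit, and the reason it no longer holds, can be checked with a short Python sketch. The two sample characters are arbitrary illustrations: U+20AC lies within the original 16-bit range, while U+1D11E lies outside it.

```python
# Inside the original 16-bit range, the UTF-16 bytes (big-endian)
# spell out the codepoint value itself.
euro = "\u20ac"
print(f"{ord(euro):04x}", euro.encode("utf-16-be").hex())   # 20ac 20ac

# Beyond U+FFFF, UTF-16 needs two 16-bit units (a surrogate pair),
# so the bytes no longer match the codepoint value.
clef = "\U0001d11e"
print(f"{ord(clef):04x}", clef.encode("utf-16-be").hex())   # 1d11e d834dd1e
```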
Some more info can be found at:
http://czyborra.com/utf/
http://www.unicode.org/unicode/faq/utf_bom.html#1
http://www.cl.cam.ac.uk/~mgk25/unicode.html