What are code points, code units, supplementary characters, and all this other stuff?
A coded character set is a character set (a collection of characters) in which each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter “A” the number 0041₁₆ and the character “€” (the symbol for the euro currency) the number 20AC₁₆. The Unicode standard always uses hexadecimal numbers and writes them with the prefix “U+”, so the number for “A” is written as “U+0041”. Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn’t necessarily assign characters to all of them. The valid code points for Unicode are U+0000 to U+10FFFF. Of these more than a million code points, Unicode 4.0 assigns characters to 96,382. Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from