Unicode

Oops! Looks like 2-byte Unicode (UCS-2, the fixed-width encoding that dates from Unicode 1.0) will be running into trouble shortly; according to two posts on debian-i18n from back in 2000, several Asian charsets will shortly require characters beyond the 65,536 code points that 2 bytes can address, which means using an encoding such as UTF-8 or UCS-4. In particular, correct display of proper nouns in Japanese apparently requires characters from the supplementary planes, the ones beyond the 16-bit range.

UCS-2 is used widely, in MS products and Java. Expect ‘flag days’ galore when this has to change.

Unicode 2.0 introduced a concept called a ‘surrogate pair’ to fix this: a character outside the 16-bit range is encoded as two reserved 16-bit code units. It’s basically introducing multi-unit characters into the supposedly fixed-width UCS-2, so all those ‘length == nchars’ assumptions will break, again. Argh.
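To make the surrogate-pair arithmetic concrete, here is a minimal Python sketch. The helper names `to_surrogate_pair` and `utf16_code_units` are made up for illustration, and the sample character U+20000 (a CJK Extension B ideograph) is just one convenient example of a code point outside the 16-bit range.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000          # 20 bits remain after the offset
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-be")) // 2


# 'a' followed by U+20000: two characters, but three 16-bit code units.
sample = "a\U00020000"

high, low = to_surrogate_pair(0x20000)
print(f"U+20000 -> {high:04X} {low:04X}")
print("characters:", len(sample))
print("UTF-16 code units:", utf16_code_units(sample))
print("UTF-8 bytes:", len(sample.encode("utf-8")))
```

Code written against a 16-bit `char` type sees the larger code-unit count, which is exactly where a ‘length == nchars’ assumption goes wrong.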

Now I know why the Linux vendors are going for UTF-8 instead…