Tim Bray is opening my eyes to lots of the itty bitty details of i18n with Unicode. I had very vague ideas about so many things he’s writing here, so it’s an educational read, especially this:
In Java, characters are represented by the char data type, which is claimed to be a ’16-bit Unicode character’. Unfortunately, as I pointed out recently, there really is no such thing. To be precise, a Java char represents a UTF-16 code point, which may represent a character or may, via the surrogate mechanism, represent only half a character. The consequence of this is that the following methods of the String class can produce results that are incorrect: charAt, getChars, indexOf, lastIndexOf, length, and substring. Of course, if you are really sure that you will never have to deal with an ‘astral-plane’ character, to the point of being willing to accept that your software will break messily if one shows up, you can pretend that these errors can’t happen.
To me, this feels just like deciding that you’ll never have to deal with more than 64K of memory, or a database bigger than 32 bits in size, or a date after December 31, 1999. What Hunter S. Thompson would call ‘bad craziness.’ I’ll settle for ‘shortsighted.’
Wow, and there was I thinking Java had that sorted. If you ever plan to deal with 21st-century-style i18n (i.e. using Unicode), you’d better read these articles.
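To make the surrogate problem concrete, here is a minimal sketch of how an ‘astral-plane’ character trips up length and charAt. The class name SurrogateDemo, the helper names and the sample string are made up for illustration. It also leans on the code-point methods of String (codePointCount, codePointAt, offsetByCodePoints), which only arrived in Java 5, so treat it as one possible workaround rather than anything the articles prescribe.

```java
public class SurrogateDemo {
    // Hypothetical sample: 'a', then U+1D50A (MATHEMATICAL FRAKTUR CAPITAL G),
    // then 'b'. U+1D50A lies outside the Basic Multilingual Plane, so a Java
    // String stores it as two chars: a high and a low surrogate.
    static final String SAMPLE = "a\uD835\uDD0Ab";

    // Counts Unicode code points instead of UTF-16 chars.
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    // Returns the code point at a code-point index (not a char index).
    static int codePointAtIndex(String s, int index) {
        return s.codePointAt(s.offsetByCodePoints(0, index));
    }

    public static void main(String[] args) {
        // length() counts chars, so the single astral character counts twice.
        System.out.println(SAMPLE.length());                                   // 4
        System.out.println(codePointLength(SAMPLE));                           // 3

        // charAt(1) hands back only the first half of the character.
        System.out.println(Character.isHighSurrogate(SAMPLE.charAt(1)));       // true

        // Going through code points recovers the whole character.
        System.out.println(Integer.toHexString(codePointAtIndex(SAMPLE, 1)));  // 1d50a
    }
}
```

The same care applies to substring and indexOf: their indexes are char offsets, so a cut at offset 2 of this sample would land in the middle of the surrogate pair.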
Spam: via BoingBoing, how to extract 500 bucks, painlessly, from telemarketers, under the TCPA. Not yet applicable to spam, but who knows, maybe in a few months’ time…
Open Source: Colm MacCarthaigh caught Dell out a few months ago; turns out they were distributing a wireless AP, the Dell Truemobile 1184, which contained a modified Linux distro — but were not distributing the source to the GPL’ed parts.
Well, all credit to Dell. They’ve admitted their slip-up, resolved the problem admirably and openly, and have shipped Colm a CD-ROM with all the GPL’ed source on it, which Colm has made available here. Mistakes happen, but this one was nicely resolved.