April 27, 2003
Unicode
Tim Bray has a rather wonderful exposition of Unicode and its UTF encodings, Characters vs. Bytes: "Processing UTF-8 characters sequentially is about as efficient, for practical purposes, as any other encoding." Looking forward to the next instalment, wherein it sounds like he'll have some harsh words for the Unicode-is-16-bit-chars brigade.

In practice, character encoding will only give you problems if you ignore it. Don't ignore it: at each place in your application, just define which encoding you're working with (especially if you're looking at data 'on the wire' or on disk). Is this a UTF-8 byte array? An ISO-8859-1 byte array? Is it a string of BMP Unicode? Or something more exotic? It doesn't matter what, as long as you know what. If there's ambiguity, whether about the codepoint encoding or about higher-level encodings (e.g. you don't know whether a string is XML entity-encoded or not, as I saw recently), your application is broken.

Now, someone please, give fonts the same treatment. I had such problems with Windows fonts last time I needed to deal with Asian characters. Has it got better?
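To make the "know which encoding you have" point concrete, here is a minimal Python sketch. The helper names `decode_wire` and `is_bmp_only` are hypothetical, invented for this illustration; the only real APIs used are the standard `str.encode` and `bytes.decode`. It shows the same text producing different byte arrays under UTF-8 and ISO-8859-1, and what happens when bytes are decoded with the wrong assumption.

```python
def decode_wire(data: bytes, encoding: str) -> str:
    """Decode bytes from the wire or disk.

    The encoding has no default on purpose: the caller has to say
    what it believes the bytes are. (Hypothetical helper.)
    """
    return data.decode(encoding, errors="strict")


def is_bmp_only(text: str) -> bool:
    """True if every code point is in the Basic Multilingual Plane."""
    return all(ord(ch) <= 0xFFFF for ch in text)


text = "caf\u00e9"  # "café": four code points

utf8_bytes = text.encode("utf-8")          # é becomes two bytes: C3 A9
latin1_bytes = text.encode("iso-8859-1")   # é becomes one byte: E9

# Right encoding: the round trip is lossless.
round_trip = decode_wire(utf8_bytes, "utf-8")

# Wrong encoding, silent failure: UTF-8 bytes read as ISO-8859-1
# decode without error but give mojibake.
mojibake = decode_wire(utf8_bytes, "iso-8859-1")

# Wrong encoding, loud failure: ISO-8859-1 bytes are not valid UTF-8 here.
try:
    decode_wire(latin1_bytes, "utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The silent case is the dangerous one: nothing throws, and the damage only shows up later, somewhere far from the boundary where the wrong guess was made. Forcing the encoding to be named at every boundary is one way to keep that guess visible.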