September 29, 2002
I missed an encoding.
Weblogging turns out to be a good way to gnaw at unresolved issues. I had an open problem -- a weirdly subtle one -- with characters moving through the pipe, and although just writing about data pipes wouldn't fix the problem, it did help me get to the "aha!" point of seeing what I'd missed. The missing piece: UTF-8 encoding. The code I'm writing here is JavaScript, and by default all strings are considered Unicode. So, to get (say) a "trademark" symbol, I can write String.fromCharCode(0x2122), or write XML markup saying "™" and get a character with numeric value way up in the thousands. Weirdly, these very-high characters passed right through the pipe - in one end, out the other - in both directions, without breaking. Yet more mundane symbols such as the pound sign ("£") and Latin-1 accented characters would detonate along the way. (The actual symptom was a report that my SQL Server was stopped, but that was a red herring.) So, that pound sign was character 163 ("%A3" after URL-escaping), but the URL wanted UTF-8, where the pound becomes two bytes: 194 163, or %C2%A3. UTF-8 is wonderful and slightly strange - don't ever try working backwards through a UTF-8 string!
September 27, 2002
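To make the pound-sign arithmetic from the "I missed an encoding" entry concrete, here is a minimal sketch in present-day JavaScript. It is not the code from the project described there: percentEncodeUtf8 is a hypothetical helper name, and TextEncoder is a modern API assumed to be available (current browsers and Node.js). It only shows the difference between escaping the character code directly and escaping the UTF-8 bytes.

```javascript
// Sketch only: percentEncodeUtf8 is a hypothetical helper, not from the original code.
// It converts a string to UTF-8 bytes and percent-escapes every byte.
function percentEncodeUtf8(text) {
  const bytes = new TextEncoder().encode(text); // TextEncoder always emits UTF-8
  return Array.from(bytes, function (b) {
    return "%" + b.toString(16).toUpperCase().padStart(2, "0");
  }).join("");
}

const pound = String.fromCharCode(163);        // pound sign, U+00A3
const trademark = String.fromCharCode(0x2122); // trademark sign, U+2122

// Legacy escape() uses the character code directly: one byte for the pound sign.
const poundLegacy = escape(pound);                  // "%A3"

// UTF-8 needs two bytes for the pound sign (194 163) and three for the trademark sign.
const poundUtf8 = percentEncodeUtf8(pound);         // "%C2%A3"
const trademarkUtf8 = percentEncodeUtf8(trademark); // "%E2%84%A2"

// The built-in encodeURIComponent also escapes as UTF-8.
const poundBuiltin = encodeURIComponent(pound);     // "%C2%A3"

console.log(poundLegacy, poundUtf8, trademarkUtf8, poundBuiltin);
```

In practice encodeURIComponent is usually all that's needed; the hand-rolled helper is there only to make the byte values visible.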
Data path
Right now my work includes some hairy string handling. We're pulling text and XML from a Web service, parsing it, serializing (and chopping up the resulting strings) and deserializing, storing on disk, pushing back out to Web services... That includes data conversions (dates as milliseconds since the Unix epoch, ISO 8601 text, locale text, and a few others; numbers, currencies, single- and multi-line text), XML entity encoding and decoding, URL encoding... you get the idea. I'm obsessive about doing this right, because if it's wrong it will break unpredictably. Not just with weird international stuff, but with regular users typing regular text. I'm a strong believer in the notion of a "clean data path". This may be influenced by having spent a lot of time in localization work, where it's really important to (a) know what character set you're dealing with, and (b) avoid unnecessary string operations: text strings are way too subtle and complex, and the cause of much pain. Any time you take apart a string, rearrange its contents, or reassemble a string, there's plenty of room for errors. The other influence here is electronic engineering, and the "signal path". What we're building, if it's right, will be a very close analog of another obsession of mine: the Quad 405. It rocks.
September 25, 2002
The views expressed on this weblog are mine alone and do not necessarily reflect the views of my employer.