On trimming the [Unicode] whitespace...

by Michael S. Kaplan, published on 2007/01/07 16:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/07/1430714.aspx

This ought to flush out all the times that the two are different for the Basic Multilingual Plane at least. What are the results?

Okay, so it is not exactly going by the formal Unicode definition of whitespace here. If you need the formal definition (say, because you are writing tools to process all things Unicode), then you'll need to do your own trimming.

But luckily the characters themselves are not ones you would necessarily expect to find when using methods like string.Trim(). :-)
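
For the case where the formal definition really is needed, here is a minimal sketch of "doing your own trimming". It is written in Java purely for illustration, and the class and method names (UnicodeTrim, trim) are made up for this example. It assumes Java 7 or later, where java.util.regex.Pattern accepts the binary property \p{IsWhite_Space}; which characters that property covers depends on the Unicode version bundled with the runtime.

```java
import java.util.regex.Pattern;

class UnicodeTrim {
    // Runs of White_Space characters anchored at the very start or the very end of the input.
    // Assumption: the runtime's regex engine supports the IsWhite_Space binary property (Java 7+).
    private static final Pattern EDGES =
        Pattern.compile("\\A\\p{IsWhite_Space}+|\\p{IsWhite_Space}+\\z");

    // Hypothetical helper: strips leading and trailing White_Space, leaves the middle alone.
    static String trim(String s) {
        return EDGES.matcher(s).replaceAll("");
    }

    public static void main(String[] args) {
        // NO-BREAK SPACE and IDEOGRAPHIC SPACE in front, EM SPACE behind.
        String padded = "\u00a0\u3000text\u2003";
        System.out.println("[" + trim(padded) + "]"); // prints [text]
    }
}
```

Control characters that are not whitespace, such as U+0001, are left in place by this sketch, which is the opposite of what a "remove everything up to the space character" rule would do.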

I spend most of my time in the Java world, so I thought I'd try an equivalent test and see what happens, and the result was much, much worse!

codepoint trimmed whitespace category
U+0000 true false CONTROL
U+0001 true false CONTROL
U+0002 true false CONTROL
U+0003 true false CONTROL
U+0004 true false CONTROL
U+0005 true false CONTROL
U+0006 true false CONTROL
U+0007 true false CONTROL
U+0008 true false CONTROL
U+000e true false CONTROL
U+000f true false CONTROL
U+0010 true false CONTROL
U+0011 true false CONTROL
U+0012 true false CONTROL
U+0013 true false CONTROL
U+0014 true false CONTROL
U+0015 true false CONTROL
U+0016 true false CONTROL
U+0017 true false CONTROL
U+0018 true false CONTROL
U+0019 true false CONTROL
U+001a true false CONTROL
U+001b true false CONTROL
U+1680 false true SPACE_SEPARATOR
U+180e false true SPACE_SEPARATOR
U+2000 false true SPACE_SEPARATOR
U+2001 false true SPACE_SEPARATOR
U+2002 false true SPACE_SEPARATOR
U+2003 false true SPACE_SEPARATOR
U+2004 false true SPACE_SEPARATOR
U+2005 false true SPACE_SEPARATOR
U+2006 false true SPACE_SEPARATOR
U+2008 false true SPACE_SEPARATOR
U+2009 false true SPACE_SEPARATOR
U+200a false true SPACE_SEPARATOR
U+200b false true SPACE_SEPARATOR
U+2028 false true LINE_SEPARATOR
U+2029 false true PARAGRAPH_SEPARATOR
U+205f false true SPACE_SEPARATOR
U+3000 false true SPACE_SEPARATOR
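
A table like the one above could be produced by something along these lines. This is only a sketch: it assumes the "whitespace" column comes from Character.isWhitespace() and the "category" column from Character.getType(), which the comment does not actually state, and the class and method names are invented for the example. The exact rows also depend on the Unicode version bundled with the JDK, so a current runtime may not list every character shown above.

```java
import java.util.ArrayList;
import java.util.List;

class TrimVersusWhitespace {
    // True when String.trim() reduces a one-character string to the empty string.
    static boolean isTrimmed(char c) {
        return String.valueOf(c).trim().isEmpty();
    }

    // Names only the categories that can show up in the comparison; anything else is printed as a number.
    static String categoryName(int type) {
        switch (type) {
            case Character.CONTROL:             return "CONTROL";
            case Character.SPACE_SEPARATOR:     return "SPACE_SEPARATOR";
            case Character.LINE_SEPARATOR:      return "LINE_SEPARATOR";
            case Character.PARAGRAPH_SEPARATOR: return "PARAGRAPH_SEPARATOR";
            default:                            return "TYPE_" + type;
        }
    }

    // One row for every BMP code unit where trim() and isWhitespace() disagree.
    static List<String> disagreements() {
        List<String> rows = new ArrayList<>();
        for (int cp = 0; cp <= 0xFFFF; cp++) {
            char c = (char) cp;
            boolean trimmed = isTrimmed(c);
            boolean whitespace = Character.isWhitespace(c);
            if (trimmed != whitespace) {
                rows.add(String.format("U+%04x %b %b %s",
                        cp, trimmed, whitespace, categoryName(Character.getType(c))));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        System.out.println("codepoint trimmed whitespace category");
        for (String row : disagreements()) {
            System.out.println(row);
        }
    }
}
```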

The trim() algorithm just removes every leading and trailing character <= 0x0020, I suppose for performance reasons (though, to be fair, its documentation does make it very clear that this is exactly how it behaves).
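
To make the difference concrete, here is a rough sketch of that rule next to a variant that asks Character.isWhitespace() instead. Neither method is the JDK's actual source; the names (TrimSketch, trimUpToSpace, trimWhitespace) are hypothetical and exist only for this comparison.

```java
class TrimSketch {
    // Mimics the documented trim() rule: drop leading and trailing chars at or below U+0020.
    static String trimUpToSpace(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && s.charAt(start) <= ' ') {
            start++;
        }
        while (end > start && s.charAt(end - 1) <= ' ') {
            end--;
        }
        return s.substring(start, end);
    }

    // Alternative: drop leading and trailing code points that Character.isWhitespace() accepts.
    static String trimWhitespace(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!Character.isWhitespace(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        String emSpaces = "\u2003text\u2003";
        System.out.println(trimUpToSpace(emSpaces).length());  // prints 6, the EM SPACEs stay
        System.out.println(trimWhitespace(emSpaces).length()); // prints 4, the EM SPACEs go
    }
}
```

Note that the two rules disagree in both directions: the first removes control characters such as U+0001 that are not whitespace, while the second keeps them and removes the higher space separators instead.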

But it makes me wonder what on earth the .NET implementation is doing if it trims all the other higher-up whitespace characters but misses those five! Are they looking at the Unicode data or not? Odd...