by Michael S. Kaplan, published on 2007/01/07 16:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/07/1430714.aspx
The System.String.Trim() method is documented as:
Removes all leading and trailing white-space characters from the current String object.
Let's test that theory out, shall we?
using System;
using System.Globalization;

namespace Testing {
    class TrimTest {
        [STAThread]
        static void Main() {
            // Walk every UTF-16 code unit in the BMP except U+0000.
            for(int ich = 0x0001; ich < 0x10000; ich++) {
                // Did Trim() strip the character appended after "A"?
                bool fTrimmed = ("A" + ((char)ich).ToString()).Trim().Length == 1;
                // Does char.IsWhiteSpace consider it whitespace?
                bool fWhiteSpace = char.IsWhiteSpace((char)ich);
                // Report only the characters where the two disagree.
                if(fTrimmed ^ fWhiteSpace) {
                    Console.WriteLine("U+" + ich.ToString("x4") + " " +
                                      fTrimmed + " " +
                                      fWhiteSpace + " " +
                                      CharUnicodeInfo.GetUnicodeCategory((char)ich));
                }
            }
        }
    }
}
This ought to flush out all the times that the two are different for the Basic Multilingual Plane at least. What are the results?
U+180e False True SpaceSeparator
U+200b True False Format
U+202f False True SpaceSeparator
U+205f False True SpaceSeparator
U+feff True False Format
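The five characters are:
U+180e MONGOLIAN VOWEL SEPARATOR
U+200b ZERO WIDTH SPACE
U+202f NARROW NO-BREAK SPACE
U+205f MEDIUM MATHEMATICAL SPACE
U+feff ZERO WIDTH NO-BREAK SPACE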
Okay, so it is not exactly going by the formal Unicode definition of whitespace here. If you need the formal definition (say, because you are writing tools to process all things Unicode), you'll need to do your own trimming.
But luckily the characters themselves are not ones you would necessarily expect to find when using methods like string.Trim(). :-)
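If you do find yourself needing the formal definition, doing your own trimming could look roughly like the sketch below. It simply defers to char.IsWhiteSpace for every leading and trailing character. The names UnicodeTrim and TrimWhiteSpace are made up for illustration and are not part of the framework, and like the test above it only looks at individual UTF-16 code units.

```csharp
using System;

// Hypothetical helper, not a framework API: trims by asking
// char.IsWhiteSpace about each character instead of using Trim().
static class UnicodeTrim {
    public static string TrimWhiteSpace(string s) {
        if(s == null) throw new ArgumentNullException("s");

        int start = 0;
        int end = s.Length - 1;

        // Skip leading characters that char.IsWhiteSpace reports as whitespace.
        while(start <= end && char.IsWhiteSpace(s[start])) start++;
        // Skip trailing characters the same way.
        while(end >= start && char.IsWhiteSpace(s[end])) end--;

        // For an empty or all-whitespace string this is Substring(start, 0).
        return s.Substring(start, end - start + 1);
    }

    static void Main() {
        // U+202f (NARROW NO-BREAK SPACE) is one of the characters
        // that Trim() left alone in the results above.
        Console.WriteLine(TrimWhiteSpace("A\u202f").Length); // prints 1
    }
}
```

Note that this only covers the "whitespace but not trimmed" direction. It will also leave U+200b and U+feff in place, since char.IsWhiteSpace says they are not whitespace.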
This post brought to you by U+180e (MONGOLIAN VOWEL SEPARATOR)
# MGL on 8 Jan 2007 3:35 AM:
You have displayed remarkable geekery in preferring (fTrimmed ^ fWhiteSpace) to the drearily comprehensible (fTrimmed != fWhiteSpace)!
# Michael S. Kaplan on 8 Jan 2007 3:45 AM:
:-)
I kind of figured that the only people who might actually run the code were people who knew what it meant -- everyone else would just focus on the results....
# Ben Loud on 8 Jan 2007 4:02 AM:
I spend most of my time in the Java world, so I thought I'd try an equivalent test and see what happens, and the result was much, much worse!
codepoint trimmed whitespace category
U+0000 true false CONTROL
U+0001 true false CONTROL
U+0002 true false CONTROL
U+0003 true false CONTROL
U+0004 true false CONTROL
U+0005 true false CONTROL
U+0006 true false CONTROL
U+0007 true false CONTROL
U+0008 true false CONTROL
U+000e true false CONTROL
U+000f true false CONTROL
U+0010 true false CONTROL
U+0011 true false CONTROL
U+0012 true false CONTROL
U+0013 true false CONTROL
U+0014 true false CONTROL
U+0015 true false CONTROL
U+0016 true false CONTROL
U+0017 true false CONTROL
U+0018 true false CONTROL
U+0019 true false CONTROL
U+001a true false CONTROL
U+001b true false CONTROL
U+1680 false true SPACE_SEPARATOR
U+180e false true SPACE_SEPARATOR
U+2000 false true SPACE_SEPARATOR
U+2001 false true SPACE_SEPARATOR
U+2002 false true SPACE_SEPARATOR
U+2003 false true SPACE_SEPARATOR
U+2004 false true SPACE_SEPARATOR
U+2005 false true SPACE_SEPARATOR
U+2006 false true SPACE_SEPARATOR
U+2008 false true SPACE_SEPARATOR
U+2009 false true SPACE_SEPARATOR
U+200a false true SPACE_SEPARATOR
U+200b false true SPACE_SEPARATOR
U+2028 false true LINE_SEPARATOR
U+2029 false true PARAGRAPH_SEPARATOR
U+205f false true SPACE_SEPARATOR
U+3000 false true SPACE_SEPARATOR
The trim() algorithm just removes everything <= 0x0020, I suppose for performance reasons. (Though to be fair, its documentation does make it very clear that this is exactly how it behaves.)
But it makes me wonder what on earth the .NET implementation is doing if it trims all the other higher-up whitespace characters but misses those five! Are they looking at the Unicode data or not? Odd...
# Michael S. Kaplan on 8 Jan 2007 4:17 AM:
The .NET one? It turns out it is using a static list rather than the method that gives the actual Unicode data....