On trimming the [Unicode] whitespace...

by Michael S. Kaplan, published on 2007/01/07 16:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/07/1430714.aspx


The System.String.Trim() method is documented as:

Removes all leading and trailing white-space characters from the current String object.

Let's test that theory out, shall we?

using System;
using System.Globalization;

namespace Testing {
    class TrimTest {

        [STAThread]
        static void Main() {
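            // Walk every UTF-16 code unit in the BMP except U+0000.
            for(int ich = 0x0001; ich < 0x10000; ich++) {
                // Trim() removed the character if only the "A" is left.
                bool fTrimmed = ("A" + ((char)ich).ToString()).Trim().Length == 1;
                bool fWhiteSpace = char.IsWhiteSpace((char)ich);

                // Report only the characters where the two answers disagree.
                if(fTrimmed ^ fWhiteSpace) {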
                    Console.WriteLine("U+" + ich.ToString("x4") + "    " +
                                      fTrimmed + "    " +
                                      fWhiteSpace + "    " +
                                      CharUnicodeInfo.GetUnicodeCategory((char)ich));
                }
            }
        }
    }
}

This ought to flush out all the times that the two are different for the Basic Multilingual Plane at least. What are the results?

U+180e    False    True    SpaceSeparator
U+200b    True    False    Format       
U+202f    False    True    SpaceSeparator
U+205f    False    True    SpaceSeparator
U+feff    True    False    Format       

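The five characters are:

U+180e    MONGOLIAN VOWEL SEPARATOR
U+200b    ZERO WIDTH SPACE
U+202f    NARROW NO-BREAK SPACE
U+205f    MEDIUM MATHEMATICAL SPACE
U+feff    ZERO WIDTH NO-BREAK SPACE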

Okay, so it is not exactly going by the formal Unicode definition of whitespace here. If you need the formal definition (like if you are writing tools to process all things Unicode), then you'll need to do your own trimming.

But luckily the characters themselves are not ones you would necessarily expect to find when using methods like string.Trim(). :-)

 

This post brought to you by U+180e (MONGOLIAN VOWEL SEPARATOR)


# MGL on 8 Jan 2007 3:35 AM:

You have displayed remarkable geekery in preferring (fTrimmed ^ fWhiteSpace) to the drearily comprehensible (fTrimmed != fWhiteSpace)!

# Michael S. Kaplan on 8 Jan 2007 3:45 AM:

:-)

I kind of figured that the only people who might actually run the code were people who knew what it meant -- everyone else would just focus on the results....

# Ben Loud on 8 Jan 2007 4:02 AM:

I spend most of my time in the Java world, so I thought I'd try an equivalent test and see what happens, and the result was much, much worse!

codepoint trimmed whitespace category
U+0000      true    false      CONTROL
U+0001      true    false      CONTROL
U+0002      true    false      CONTROL
U+0003      true    false      CONTROL
U+0004      true    false      CONTROL
U+0005      true    false      CONTROL
U+0006      true    false      CONTROL
U+0007      true    false      CONTROL
U+0008      true    false      CONTROL
U+000e      true    false      CONTROL
U+000f      true    false      CONTROL
U+0010      true    false      CONTROL
U+0011      true    false      CONTROL
U+0012      true    false      CONTROL
U+0013      true    false      CONTROL
U+0014      true    false      CONTROL
U+0015      true    false      CONTROL
U+0016      true    false      CONTROL
U+0017      true    false      CONTROL
U+0018      true    false      CONTROL
U+0019      true    false      CONTROL
U+001a      true    false      CONTROL
U+001b      true    false      CONTROL
U+1680      false    true      SPACE_SEPARATOR
U+180e      false    true      SPACE_SEPARATOR
U+2000      false    true      SPACE_SEPARATOR
U+2001      false    true      SPACE_SEPARATOR
U+2002      false    true      SPACE_SEPARATOR
U+2003      false    true      SPACE_SEPARATOR
U+2004      false    true      SPACE_SEPARATOR
U+2005      false    true      SPACE_SEPARATOR
U+2006      false    true      SPACE_SEPARATOR
U+2008      false    true      SPACE_SEPARATOR
U+2009      false    true      SPACE_SEPARATOR
U+200a      false    true      SPACE_SEPARATOR
U+200b      false    true      SPACE_SEPARATOR
U+2028      false    true      LINE_SEPARATOR
U+2029      false    true      PARAGRAPH_SEPARATOR
U+205f      false    true      SPACE_SEPARATOR
U+3000      false    true      SPACE_SEPARATOR

The trim() algorithm just removes everything <= 0x0020, I suppose for performance reasons (though to be fair, its documentation does make it very clear that this is exactly how it behaves).
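For anyone who wants to reproduce the Java side, here is a minimal sketch. It is not the code that produced the table above. The class name TrimSketch and the helpers trimmedByTrim and trimWhitespace are hypothetical names made up for illustration; String.trim(), Character.isWhitespace() and Character.getType() are standard Java. Which code points get flagged depends on the Unicode version the JVM ships with, so the output need not match the table line for line.

```java
final class TrimSketch {

    // True when String.trim() strips the character off the end of "A" + c.
    // trim() is documented to remove leading and trailing chars whose code
    // is less than or equal to U+0020, and nothing else.
    static boolean trimmedByTrim(char c) {
        return ("A" + c).trim().length() == 1;
    }

    // Hypothetical helper: trims both ends using Character.isWhitespace
    // rather than the "<= U+0020" rule that trim() applies.
    static String trimWhitespace(String s) {
        int start = 0;
        int end = s.length();
        while (start < end && Character.isWhitespace(s.charAt(start))) {
            start++;
        }
        while (end > start && Character.isWhitespace(s.charAt(end - 1))) {
            end--;
        }
        return s.substring(start, end);
    }

    public static void main(String[] args) {
        System.out.println("codepoint trimmed whitespace category");
        for (int ich = 0x0000; ich < 0x10000; ich++) {
            char c = (char) ich;
            boolean trimmed = trimmedByTrim(c);
            boolean whitespace = Character.isWhitespace(c);
            // Report only the characters where the two answers disagree.
            if (trimmed != whitespace) {
                // Character.getType returns the numeric general-category
                // constant (for example Character.CONTROL), not its name.
                System.out.printf("U+%04x      %b    %b      %d%n",
                        ich, trimmed, whitespace, Character.getType(c));
            }
        }
    }
}
```

With this helper, trimWhitespace("\u3000 A\u2003") gives back just "A", while trim() on the same string removes nothing, because both U+3000 and U+2003 sit above U+0020.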

But it makes me wonder what on earth the .NET implementation is doing if it trims all the other higher-up whitespace characters but misses those five! Are they looking at the Unicode data or not? Odd...

# Michael S. Kaplan on 8 Jan 2007 4:17 AM:

The .NET one? It turns out it is using a static list rather than the method that gives the actual Unicode data....

