by Michael S. Kaplan, published on 2006/02/16 15:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/16/533340.aspx
Regular reader Maurits asked in the Suggestion Box:
Eric Gunnerson recently posed a regex challenge to strip nonprintable characters from a string:
http://blogs.msdn.com/ericgu/archive/2006/01/16/513645.aspx
While investigating the IsC Unicode property, I found the apparent paradox that whitespace is sometimes a control character, and sometimes not...
Why is "space" U+0020 not a control character, but "horizontal tab" U+0009 is?
I can't really answer authoritatively for the Unicode Consortium on such points as why U+0020 is in the Unicode general category Zs (Separator, space) while U+0009 is in the general category Cc (Other, control).... :-)
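To see the two categories side by side, here is a minimal sketch using Python's standard `unicodedata` module. The helper name `describe` is made up for illustration, and the exact data depends on the Unicode version bundled with your Python build.

```python
import unicodedata

def describe(ch):
    """Return (code point label, general category, name) for one character.

    Control characters have no formal name in the Unicode Character
    Database, so a placeholder string is returned for them.
    """
    label = f"U+{ord(ch):04X}"
    category = unicodedata.category(ch)
    name = unicodedata.name(ch, "<no name>")
    return label, category, name

# Space and no-break space should report Zs; tab and line feed should report Cc.
for ch in (" ", "\u00a0", "\t", "\n"):
    print(describe(ch))
```

Note that the tab does not even have a character name in the database, which is another small hint that it is treated as a different kind of thing than the space.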
Still, it is pretty easy to see why people generally consider the space to be different from the tab: the latter gets into the whole realm of rich text once you consider features like tab stops.
The space is pretty common in most languages as a word breaker, a role that the tab could in theory fill but generally doesn't in ordinary usage. Even in plain text the tab has specific functionality in common practice, such as starting a new paragraph.
So why are characters in the categories they are in? Usage usually drives the decisions, since the data that Unicode provides, like the general category, is expected to be picked up by implementations to guide behavior. The best answer, therefore, can be gleaned by looking at what the general category is meant to do and what implementations are supposed to do with it....
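As an illustration of an implementation picking up the general category to guide behavior, here is a rough sketch of a filter that drops every character in the "Other" (C) group. This is not Eric Gunnerson's solution or anything the Unicode Consortium prescribes; the function name `strip_other_category` is hypothetical, and treating "category starts with C" as "nonprintable" is an assumption made for the sake of the example.

```python
import unicodedata

def strip_other_category(text):
    """Drop characters whose general category starts with 'C'.

    That covers Cc (control), Cf (format), Cs (surrogate), Co (private use)
    and Cn (unassigned). Everything else, including Zs spaces, is kept.
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("C")
    )

sample = "a b\tc\nd"
# The space (Zs) survives; the tab and the line feed (both Cc) are removed.
print(repr(strip_other_category(sample)))
```

A filter like this shows the practical side of Maurits's paradox: the space stays and the tab goes, purely because of the category each one was assigned.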
This post brought to you by "Ɣ" (U+0194, LATIN CAPITAL LETTER GAMMA)