The difference between C1_SPACE-ing out and drawing a C1_BLANK
by Michael S. Kaplan, published on 2007/06/11 11:39 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/11/3230072.aspx
Over in the microsoft.public.win32.programmer.international newsgroup, PLS asked the following question:
Can someone please explain the difference between C1_SPACE and
C1_BLANK in the character types returned from GetStringTypeEx?
What characters fall in either category?
Thanks,
++PLS
Microsoft can't really take credit for the meaning of the CT_CTYPE1 flags that GetStringTypeW returns (remember not to call GetStringTypeEx, as I pointed out in To Ex or not to Ex? THAT is the question!).
The original meaning comes from the POSIX internationalization world, where the actual definition, if you really look for it, is:
- C1_SPACE -- White-space characters. Characters also specified as C1_UPPER, C1_LOWER, C1_ALPHA, C1_DIGIT, or C1_XDIGIT are not allowed. The characters <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> are automatically included.
- C1_BLANK -- Characters classified as blank. The characters <space> and <tab> are automatically included.
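The same split is visible in the standard C library, whose isspace and isblank predicates follow these two classes. The sketch below is only an illustration under stated assumptions: it pins the "C" locale, and the enum and function names (posix_ws_class, classify_posix, print_posix_whitespace) are made up for this example rather than taken from any API.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stdio.h>

/* Hypothetical labels for this sketch; they are not part of any API. */
enum posix_ws_class { WS_NEITHER, WS_SPACE_ONLY, WS_SPACE_AND_BLANK };

/* Classify one character with the standard <ctype.h> predicates.
   The value is converted to unsigned char, as those functions require. */
enum posix_ws_class classify_posix(char c)
{
    unsigned char uc = (unsigned char)c;

    if (isspace(uc) && isblank(uc))
        return WS_SPACE_AND_BLANK;
    if (isspace(uc))
        return WS_SPACE_ONLY;
    return WS_NEITHER;
}

/* Print the class of each character that the definitions above name explicitly. */
void print_posix_whitespace(void)
{
    static const struct { char ch; const char *name; } named[] = {
        { ' ',  "<space>" },
        { '\t', "<tab>" },
        { '\n', "<newline>" },
        { '\r', "<carriage-return>" },
        { '\f', "<form-feed>" },
        { '\v', "<vertical-tab>" },
    };
    static const char *const labels[] = {
        "neither", "space only", "space and blank"
    };

    /* Pin the locale so the result does not depend on the environment. */
    setlocale(LC_CTYPE, "C");

    for (size_t i = 0; i < sizeof named / sizeof named[0]; i++) {
        enum posix_ws_class k = classify_posix(named[i].ch);
        /* Each of the six named characters must at least be white space. */
        assert(k != WS_NEITHER);
        printf("%-18s %s\n", named[i].name, labels[k]);
    }
}
```

Calling print_posix_whitespace() from a main function lists all six characters; in the "C" locale only <space> and <tab> are reported as both space and blank, and the other four are space only.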
Now given these slightly odd sorts of definitions, the script that Microsoft uses to figure out what to do with its implementation is:
Here is where most of the relevant characters fall (note the C1_DEFINED, which was added in XP and convinced us as a team to be much more cautious about adding ctype values!):
- U+0009 (CHARACTER TABULATION) -- C1_SPACE | C1_BLANK | C1_CTRL | C1_DEFINED
- U+000A (LINE FEED) -- C1_SPACE | C1_CTRL | C1_DEFINED
- U+000B (LINE TABULATION) -- C1_SPACE | C1_CTRL | C1_DEFINED
- U+000C (FORM FEED) -- C1_SPACE | C1_CTRL | C1_DEFINED
- U+000D (CARRIAGE RETURN) -- C1_SPACE | C1_CTRL | C1_DEFINED
- U+0020 (SPACE) -- C1_SPACE | C1_BLANK | C1_DEFINED
- U+00A0 (NO-BREAK SPACE) -- C1_SPACE | C1_BLANK | C1_DEFINED
- U+1680 (OGHAM SPACE MARK) -- C1_SPACE | C1_DEFINED
- U+180E (MONGOLIAN VOWEL SEPARATOR) -- C1_SPACE | C1_DEFINED
- U+2000 (EN QUAD) -- C1_SPACE | C1_DEFINED
- U+2001 (EM QUAD) -- C1_SPACE | C1_DEFINED
- U+2002 (EN SPACE) -- C1_SPACE | C1_DEFINED
- U+2003 (EM SPACE) -- C1_SPACE | C1_DEFINED
- U+2004 (THREE-PER-EM SPACE) -- C1_SPACE | C1_DEFINED
- U+2005 (FOUR-PER-EM SPACE) -- C1_SPACE | C1_DEFINED
- U+2006 (SIX-PER-EM SPACE) -- C1_SPACE | C1_DEFINED
- U+2007 (FIGURE SPACE) -- C1_SPACE | C1_DEFINED
- U+2008 (PUNCTUATION SPACE) -- C1_SPACE | C1_DEFINED
- U+2009 (THIN SPACE) -- C1_SPACE | C1_DEFINED
- U+200A (HAIR SPACE) -- C1_SPACE | C1_DEFINED
- U+202F (NARROW NO-BREAK SPACE) -- C1_SPACE | C1_DEFINED
- U+205F (MEDIUM MATHEMATICAL SPACE) -- C1_SPACE | C1_DEFINED
- U+3000 (IDEOGRAPHIC SPACE) -- C1_SPACE | C1_BLANK | C1_DEFINED
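To make the flag combinations concrete, here is a minimal sketch that transcribes the table as data and tests the bits. It deliberately does not call GetStringTypeW, so it compiles anywhere; on Windows the flags would come from that function with CT_CTYPE1 instead. The numeric values are the ones the Windows SDK header winnls.h gives these constants, while the struct, the ROW_* shorthands, and the function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* CT_CTYPE1 bit values as defined in the Windows SDK header winnls.h
   (on Windows, include <windows.h> instead of defining them by hand).
   That header spells the control flag C1_CNTRL; it is the same bit the
   table lists for the control characters. */
#define C1_SPACE   0x0008
#define C1_CNTRL   0x0020
#define C1_BLANK   0x0040
#define C1_DEFINED 0x0200

/* Hypothetical shorthands for flag combinations that occur in the table. */
enum {
    ROW_SPACE       = C1_SPACE | C1_DEFINED,
    ROW_SPACE_BLANK = C1_SPACE | C1_BLANK | C1_DEFINED,
    ROW_SPACE_CTRL  = C1_SPACE | C1_CNTRL | C1_DEFINED
};

struct ctype1_row { unsigned cp; unsigned short flags; };

/* The rows of the table above, transcribed as data. */
static const struct ctype1_row ctype1_rows[] = {
    { 0x0009, ROW_SPACE_BLANK | C1_CNTRL },
    { 0x000A, ROW_SPACE_CTRL },  { 0x000B, ROW_SPACE_CTRL },
    { 0x000C, ROW_SPACE_CTRL },  { 0x000D, ROW_SPACE_CTRL },
    { 0x0020, ROW_SPACE_BLANK }, { 0x00A0, ROW_SPACE_BLANK },
    { 0x1680, ROW_SPACE },       { 0x180E, ROW_SPACE },
    { 0x2000, ROW_SPACE },       { 0x2001, ROW_SPACE },
    { 0x2002, ROW_SPACE },       { 0x2003, ROW_SPACE },
    { 0x2004, ROW_SPACE },       { 0x2005, ROW_SPACE },
    { 0x2006, ROW_SPACE },       { 0x2007, ROW_SPACE },
    { 0x2008, ROW_SPACE },       { 0x2009, ROW_SPACE },
    { 0x200A, ROW_SPACE },       { 0x202F, ROW_SPACE },
    { 0x205F, ROW_SPACE },       { 0x3000, ROW_SPACE_BLANK },
};
#define ROW_COUNT (sizeof ctype1_rows / sizeof ctype1_rows[0])

/* Return the flags listed for cp, or 0 if the table has no row for it. */
unsigned short ctype1_lookup(unsigned cp)
{
    for (size_t i = 0; i < ROW_COUNT; i++)
        if (ctype1_rows[i].cp == cp)
            return ctype1_rows[i].flags;
    return 0;
}

/* Nonzero when cp is listed as C1_SPACE but not as C1_BLANK. */
int is_space_not_blank(unsigned cp)
{
    unsigned short f = ctype1_lookup(cp);
    return (f & C1_SPACE) != 0 && (f & C1_BLANK) == 0;
}

/* Count the rows that carry C1_BLANK; every row must carry C1_SPACE. */
size_t count_blank_rows(void)
{
    size_t n = 0;
    for (size_t i = 0; i < ROW_COUNT; i++) {
        assert(ctype1_rows[i].flags & C1_SPACE);
        if (ctype1_rows[i].flags & C1_BLANK)
            n++;
    }
    return n;
}

/* Print the code points that are white space without being blank. */
void print_space_only_rows(void)
{
    for (size_t i = 0; i < ROW_COUNT; i++)
        if (is_space_not_blank(ctype1_rows[i].cp))
            printf("U+%04X\n", ctype1_rows[i].cp);
}
```

With this data, is_space_not_blank(0x1680) is nonzero, while is_space_not_blank(0x3000) is 0 because the table marks IDEOGRAPHIC SPACE as C1_BLANK. Only four rows carry C1_BLANK: U+0009, U+0020, U+00A0 and U+3000.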
I even got to learn something when I built this table -- I always assumed that the implementation of char.IsWhiteSpace that added some other random characters (ref: here) was due to silly VB backward compatibility issues.
Which it is.
However, it is clear to me now that the original VB silliness was due to an attempt to support POSIX (probably because internally it used the CRT isspace function, which in turn is dependent on the NLS data returned by GetStringTypeW, above).
It is the down side of assuming anything is silly -- it usually turns out to be the fault of code you used to own at one point!
This post brought to you by (U+1680, a.k.a. OGHAM SPACE MARK)