Who decides the category of a character and why?

by Michael S. Kaplan, published on 2006/02/16 15:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/16/533340.aspx

Regular reader Maurits asked in the Suggestion Box:

Eric Gunnerson recently posed a regex challenge to strip nonprintable characters from a string:


While investigating the IsC Unicode property, I found the apparent paradox that whitespace is sometimes a control character, and sometimes not...

Why is "space" U+0020 not a control character, but "horizontal tab" U+0009 is?

I can't really answer for the Unicode Consortium in an authoritative way on points such as why U+0020 is of the Unicode general category Zs (Separator, space) and U+0009 is of the Unicode general category Cc (Other, control).... :-)
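For anyone who wants to check the two categories for themselves, here is a minimal sketch in Python using the standard `unicodedata` module. The helper name `general_category` is just an illustration and not anything from the original question, and the values reported depend on the Unicode data shipped with your Python version (these two have been stable for a long time).

```python
import unicodedata


def general_category(ch: str) -> str:
    """Return the two-letter Unicode general category of one character."""
    return unicodedata.category(ch)


# Compare the two characters from the question.
for ch in (" ", "\t"):
    print(f"U+{ord(ch):04X} -> {general_category(ch)}")
```

The space should report Zs and the tab Cc, matching the categories named above.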

Still, it is pretty easy to see why people generally consider the space to be different from the tab, since the latter gets into the whole realm of rich text when you consider features like tab stops and such.

The space is pretty common in most languages as a word breaker, a role the tab could fill in theory but generally doesn't in ordinary usage. Even in plain text the tab has specific functionality in common practice, such as indenting the start of a new paragraph.

So why are characters in the categories they are in? Usage usually drives the decisions, since the data that Unicode provides, like the general category, is expected to be picked up by implementations to guide behavior. The best answer, therefore, can be gleaned by looking at what the general category is meant to do and what implementations are supposed to do with it....
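As one illustration of an implementation picking up the general category to guide behavior, here is a rough Python sketch of the kind of "strip nonprintable characters" filter from the original challenge. It assumes that "nonprintable" means "any character whose general category starts with C" (Cc, Cf, Cs, Co, Cn), which is only one possible reading of the challenge; the function name `strip_category_c` is made up for this example.

```python
import unicodedata


def strip_category_c(text: str) -> str:
    """Drop every character whose general category begins with 'C'.

    Assumption: 'nonprintable' is read as the whole 'Other' group
    (Cc, Cf, Cs, Co, Cn). Other definitions are just as defensible.
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("C")
    )


# The tab (Cc) and the bell (Cc) are removed; the space (Zs) survives.
print(repr(strip_category_c("a\tb c\x07")))
```

Under that assumption the tab disappears while the space stays, which is exactly the apparent paradox in the question: the filter is only as sensible as the category assignments it relies on.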


This post brought to you by "Ɣ" (U+0194, LATIN CAPITAL LETTER GAMMA)

# Stuart Ballard on 16 Feb 2006 3:39 PM:

fileformat.info isn't responding for me right now so perhaps it answers this question, but isn't LATIN CAPITAL LETTER GAMMA an oxymoron? I don't have any delusions of linguistic aptitude at all ;) but I've always believed that gamma was, pretty much by definition, a greek character...

# Michael S. Kaplan on 16 Feb 2006 3:49 PM:

I have upgraded my self-description to *notions* of linguistic aptitude, Stuart. :-)

But this one is indeed a Latin character. I'll post about the history of it at some point....

# Stuart Ballard on 16 Feb 2006 4:23 PM:

I remember you upgraded yours, but in my case if I had any they would unquestionably be delusions ;)

# Michael Dunn_ on 16 Feb 2006 4:57 PM:

Perhaps the letters used in IPA are classified as Latin?  IPA uses beta, theta, phi, and two versions of gamma (and maybe others that I'm forgetting).

# Anthony Mills on 16 Feb 2006 5:35 PM:

It's a control character because in ASCII the characters 0-31 were control characters and 32-127 were printable characters.

Control characters were, of course, the ones that weren't actually printed but just jumped the terminal cursor around (13, 10, 9), beeped (7), form-fed the printer (12), etc.

So there you go. It's a control character because of legacy reasons.

# Michael S. Kaplan on 16 Feb 2006 5:54 PM:

Ah, if only it were that easy, Anthony. :-)

Look at many of the other characters in the two categories....

