by Michael S. Kaplan, published on 2007/08/04 01:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/03/4217350.aspx

The question I was asked the other day is the one in the title of this post. I thought I'd give some of the backstory.... 

I remember when I was explaining to Murray Sargent a few years back about how Windows XP and Server 2003 both sorted the Math Alphanumerics range -- a set of over 1000 Unicode characters in the Supplementary Multilingual Plane (more on the allocation stuff here and more on the math range itself in a post from Murray here).

(Conversation reconstructed from memory; Murray can hopefully make any corrections if he feels I misrepresented anything, as I certainly am not trying to!)

"So we do nothing special with them -- we sort them in binary order with the rest of the SMP. Is that okay?"

"I think so, Michael. What else would you do with them?"

"Well, they could be sorted after the letters that they are similar to, maybe."

"Why would you want to do that? These are symbols, used in mathematical expressions."

"True. But what if you were searching for one of them -- would you really want to force people to type the exact code point or would you want something fuzzier?"

"Good point. But that could be done outside of Windows -- just like the input is, in Word [2007 - michkap]."

"Okay, how about if you are building an index and want one or more of them listed. Would you want them to be separate from the letters they look like?"

"Oh, wait. You're right. They should be sorted near the letters they look like."

"That's what I thought you'd say. Too bad we can't do it."

"Why not?"

"Well, our default sorting table is UTF-16 based, and the math alphanumerics would require a compression table -- which requires a locale. And while assigning a new LCID to "Math - Flatland" would get many cool points with math geeks, it is not so sustainable from a business sense. And math is a bit too universal to just put in one locale."

"So what can be done?"
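
To make the UTF-16 point in that exchange concrete, here is a minimal sketch in C of how a supplementary-plane code point such as U+1D46A turns into two UTF-16 code units (a surrogate pair). The helper name `utf16_encode` is hypothetical (it is not a Windows API), and the idea that this two-unit encoding is what makes a compression table necessary is an assumption drawn from the conversation, not something it spells out.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only (not a Windows API).
 * Encodes one Unicode code point as UTF-16.
 * Returns the number of 16-bit code units written to out (1 or 2),
 * or 0 if cp is not a valid Unicode scalar value. */
int utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and lone surrogate code points. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return 0;
    }
    /* BMP code points fit in a single code unit. */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    /* Supplementary planes: split the 20-bit offset into two 10-bit halves. */
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));   /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF)); /* low (trail) surrogate */
    return 2;
}
```

Passing 0x1D46A gives the pair 0xD835 0xDC6A, while an ordinary Latin C (0x43) stays a single unit, so a table keyed on single UTF-16 code units sees the math letter as a sequence rather than as one character.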

I talked to some people over in NLS, and many ideas were considered and discarded for reasons like performance concerns, complexity, and so on.

In the end, the decision was made -- a new LCID was added (an alternate sort on the invariant locale). The SORTID is defined in winnt.h:

    #define SORT_INVARIANT_MATH              0x1     // Invariant (Mathematical Symbols)

So, combined with the invariant locale, you get MAKELCID(MAKELANGID(LANG_INVARIANT, SUBLANG_NEUTRAL), SORT_INVARIANT_MATH) or 0x1007f....
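
As a quick check of that arithmetic, here is a minimal sketch in C that rebuilds the value by hand. The constants and bit layout are local copies of what winnt.h defines (assumed here: LANG_INVARIANT is 0x7f, SUBLANG_NEUTRAL is 0x00, the sublanguage is shifted left by 10 bits, and the sort ID by 16 bits), and the helper function names are hypothetical so the snippet compiles without the Windows headers.

```c
#include <stdint.h>

/* Local copies of values that winnt.h defines; redefined here only so
 * the sketch builds without the Windows SDK. */
#define LANG_INVARIANT      0x7f
#define SUBLANG_NEUTRAL     0x00
#define SORT_INVARIANT_MATH 0x1

/* Hypothetical stand-in for MAKELANGID: sublanguage in the top 6 bits,
 * primary language in the low 10 bits. */
static uint16_t make_langid(uint16_t primary, uint16_t sub)
{
    return (uint16_t)((sub << 10) | primary);
}

/* Hypothetical stand-in for MAKELCID: sort ID above the 16-bit LANGID. */
static uint32_t make_lcid(uint16_t langid, uint16_t sortid)
{
    return ((uint32_t)sortid << 16) | (uint32_t)langid;
}

/* The alternate "math" sort on the invariant locale. */
uint32_t invariant_math_lcid(void)
{
    uint16_t langid = make_langid(LANG_INVARIANT, SUBLANG_NEUTRAL);
    return make_lcid(langid, SORT_INVARIANT_MATH);
}
```

Calling `invariant_math_lcid()` yields 0x1007f, matching the value above. On Windows, an LCID like this is what you would pass as the first argument to a function such as CompareStringW; whether a given version of Windows accepts this particular sort is not something the sketch checks.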

Not quite the Math - Flatland locale, but that's what custom locales are for, right?

Now note this is not perfect -- for example any locale that sorts any of those letters differently (like the way Y comes just after I in Lithuanian) would not see the various associated math symbols moved. But to fix that would mean adding additional alternate sorts for such locales, which could also be less than ideal (it might be the same plumbing that would be underneath a special flag, but is a bigger deal to do the actual work for, in any case).

But for now, at least the problem is solved in the default table - which means English and ~79 other locales. And then when you consider that so many of the 128 locales that have various exceptions don't use either Latin or Greek script and thus wouldn't sort any of these letters differently, it means there are workarounds to get those indexes sorted in many of those other locales, too....


This post brought to you by 𝑪 (U+1d46a, a.k.a. MATHEMATICAL BOLD ITALIC CAPITAL C)
