Microsoft is a Form 'C' shop, Part 1

by Michael S. Kaplan, published on 2007/10/29 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/29/5756924.aspx

Microsoft has had Unicode as a part of its operating system offerings since the earliest days of its 32-bit platforms.

And a lot of that support predates anything that Unicode later chose to provide. Thus we don't use the Unicode Collation Algorithm for our sorting, for years we did not use Unicode normalization for our equivalences, and all kinds of random snafus (like that Tibetan/Myanmar thing with us not picking up Unicode changes when they happened) still manage to pop up after all of these years.

Now for the most part, data coming out of Microsoft's keyboards, data entry methods, functions, methods, and algorithms has always been in what we for years called the precomposed form, which Unicode calls Unicode Normalization Form "C" in their UAX #15. Other than hiccups like code page 1258 (discussed here and here), data always tended to be in Form "C".
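To make the two forms concrete, here is a minimal sketch using Python's standard `unicodedata` module rather than anything Microsoft ships. The helper names `to_form_c` and `to_form_d` are made up for illustration.

```python
import unicodedata

def to_form_c(text: str) -> str:
    """Return text in Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)

def to_form_d(text: str) -> str:
    """Return text in Normalization Form D (fully decomposed)."""
    return unicodedata.normalize("NFD", text)

# One user-perceived character, two different code point sequences.
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# A naive code point comparison treats the two spellings as different strings.
naive_equal = precomposed == decomposed

# After normalizing both to the same form, they compare equal.
normalized_equal = to_form_c(precomposed) == to_form_c(decomposed)
```

The two strings look identical on screen, which is exactly why code that assumes one form can be surprised by the other.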

In fact, if you convert data to Form "D" then there are a bunch of places, like collation, where you won't get the most accurate results, even in Vista where most of the equivalent forms were added to the tables to try to make the impact of using Form "D" text less noticeable....

Yet even today if you convert to Form "D" then all kinds of languages from Korean to Tibetan won't always sort as expected or as designed. And Vista features like LINGUISTIC_IGNORE* flags won't always return exactly equivalent results if you compare Form "C" text to Form "D" text. You are always better off converting text you get from other sources to Form "C" before using the NLS API on it....
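The "convert first" advice can be sketched as a small gatekeeper that normalizes incoming text to Form "C" before anything else touches it. This is an illustration in Python using the standard `unicodedata` module, not the NLS API itself; on Windows the native counterpart would be a call such as `NormalizeString`. The names `accept_external_text` and `same_text` are hypothetical.

```python
import unicodedata

def accept_external_text(text: str) -> str:
    """Normalize text from an outside source to Form C before further use."""
    return unicodedata.normalize("NFC", text)

def same_text(a: str, b: str) -> bool:
    """Compare two strings only after both have been brought to Form C."""
    return accept_external_text(a) == accept_external_text(b)

# A Korean syllable as one precomposed code point (Form C) ...
hangul_c = "\ud55c"
# ... and the same syllable as three conjoining jamo (Form D).
hangul_d = unicodedata.normalize("NFD", hangul_c)

# Sorting raw strings mixes the two spellings; sorting after normalization
# at least guarantees both spellings are treated as the same key.
raw_keys = {hangul_c, hangul_d}
normalized_keys = {accept_external_text(s) for s in (hangul_c, hangul_d)}
```

This only removes the Form "C" versus Form "D" difference; it says nothing about how any particular collation table orders the resulting text.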

Chalk it up to gremlins in the computers and such.... not converting what they do not seem to handle on their own....

Now note that products like Access and SQL Server, being based on similar technologies only updated less often, still have problems even today.

Anyway, future posts in this series will be explaining other consequences of our "Form 'C'- ness". This is just the intro.

This post brought to you by ೀ (U+0cc0, a.k.a. KANNADA VOWEL SIGN II)

# Andrew on 29 Oct 2007 7:24 PM:

I'd assume that Microsoft's "Form 'C'- ness" doesn't apply to its implementation of Vietnamese?

# Michael S. Kaplan on 29 Oct 2007 8:58 PM:

Yes, that's the cp1258 stuff I link to in the post. :-)
