When to make a change, when to stay the same

by Michael S. Kaplan, published on 2008/09/25 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/09/25/8964094.aspx

The roots of this blog run pretty deep.

It starts within months of the very beginning of this Blog, with "FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)", where I pointed out that the NLS API function FoldString, which predates Unicode normalization by many years, provides very Normalization-esque functionality.

In fact, the analogues given are valid on their face since they conceptually do the same thing:

FormC      MAP_PRECOMPOSED
FormD      MAP_COMPOSITE
FormKC     MAP_PRECOMPOSED | MAP_FOLDCZONE
FormKD     MAP_COMPOSITE | MAP_FOLDCZONE
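
To make the Unicode side of that table concrete, here is a minimal sketch using the .NET string.Normalize method (the same method the test code further down calls). The sample characters, the class name, and the Hex helper are assumptions picked for illustration; none of them come from the original post.

```csharp
using System;
using System.Text;

public static class NormalizationFormsSketch {
    // Hypothetical helper: render a string as space-separated UTF-16 code units in hex.
    public static string Hex(string s) {
        StringBuilder sb = new StringBuilder();
        foreach(char ch in s) {
            if(sb.Length > 0) sb.Append(' ');
            sb.Append(((ushort)ch).ToString("x4"));
        }
        return sb.ToString();
    }

    public static void Main() {
        string aRing = "\u00c5";    // LATIN CAPITAL LETTER A WITH RING ABOVE
        string ligature = "\ufb01"; // LATIN SMALL LIGATURE FI (a compatibility character)

        // Canonical forms: FormD splits the precomposed letter, FormC puts it back together.
        Console.WriteLine("FormD:  " + Hex(aRing.Normalize(NormalizationForm.FormD)));     // 0041 030a
        Console.WriteLine("FormC:  " + Hex("A\u030a".Normalize(NormalizationForm.FormC))); // 00c5

        // Canonical composition leaves the ligature alone; the compatibility forms fold it
        // into plain letters, which is the role MAP_FOLDCZONE plays in the FoldString column.
        Console.WriteLine("FormC:  " + Hex(ligature.Normalize(NormalizationForm.FormC)));  // fb01
        Console.WriteLine("FormKC: " + Hex(ligature.Normalize(NormalizationForm.FormKC))); // 0066 0069
        Console.WriteLine("FormKD: " + Hex(ligature.Normalize(NormalizationForm.FormKD))); // 0066 0069
    }
}
```

This only exercises the normalization column; whether FoldString returns the same thing for a given character is exactly the question the rest of this post is about.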

As an interesting side point, the fact that Asmus Freytag was heavily involved with both the Microsoft function when the spec was originally being developed and the UAX when it was being written is not a coincidence -- he was instrumental in both of them.

Of course there can be (or, in this case, is) a wide chasm to deal with.

You know, the chasm between the concept and the reality.

As an example, if you run the following code on an XP machine to compare Microsoft's FoldString functionality with Unicode's normalization, looking just at the Basic Multilingual Plane (U+0001 to U+ffff):

using System;
using System.Text;
using System.Globalization;
using System.Runtime.InteropServices;

public class Test {
    [DllImport("kernel32.dll", CharSet=CharSet.Unicode, EntryPoint="FoldStringW", ExactSpelling=true, CallingConvention=CallingConvention.StdCall)]
    private static extern int FoldString(uint dwMapFlags, string lpSrcStr, int cchSrc, StringBuilder lpDestStr, int cchDest);

    private const uint MAP_PRECOMPOSED = 0x00000020; // convert to precomposed chars
    private const uint MAP_COMPOSITE = 0x00000040; // convert to composite chars

    public static void Main() {
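        // Walk the BMP. U+FFFF is a noncharacter, so stopping just short of it loses nothing.
        for(ushort uch = 0x0001; uch != 0xffff; uch++) {
            // Skip surrogates and unassigned code points.
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory((char)uch);
            if((cat == UnicodeCategory.Surrogate) || (cat == UnicodeCategory.OtherNotAssigned)) {
                continue;
            }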
            StringBuilder sb = new StringBuilder(10);
            string st = ((char)uch).ToString();
            string stFormD = st.Normalize(NormalizationForm.FormD);
            string stComposite;
            // A cchSrc of -1 means null-terminated input, so the returned count includes the terminator.
            int ret = FoldString(MAP_COMPOSITE, st, -1, sb, sb.Capacity);
            if(ret > 0) {
                stComposite = sb.ToString(0, ret - 1);
                if(stComposite != stFormD) {
                    Console.Write("USV: ");
                    Console.Write(uch.ToString("x4"));
                    Console.Write("   |||   Microsoft: ");
                    for(int ich=0; ich < ret - 1; ich++) {
                        Console.Write(((ushort)stComposite[ich]).ToString("x4"));
                        Console.Write(' ');
                    }
                    Console.Write("   |||   Unicode: ");
                    for(int ich=0; ich < stFormD.Length; ich++) {
                        Console.Write(((ushort)stFormD[ich]).ToString("x4"));
                        Console.Write(' ');
                    }
                    Console.WriteLine();
                }
            }
        }
    }
}

You will see that even just looking at the MAP_COMPOSITE vs. Normalization Form D case, there are 12,224 entries that have different results.

Now 11,172 of those are Korean so we'll throw those out for a moment.

There are still 1,052 differences between the two.
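
The 11,172 figure is exactly the number of precomposed Hangul syllables in the range U+AC00 to U+D7A3, and Normalization Form D decomposes every one of them into conjoining jamo. The sketch below checks that on the Unicode side only. The assumption, which the post does not state outright, is that the old FoldString tables left those syllables untouched, which would account for the Korean differences. The class and method names are hypothetical.

```csharp
using System;
using System.Text;

public static class HangulCountSketch {
    // Count the precomposed Hangul syllables that FormD changes.
    public static int CountDecomposed() {
        int changed = 0;
        for(int cp = 0xac00; cp <= 0xd7a3; cp++) {
            string st = ((char)cp).ToString();
            string stFormD = st.Normalize(NormalizationForm.FormD);
            // Each syllable becomes two (LV) or three (LVT) conjoining jamo.
            if(stFormD != st && (stFormD.Length == 2 || stFormD.Length == 3)) {
                changed++;
            }
        }
        return changed;
    }

    public static void Main() {
        // 19 leading consonants * 21 vowels * 28 trailing options (including none) = 11172
        Console.WriteLine(CountDecomposed());
    }
}
```

Subtracting that from the total gives 12,224 - 11,172 = 1,052, the remainder quoted above.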

Now in Vista, the work was done to dump these older tables and instead call the normalization functionality provided by the NormalizeString function that was now also a part of the NLS API. That work involved some interesting tradeoffs that might be worthy of a blog another day, but for now I have a totally different set of things to talk about....

You see, we now have to talk about the other place that these prehistoric normalization-esque tables are used.

In the MultiByteToWideChar function and its MB_PRECOMPOSED and MB_COMPOSITE flags, which map as one would expect to the MAP_PRECOMPOSED and MAP_COMPOSITE flags from FoldString.
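
For anyone who wants to poke at those two flags directly, here is a minimal Windows-only sketch in the same P/Invoke style as the FoldString test above. The flag values are the ones from the Windows SDK headers; the ToWide helper, the class name, the choice of code page 1252, and the sample byte are assumptions for illustration. Note that some code pages (UTF-8, for one) do not accept these flags at all, so the sketch sticks to a plain single-byte code page.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

public static class MbFlagsSketch {
    [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true, SetLastError=true)]
    private static extern int MultiByteToWideChar(uint CodePage, uint dwFlags, byte[] lpMultiByteStr, int cbMultiByte, [Out] char[] lpWideCharStr, int cchWideChar);

    public const uint MB_PRECOMPOSED = 0x00000001; // use precomposed chars
    public const uint MB_COMPOSITE = 0x00000002;   // use composite chars

    // Hypothetical helper: convert bytes from the given code page using the given flags.
    public static string ToWide(uint codePage, uint flags, byte[] bytes) {
        char[] buffer = new char[bytes.Length * 4]; // generous room for decomposed output
        int cch = MultiByteToWideChar(codePage, flags, bytes, bytes.Length, buffer, buffer.Length);
        if(cch == 0) {
            throw new Win32Exception(Marshal.GetLastWin32Error());
        }
        // An explicit source length was passed, so cch does not include a null terminator.
        return new string(buffer, 0, cch);
    }

    public static void Main() {
        byte[] eAcute = { 0xe9 }; // e with acute accent in code page 1252
        foreach(uint flags in new uint[] { MB_PRECOMPOSED, MB_COMPOSITE }) {
            Console.Write("flags 0x" + flags.ToString("x") + ": ");
            foreach(char ch in ToWide(1252, flags, eAcute)) {
                Console.Write(((ushort)ch).ToString("x4") + " ");
            }
            Console.WriteLine();
        }
    }
}
```

With MB_PRECOMPOSED the byte should come back as a single code unit, and with MB_COMPOSITE as a base letter followed by a combining mark. Which exact sequences come back for other characters depends on the tables behind the function, which is the whole point here.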

Now I am not going to pretend that these flags are such great things -- in fact I am on record (ref: "A few of the gotchas of MultiByteToWideChar" and "The MB_PRECOMPOSED flag is stupid, and the MB_COMPOSITE ain't no genius either") explaining how one of them is bad and the other is not needed, since it is the default and can cause bugs by being passed gratuitously.

It is also connected to the WC_COMPOSITECHECK flag for WideCharToMultiByte that I blogged about in "A few of the gotchas of WideCharToMultiByte", though there is not nearly as much to worry about there, except occasionally. We'll ignore it for now. :-)

Now it is true that changing a mapping function whose job is to try its best to map as requested, so that it provides better data, is for the majority of people a good thing. I believe that, and it is a decision I would defend as being the correct behavior.

But changing the behavior of MultiByteToWideChar to cause it to return so many potential differences, that is another kind of decision entirely. In that case I would defend the decision not to change the behavior.

Especially when we are on record as not ever wanting to add, remove, extend, or modify code pages!

Even though this extra operation is not in the code pages themselves, it is behavior that is built into the functions. And claiming on the one hand that we won't make changes ever while deciding on the other hand to make over a thousand changes? It is easy to imagine customers being unhappy with the end result.

Sponsor? We don't need no stinking sponsor! :-)
