What is equal to some may not be equal to others

by Michael S. Kaplan, published on 2006/02/02 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/02/522919.aspx

Someone going by the handle AC asked me via email:

You have mentioned that Google has trouble with equal strings not being treated equally. But you have never talked about similar problems with Office or other Microsoft applications. Do other problems not exist or are you not allowed to talk about that?

Hmmm. Well, it is true, I have talked about Google's problems with Unicode canonical equivalence in the past, but mainly in this post, which also talks about similar problems in Microsoft's various web search technologies.

I am really not in the "Google bashing" business, but I am also not a paid spokesperson for them, either. I am paid by Microsoft, but not as an apologist, so when something is wrong in an MS product I have not been shy about talking about it....

Now I have not covered as much in the Office world as in Windows mainly because I know more about the latter than the former (except for stuff in Microsoft Access, which I have posted about before).

But (to throw a random example into the mix) I can mention that the Find/Replace functionality in Word vs. Notepad could use a little help. :-)

We will test with the following strings:

aaeå   U+0061 U+0061 U+0065 U+00e5

aæå    U+0061 U+00e6 U+0061 U+030a

Not very fair, but then I never recall claiming fairness. :-)

Now if you search for one of these strings in Notepad, using the other as the string you are looking for, they will find one another!

But if you try to do the same in Word, you will not....

And here are two more strings:

aåå   U+0061 U+00e5 U+0061 U+030a

aåå   U+0061 U+0061 U+030a U+00e5

Once again, with Notepad one string will find the other, and with Word no match will be found!
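To make the difference between the two pairs concrete, here is a minimal Python sketch using the standard unicodedata module. It is not what Notepad or Word actually do; it only checks which pairs are canonically equivalent (identical after NFC normalization). The variable names are made up for illustration. The second pair is canonically equivalent, while the first pair is not, since æ (U+00e6) has no Unicode decomposition to "ae" and only a linguistic comparison such as CompareString equates them.

```python
import unicodedata

# The four test strings from the post, spelled out as code points.
pair1_a = "\u0061\u0061\u0065\u00e5"   # a a e å (precomposed ring)
pair1_b = "\u0061\u00e6\u0061\u030a"   # a æ a + combining ring
pair2_a = "\u0061\u00e5\u0061\u030a"   # a å a + combining ring
pair2_b = "\u0061\u0061\u030a\u00e5"   # a a + combining ring å


def nfc(s: str) -> str:
    """Return the NFC (canonical composition) form of s."""
    return unicodedata.normalize("NFC", s)


# A binary comparison sees different code points in both pairs.
binary_equal_pair1 = pair1_a == pair1_b
binary_equal_pair2 = pair2_a == pair2_b

# After normalization only the second pair becomes identical.
canonical_equal_pair1 = nfc(pair1_a) == nfc(pair1_b)
canonical_equal_pair2 = nfc(pair2_a) == nfc(pair2_b)

print("pair 1:", binary_equal_pair1, canonical_equal_pair1)
print("pair 2:", binary_equal_pair2, canonical_equal_pair2)
```

So a tool that normalized both sides before a binary comparison would match the second pair but still miss the first one.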

(For those keeping score: Visual Studio will not find the strings either. Excel will find them, while Access will not.)

Now I have stacked the deck a little bit here, since as I have pointed out before, CompareString will treat the ae ligature (æ) as being equal to the letters "ae". But by using the precomposed form of å (U+00e5) in one string and the composite form (U+0061 U+030a) in the other, I have created two strings that have the same length and which CompareString will treat as equal (on most user locales, at least).

I also was armed with the knowledge of how Notepad is doing its searching -- it looks for the first character in the string, and on a match it compares the full string with CompareString. Since it assumes the lengths will be the same, everything works!

(I could have easily made Notepad fail by changing the length or doing a case sensitive search -- which Notepad treats as a binary comparison. But like I said, I was not trying to play fair!)
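The search strategy described above can be sketched roughly in Python. Everything here is an assumption made for illustration: `loosely_equal` is a hypothetical stand-in for CompareString (it only decomposes with NFD and expands æ to "ae", which is far less than the real API does), `notepad_like_find` is a made-up name, and the first-character test is assumed to be a plain binary comparison. It is not Notepad's actual code.

```python
import unicodedata


def loosely_equal(a: str, b: str) -> bool:
    """Hypothetical stand-in for CompareString: NFD plus ae-ligature expansion."""
    def key(s: str) -> str:
        return unicodedata.normalize("NFD", s).replace("\u00e6", "ae")
    return key(a) == key(b)


def notepad_like_find(haystack: str, needle: str) -> int:
    """Find needle by first character, then compare a same-length window."""
    n = len(needle)
    for i, ch in enumerate(haystack):
        # Assumption: the first character is matched by a binary comparison.
        if ch == needle[0] and loosely_equal(haystack[i:i + n], needle):
            return i
    return -1


s1 = "\u0061\u0061\u0065\u00e5"   # aaeå
s2 = "\u0061\u00e6\u0061\u030a"   # aæå with a combining ring

# Same length, same first character: the strings find each other.
same_length_hit = notepad_like_find("xx" + s2 + "!", s1)

# Fully decomposed needle is one unit longer than the text: no match.
longer_needle_hit = notepad_like_find(s1 + "!", "aaea\u030a")

# Needle starting with the ligature: the first-character test never fires.
ligature_first_hit = notepad_like_find("aex", "\u00e6x")

print(same_length_hit, longer_needle_hit, ligature_first_hit)
```

The two failing cases show why the same-length assumption and the first-character shortcut only work for carefully chosen strings, even though the comparison function itself considers the text equal.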

The important point here is that by using CompareString, Notepad makes it part way to canonical equivalence. As I pointed out back in 2004 in the post Normalization and Microsoft -- what's the story?, we do cover a lot of this in the NLS API, and we have been doing so since the time that the formal definition for canonical equivalence was being formed in Unicode.

If it were not for that string length issue, then everything would be fine there!

Now what Word is doing here is anyone's guess -- it has no problems with case insensitivity, but it is not treating as equal any of what CompareString treats as equal. Even though it is ignoring diacritics and case, it still is not treating the strings as being equal.

Whatever Word, Access, and Visual Studio are doing here, they are doing it wrong, and Notepad is getting it half right.

It may be time to start getting more of these applications to do things correctly, huh?

(Notepad actually does get better in Vista -- check out the February CTP; I hope that the others will get better in future versions of Office and VS! And I especially hope that both MS and Google can get better at this in their respective search technologies, since some languages are really suffering in the meantime!)


This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)

# Christian Kaiser on 2 Feb 2006 4:44 AM:

Well Notepad will also fail if the 'ae' is the first character (as comparing the first character will then fail), so it's also only correct in SOME cases.

I'm not saying anything in favour of Office here, just that any of these comparisons are flawed.

Do you see a faster way to do a string comparison than to use CompareString(), just like NOTEPAD's programmer's idea, but correct?


# Michael S. Kaplan on 2 Feb 2006 10:01 AM:

Hi Christian!

The strings are not flawed, they were specifically constructed to make it superficially appear that only Notepad and Excel know how to do stuff. :-)

It is an actually interesting "interview question", so I was going to ask folks what *they* thought might be faster....

# Michael S. Kaplan on 2 Feb 2006 10:37 AM:

See http://blogs.msdn.com/michkap/archive/2006/02/02/523189.aspx for the question. :-)

