by Michael S. Kaplan, published on 2006/02/02 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/02/522919.aspx
Someone going by the handle AC asked me via email:
You have mentioned that Google has trouble with equal strings not being treated equally. But you have never talked about similar problems with Office or other Microsoft applications. Do other problems not exist or are you not allowed to talk about that?
Hmmm. Well, it is true, I have talked about Google's problems with Unicode canonical equivalence in the past, but mainly in this post, which also talks about similar problems in Microsoft's various web search technologies.
I am really not in the "Google bashing" business, but I am also not a paid spokesperson for them, either. I am paid by Microsoft, but not as an apologist, so when something is wrong in an MS product I have not been shy about talking about it....
Now I have not covered as much in the Office world as in Windows mainly because I know more about the latter than the former (except for stuff in Microsoft Access, which I have posted about before).
But (to throw a random example into the mix) I can mention that the Find/Replace functionality in Word vs. Notepad could use a little help. :-)
We will test with the following strings:
aaeå U+0061 U+0061 U+0065 U+00e5
aæå U+0061 U+00e6 U+0061 U+030a
Not very fair, but then I never recall claiming fairness. :-)
Now if you search for one of these strings within Notepad, using the other as the string you are looking for, they will find one another!
But if you try to do the same in Word, you will not....
And here are two more strings:
aåå U+0061 U+00e5 U+0061 U+030a
aåå U+0061 U+0061 U+030a U+00e5
Once again, with Notepad one string will find the other, and with Word no match will be found!
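The second pair can be checked outside of any of these applications. The sketch below uses Python's standard unicodedata module as a stand-in (an assumption for illustration; it is not what Notepad, Word, or CompareString do internally) to show that the two strings have the same length, differ in a binary comparison, and become identical once normalized to NFC:

```python
import unicodedata

# The two strings from the post, spelled out as code points.
s1 = "\u0061\u00e5\u0061\u030a"   # a, å (precomposed), a, combining ring above
s2 = "\u0061\u0061\u030a\u00e5"   # a, a, combining ring above, å (precomposed)

same_length = len(s1) == len(s2)          # both are four code points
binary_equal = s1 == s2                   # code point by code point
canonically_equal = (
    unicodedata.normalize("NFC", s1) == unicodedata.normalize("NFC", s2)
)

print(same_length, binary_equal, canonically_equal)  # True False True
```

A search that compares code points directly sees two different strings here, which is consistent with the "no match" result, while anything that respects canonical equivalence should treat them as the same text.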
(For those keeping score, Visual Studio will not find the strings either. Excel will find them, while Access will not.)
Now I have stacked the deck a little bit here, since as I have pointed out before, CompareString will treat the ae ligature (æ) as being equal to the letters "ae". But by using the precomposed and composite forms of a ring in each, I have created two strings that have the same length and which CompareString will treat as equal (on most user locales, at least).
I also was armed with the knowledge of how Notepad is doing its searching -- it looks for the first character in the string, and on a match it compares the full string with CompareString. Since it assumes the lengths will be the same, everything works!
(I could have easily made Notepad fail by changing the length or doing a case sensitive search -- which Notepad treats as a binary comparison. But like I said, I was not trying to play fair!)
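The search strategy described above, and the way the length assumption breaks it, can be sketched roughly as follows. This is only an illustration under stated assumptions: collation_key is a hypothetical stand-in for the equality CompareString provides (NFC normalization plus a one-entry expansion of æ to "ae"), the first-character check is assumed to be a plain binary match, and notepad_style_find is an invented name, not Notepad's actual code.

```python
import unicodedata


def collation_key(s: str) -> str:
    """Hypothetical stand-in for what CompareString treats as equal."""
    s = s.replace("\u00e6", "ae")            # assumed expansion of the ae ligature
    return unicodedata.normalize("NFC", s)   # fold precomposed/composite forms


def notepad_style_find(haystack: str, needle: str) -> int:
    """Return the index of the first match, or -1 if there is none."""
    n = len(needle)
    if n == 0:
        return 0
    for i in range(len(haystack) - n + 1):
        # Step 1: look for the first character (assumed binary match).
        if haystack[i] != needle[0]:
            continue
        # Step 2: compare a window of the SAME length as the needle.
        if collation_key(haystack[i:i + n]) == collation_key(needle):
            return i
    return -1


# Same-length pairs from the post: both are found.
print(notepad_style_find("aae\u00e5", "a\u00e6a\u030a"))       # 0
print(notepad_style_find("a\u00e5a\u030a", "aa\u030a\u00e5"))  # 0

# Equivalent text with a different length: the fixed window misses it.
print(notepad_style_find("aa\u030a", "a\u00e5"))               # -1
```

In the last call the haystack is "a" followed by a composite å, which is the same text as the needle, but no two-character window lines up with it, so the search comes back empty.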
The important point here is that by using CompareString, Notepad makes it part way to canonical equivalence. As I pointed out back in 2004 in the post Normalization and Microsoft -- what's the story?, we do cover a lot of this in the NLS API, and we have been doing so since the time that the formal definition for canonical equivalence was being formed in Unicode.
If it were not for that string length issue, then everything would be fine there!
Now what Word is doing here is anyone's guess -- it has no problems with case insensitivity, but it is not treating as equal any of what CompareString treats as equal. Even with diacritics and case ignored, it still is not treating the strings as being equal.
Whatever Word, Access, and Visual Studio are doing here, they are doing it wrong, and Notepad is getting it half right.
It may be time to start getting more of these applications to do things correctly, huh?
(Notepad actually does get better in Vista -- check out the February CTP; I hope that the others will get better in future versions of Office and VS! And I especially hope that both MS and Google can get better at this in their respective search technologies, since some languages are really suffering in the meantime!)
This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)