SIAO to Search engines -- would you please normalize, already?

by Michael S. Kaplan, published on 2005/11/15 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/15/492301.aspx


Suzanne has been riffing on me in relation to Vietnamese and then she shifted over to talk about Google and other languages, so I thought I would riff off of her a bit. :-)

By the way Suzanne -- I did not find your terminology to be inaccurate; it was just different. I was explaining why I was confused!

I am going to take her string bãi biển from that first article and run it through various search engines in various forms, looking for patterns. I am not using Suzanne's test (using Google Images to look for pictures of beaches) since not all of the search engines have a comparable service. It is pretty easy to tell from the text excerpts whether the search has found Vietnamese sites or not....

First, the engines:

  1. google.com
  2. search.msn.com
  3. start.com
  4. live.com
  5. altavista.com
  6. ask.com
  7. excite.com
  8. yahoo.com

Second, the strings to test:

  1. Normalization Form C: bãi biển (0062 00e3      0069 0020 0062 0069 1ec3           006e)
  2. Intermediate form   : bãi biển (0062 0061 0303 0069 0020 0062 0069 00ea 0309      006e)
  3.  " "  " w/o accents : bai biên (0062 0061      0069 0020 0062 0069 00ea           006e)
  4. Normalization Form D: bãi biển (0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e)
  5.  " "  " w/o accents : bai bien (0062 0061      0069 0020 0062 0069 0065           006e)

Now if you look at these five strings being tested, #1, #2, and #4 are all canonically equivalent and thus should give identical results in search engines that conform to Unicode and its principles of canonical equivalence.
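For anyone who wants to check the equivalence claim locally, here is a minimal sketch in Python using the standard `unicodedata` module. The helper names (`from_hex`, `canonically_equivalent`, `strip_marks`) are made up for this illustration, and treating "w/o accents" as "drop the combining marks" is my reading of the list above, not something any search engine is known to do.

```python
import unicodedata

def from_hex(codepoints: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in codepoints.split())

# The five test strings, keyed by their number in the list above.
forms = {
    1: from_hex("0062 00e3 0069 0020 0062 0069 1ec3 006e"),
    2: from_hex("0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e"),
    3: from_hex("0062 0061 0069 0020 0062 0069 00ea 006e"),
    4: from_hex("0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e"),
    5: from_hex("0062 0061 0069 0020 0062 0069 0065 006e"),
}

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def strip_marks(s: str) -> str:
    """Drop nonspacing combining marks (category Mn) without normalizing first."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Mn")

# Compare every string against #1.
for n, s in forms.items():
    print(n, canonically_equivalent(forms[1], s))
```

Under these assumptions, #1, #2, and #4 compare as equivalent while #3 and #5 do not, and `strip_marks` applied to #2 and #4 yields #3 and #5 respectively. An engine that normalized both the index and the query to a single form before comparing would treat #1, #2, and #4 as the same query.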

I will put the strings in double quotes for all search engines.

And here are the comparison results:

| Engine | #1 bãi biển | #2 bãi biển | #3 bai biên | #4 bãi biển | #5 bai bien |
|---|---|---|---|---|---|
| google | 76,200 | 1,830 | 2,360 | 1,390 | 1,380 |
| msn | 14,702 | 190 | 678 | 0 | 678 |
| start | 14,703 | 191 | 679 | 0 | 679 |
| live | 14,702 | 190 | 678 | 0 | 678 |
| altavista | 59,700 | 59,700 | 509 | 59,700 | 1,040 |
| ask | 98 | 23 | 1 | 631 | 354 |
| excite | 42/18 | 3/3 | 55/0 | 3/0 | 64/64 |
| yahoo | 57,600 | 57,600 | 493 | 57,600 | 1,030 |

Conclusions:

 

This post brought to you by "ổ" (U+1ed5, a.k.a. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE)


# Andrew West on 15 Nov 2005 5:01 AM:

"Andrew West's claim that Google is normalizing appears to be crap"

NOT ME! "Andrew C." is someone else ... I don't make crap claims ;)

... and anyway he claimed that Google was *not* normalizing.

# Michael S. Kaplan on 15 Nov 2005 6:17 AM:

Sincerest apologies, Andrew -- I have removed the bogus reference to you. It was Simon who was trying to make the claim (I am not sure who Simon is here).

Correction made, and again I am very sorry.

# Suzanne McCarthy on 15 Nov 2005 7:02 PM:

Isn't Simon's claim about Greek correct? Normalization is happening for some sequences but not others. I get the same results in French and German with and without the precomposed 'vowels plus diacritics'.

# Suzanne McCarthy on 15 Nov 2005 7:09 PM:

BTW I forgot to add that I really appreciate this little experiment that you created here. Thank you, Mike.

# Michael S. Kaplan on 15 Nov 2005 7:30 PM:

If I had to guess, I would say they are not using Unicode normalization at all -- they are building their own homegrown system that happens to work in some cases but not others....

# Michael S. Kaplan on 15 Nov 2005 8:19 PM:

You are very welcome, Suzanne. It was fun. :-)

# Jim on 17 Nov 2005 5:44 AM:

Ahem, a Microsoft employee rubbishing Google through speculation...

# Michael S. Kaplan on 17 Nov 2005 8:31 AM:

Hi Jim,

This is not rubbish -- test it yourself. Canonically equivalent Unicode forms do not find the same pages.

If I were trying to diss just Google I would not have pointed out that Microsoft is also guilty though, obviously. Both of them need to do this.

# Mike on 17 Nov 2005 7:46 PM:

Very good info - hopefully this will be a big wakeup call for the big 2. I hope those that have the power to do something about this have been "educated" now.


