SIAO to Search engines -- would you please normalize, already?

by Michael S. Kaplan, published on 2005/11/15 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/15/492301.aspx

Suzanne has been riffing on me in relation to Vietnamese and then she shifted over to talk about Google and other languages, so I thought I would riff off of her a bit. :-)

By the way Suzanne -- I did not find your terminology to be inaccurate; it was just different. I was explaining why I was confused!

I am going to take her string bãi biển from that first article and run it through various search engines in various forms, looking for some patterns. I am not using Suzanne's test (using Google Images to look for pictures of beaches), since not all of the search engines have a comparable service. It is pretty easy to tell from the text excerpts whether the search has found Vietnamese sites or not....

First, the engines:

Second, the strings to test:

#1 Normalization Form C: bãi biển (0062 00e3 0069 0020 0062 0069 1ec3 006e)
#2 Intermediate form: bãi biển (0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e)
#3 Intermediate form w/o accents: bai biên (0062 0061 0069 0020 0062 0069 00ea 006e)
#4 Normalization Form D: bãi biển (0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e)
#5 Normalization Form D w/o accents: bai bien (0062 0061 0069 0020 0062 0069 0065 006e)

Of these five strings, #1, #2, and #4 are all canonically equivalent, and thus should give identical results in search engines that conform to Unicode and its principles of canonical equivalence.
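The equivalence claim is easy to check locally. Here is a minimal sketch in Python using the standard unicodedata module; the helper names (from_hex, canonically_equivalent) and the FORMS table are made up for illustration, and the strings are built from the code points listed above rather than pasted, so nothing gets silently normalized along the way.

```python
import unicodedata

def from_hex(code_points: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(cp, 16)) for cp in code_points.split())

# The five test strings, keyed by their number in the list above.
FORMS = {
    1: from_hex("0062 00e3 0069 0020 0062 0069 1ec3 006e"),
    2: from_hex("0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e"),
    3: from_hex("0062 0061 0069 0020 0062 0069 00ea 006e"),
    4: from_hex("0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e"),
    5: from_hex("0062 0061 0069 0020 0062 0069 0065 006e"),
}

def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Compare every string against #1 (the Normalization Form C string).
for number, text in FORMS.items():
    print(number, canonically_equivalent(FORMS[1], text))
```

Comparing NFD forms is just one way to do it; comparing NFC forms would give the same answers, since both are canonical normalization forms.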

I will put the strings in double quotes for all search engines.
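For anyone who wants to repeat the test, the quoted query can be built as follows. This is only a sketch: it assumes the query is sent as percent-encoded UTF-8, which may not be how every engine's search form submits it, and quoted_query is a made-up helper name. It does show that canonically equivalent forms go over the wire as different byte sequences, so any equivalence has to be established on the engine's side.

```python
from urllib.parse import quote

# Forms #1 and #4 from the list, written with escapes so that an editor
# cannot silently normalize them.
NFC = "b\u00e3i bi\u1ec3n"
NFD = "ba\u0303i bie\u0302\u0309n"

def quoted_query(text: str) -> str:
    """Wrap the text in double quotes and percent-encode it as UTF-8."""
    return quote('"' + text + '"')

print(quoted_query(NFC))
print(quoted_query(NFD))
```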

And here are the comparison results:

Engine	#1 bãi biển	#2 bãi biển	#3 bai biên	#4 bãi biển	#5 bai bien
google	76,200	1,830	2,360	1,390	1,380
msn	14,702	190	678	0	678
start	14,703	191	679	0	679
live	14,702	190	678	0	678
altavista	59,700	59,700	509	59,700	1,040
ask	98	23	1	631	354
excite	42/18	3/3	55/0	3/0	64/64
yahoo	57,600	57,600	493	57,600	1,030

Conclusions:

Excite (using the default settings) does not understand Unicode from UNICEF, and the only reason it had 42 hits from Form C was that it munged that second word into "bi" and linked to inappropriate sites that I would never really be interested in looking at, even if they were in Vietnamese. This is (by the way) why the link is not live. :-)
Excite (using advanced search, which defaults to all languages) fared a little better, with Suzanne's post high up on the list for #2 and #5.
Lycos (which I did not put in the table) returned results that were practically identical to ask.com.
That one link on ask.com and lycos.com for #3 is to Suzanne's post, which I think is pretty funny. :-)
Both altavista and yahoo appear to be using Unicode normalization, returning identical results for all canonically equivalent forms.
Google appears to be stripping combining characters out of either its search strings, its indexes, or both.
Any claim that Google is normalizing appears to be crap -- at least insofar as one considers Unicode normalization. They are doing their own thing rather than the standard. But then they are only Associate members of Unicode, so I guess they aren't in all the way just yet....
Microsoft (msn, start, and live) has a whole bunch of work to do, and I am having trouble fathoming what precisely they are doing.
Not enough of the Search community is taking the importance of canonical equivalence seriously, to the detriment of many language communities, including Vietnamese.
Until that changes, a better keyboard solution for Vietnamese in particular suddenly seems more and more compelling.
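To illustrate the difference between stripping combining characters and normalizing, here is a small Python sketch. It is not a description of what any of these engines actually does (that is not known here); the function names are invented for the example. It only shows that stripping combining characters without normalizing first turns the three canonically equivalent strings into three different results, while decomposing to Form D before stripping treats them all alike.

```python
import unicodedata

# Forms #1, #2 and #4 from the list of test strings, written with escapes.
NFC = "b\u00e3i bi\u1ec3n"
INTERMEDIATE = "ba\u0303i bi\u00ea\u0309n"
NFD = "ba\u0303i bie\u0302\u0309n"

def strip_combining(text: str) -> str:
    """Drop every combining character, without normalizing first."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

def strip_after_nfd(text: str) -> str:
    """Decompose to Normalization Form D, then drop combining characters."""
    return strip_combining(unicodedata.normalize("NFD", text))

for label, text in (("#1", NFC), ("#2", INTERMEDIATE), ("#4", NFD)):
    # ascii() makes any remaining non-ASCII letters visible as escapes.
    print(label, ascii(strip_combining(text)), ascii(strip_after_nfd(text)))
```

Naive stripping leaves #1 untouched, turns #2 into string #3, and turns #4 into string #5, which is exactly how the two "w/o accents" test strings relate to their accented forms.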

This post brought to you by "ổ" (U+1ed5, a.k.a. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE)

# Andrew West on 15 Nov 2005 5:01 AM:

"Andrew West's claim that Google is normalizing appears to be crap"

NOT ME! "Andrew C." is someone else ... I don't make crap claims ;)

... and anyway he claimed that Google was *not* normalizing.

# Michael S. Kaplan on 15 Nov 2005 6:17 AM:

Sincerest apologies, Andrew -- I have removed the bogus reference to you. It was Simon who was trying to make the claim (I am not sure who Simon is here).

Correction made, and again I am very sorry.

# Suzanne McCarthy on 15 Nov 2005 7:02 PM:

Isn't Simon's claim about Greek correct? Normalization is happening for some sequences but not others. I get the same results in French and German with and without the precomposed 'vowels plus diacritics'.

# Suzanne McCarthy on 15 Nov 2005 7:09 PM:

BTW I forgot to add that I really appreciate this little experiment that you created here. Thank you, Mike.

# Michael S. Kaplan on 15 Nov 2005 7:30 PM:

If I had to guess, I would say they are not using Unicode normalization at all -- they are building their own homegrown system that happens to work in some cases but not others....

# Michael S. Kaplan on 15 Nov 2005 8:19 PM:

You are very welcome, Suzanne. It was fun. :-)

# Jim on 17 Nov 2005 5:44 AM:

Ahem, a microsoft employee rubbishing google through speculation...

# Michael S. Kaplan on 17 Nov 2005 8:31 AM:

Hi Jim,

This is not rubbish -- test it yourself. Canonically equivalent Unicode forms do not find the same pages.

If I were trying to diss just Google I would not have pointed out that Microsoft is also guilty though, obviously. Both of them need to do this.

# Mike on 17 Nov 2005 7:46 PM:

Very good info - hopefully this will be a big wakeup call for the big 2. I hope those that have the power to do something about this have been "educated" now.

