SIAO to Search engines -- would you please normalize, already?
by Michael S. Kaplan, published on 2005/11/15 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/15/492301.aspx
Suzanne has been riffing on me in relation to Vietnamese and then she shifted over to talk about Google and other languages, so I thought I would riff off of her a bit. :-)
By the way Suzanne -- I did not find your terminology to be inaccurate; it was just different. I was explaining why I was confused!
I am going to take her string bãi biển from that first article and run it through various search engines in various forms, trying to look for some patterns. I am not using Suzanne's test (using Google Images to look for pictures of beaches) since not all of the search engines have a comparable service. It is pretty easy to tell from the text excerpts whether or not the search has found Vietnamese sites....
First, the engines:
Second, the strings to test:
- #1, Normalization Form C: bãi biển (0062 00e3 0069 0020 0062 0069 1ec3 006e)
- #2, Intermediate form: bãi biển (0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e)
- #3, Intermediate form without accents: bai biên (0062 0061 0069 0020 0062 0069 00ea 006e)
- #4, Normalization Form D: bãi biển (0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e)
- #5, Normalization Form D without accents: bai bien (0062 0061 0069 0020 0062 0069 0065 006e)
Now if you look at these five strings being tested (numbered in the order listed), #1, #2, and #4 are all canonically equivalent and thus should give identical results in search engines that conform to Unicode and its principles of canonical equivalence.
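For anyone who wants to check the equivalence claim rather than take my word for it, here is a minimal sketch using Python's standard `unicodedata` module. The strings are built from the code points listed above; the helper names `from_hex` and `equivalence_classes` are made up for this illustration. It groups the five strings by their NFD form, so strings that land in the same group are canonically equivalent.

```python
import unicodedata


def from_hex(codes: str) -> str:
    """Build a string from space-separated hexadecimal code points."""
    return "".join(chr(int(code, 16)) for code in codes.split())


# The five test strings, keyed by their position in the list above.
STRINGS = {
    1: from_hex("0062 00e3 0069 0020 0062 0069 1ec3 006e"),
    2: from_hex("0062 0061 0303 0069 0020 0062 0069 00ea 0309 006e"),
    3: from_hex("0062 0061 0069 0020 0062 0069 00ea 006e"),
    4: from_hex("0062 0061 0303 0069 0020 0062 0069 0065 0302 0309 006e"),
    5: from_hex("0062 0061 0069 0020 0062 0069 0065 006e"),
}


def equivalence_classes(strings):
    """Group string numbers whose NFD forms are identical."""
    classes = {}
    for number, text in strings.items():
        key = unicodedata.normalize("NFD", text)
        classes.setdefault(key, []).append(number)
    return sorted(classes.values())


print(equivalence_classes(STRINGS))  # → [[1, 2, 4], [3], [5]]
```

Grouping by NFC instead of NFD gives the same classes; either form works as a comparison key as long as both sides of the comparison use the same one.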
I will put the strings in double quotes for all search engines.
And here are the comparison results:
- Excite (using the default settings) does not understand Unicode from UNICEF, and the only reason it had 42 hits from Form C was that it munged that second word into "bi" and linked to inappropriate sites that I would never really be interested in looking at, even if they were in Vietnamese. This is (by the way) why the link is not live. :-)
- Excite (using advanced search, which defaults to all languages) fared a little better, with Suzanne's post high up on the list for #2 and #5.
- Lycos (which I did not put in the table) returned results that were practically identical to ask.com.
- That one link on ask.com and lycos.com for #3 is to Suzanne's post, which I think is pretty funny. :-)
- Both altavista and yahoo appear to be using Unicode normalization, returning identical results for all canonically equivalent forms.
- Google appears to be stripping combining characters out of either its search strings, its indexes, or both.
- Any claim that Google is normalizing appears to be crap -- at least insofar as one considers Unicode normalization. They are doing their own thing rather than the standard. But then they are only Associate members of Unicode, so I guess they aren't in all the way just yet....
- Microsoft (msn, start, and live) have a whole bunch of work to do and I am having trouble fathoming what precisely they are doing.
- Not enough of the Search community is taking the importance of canonical equivalence seriously, to the detriment of many language communities, including Vietnamese.
- Until that time, a better keyboard solution for Vietnamese in particular suddenly seems more and more compelling.
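To make the difference between the two behaviors concrete, here is a rough Python sketch. It is not how any of these engines actually work (I have no visibility into that); it only contrasts a hypothetical engine that normalizes its keys with a hypothetical one that drops combining characters from whatever it is handed. The function names `normalized_key` and `stripped_key` are invented for the example.

```python
import unicodedata

# Three canonically equivalent spellings of the test string.
FORM_C = "b\u00e3i bi\u1ec3n"
INTERMEDIATE = "ba\u0303i bi\u00ea\u0309n"
FORM_D = "ba\u0303i bie\u0302\u0309n"


def normalized_key(text: str) -> str:
    """Hypothetical engine A: compare on the NFC form of the text."""
    return unicodedata.normalize("NFC", text)


def stripped_key(text: str) -> str:
    """Hypothetical engine B: drop combining marks without normalizing first."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


forms = [FORM_C, INTERMEDIATE, FORM_D]

# Normalization maps all three spellings to one key.
print(len({normalized_key(f) for f in forms}))  # → 1

# Stripping marks from the raw text leaves three different keys,
# because the precomposed letters survive the stripping untouched.
print(len({stripped_key(f) for f in forms}))  # → 3
```

Under these assumptions the stripped keys come out as the Form C string unchanged, the intermediate form without accents, and the Form D string without accents, which is why canonically equivalent queries can end up finding different pages.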
This post brought to you by "ổ" (U+1ed5, a.k.a. LATIN SMALL LETTER O WITH CIRCUMFLEX AND HOOK ABOVE)
# Andrew West on 15 Nov 2005 5:01 AM:
"Andrew West's claim that Google is normalizing appears to be crap"
NOT ME! "Andrew C." is someone else ... I don't make crap claims ;)
... and anyway he claimed that Google was *not* normalizing.
# Michael S. Kaplan on 15 Nov 2005 6:17 AM:
Sincerest apologies, Andrew -- I have removed the bogus reference to you. It was Simon who was trying to make the claim (I am not sure who Simon is here).
Correction made, and again I am very sorry.
# Suzanne McCarthy on 15 Nov 2005 7:02 PM:
Isn't Simon's claim about Greek correct? Normalization is happening for some sequences but not others. I get the same results in French and German with and without the precomposed 'vowels plus diacritics'.
# Suzanne McCarthy on 15 Nov 2005 7:09 PM:
BTW I forgot to add that I really appreciate this little experiment that you created here. Thank you, Mike.
# Michael S. Kaplan on 15 Nov 2005 7:30 PM:
If I had to guess, I would say they are not using Unicode normalization at all -- they are building their own homegrown system that happens to work in some cases but not others....
# Michael S. Kaplan on 15 Nov 2005 8:19 PM:
You are very welcome, Suzanne. It was fun. :-)
# Jim on 17 Nov 2005 5:44 AM:
Ahem, a microsoft employee rubbishing google through speculation...
# Michael S. Kaplan on 17 Nov 2005 8:31 AM:
This is not rubbish -- test it yourself. Canonically equivalent Unicode forms do not find the same pages.
If I were trying to diss just Google I would not have pointed out that Microsoft is also guilty though, obviously. Both of them need to do this.
# Mike on 17 Nov 2005 7:46 PM:
Very good info - hopefully this will be a big wakeup call for the big 2. I hope those that have the power to do something about this have been "educated" now.