by Michael S. Kaplan, published on 2005/12/31 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/31/508290.aspx
I have been fairly critical of both Google's and Microsoft's search engines in the area of searching on many different language and Unicode issues (e.g. the Sorting It All Out to Search Engines post). I personally look forward to the day when I can praise the work that either or both search engines do to better support the work that their corporate entities pay USD $12,000 and USD $2,000 a year for, respectively. It is a little embarrassing that neither one of them is doing better here....
Now folks over on Language Log have on many occasions used search engines like Google to quickly look at frequency of use and other linguistic issues. They have never lost sight, however, of the fact that there are various limitations to this approach (good examples are Benjamin Zimmer's post entitled Googlinguistics: the good, the bad, and the ugly and Mark Liberman's post entitled More arithmetic problems at Google).
I just bumped into another limitation in the last month though, one I thought I'd blog about....
In this last month I have run across people reporting potential limitations/issues/bugs in language/locale-specific formatting, keyboards, locale data, calendars, and/or collation for Georgian, Armenian, Latvian, Japanese, Korean, Macedonian, and others. And although I am not a linguist, I do have those pesky delusions of linguistic aptitude, so I tried to do a little research on many of these issues.
What I found was that it is hard to isolate what is 'done in the wild' in order to see whether Microsoft is doing the right thing, since Microsoft's products are such a large part of 'the wild' in this context. Search engines, which index the web, can't really make that separation, since there is no explicit marking of content that would tell them the difference. And even if there were, it would not be a meaningful distinction: there is no way to separate 'correct' usage from 'Microsoft' usage when the Microsoft usage may in fact be correct!
I was a bit staggered by the fact that the very popularity of the platform made it more challenging to research questions about the platform.
Isn't William Shatner the one who said "Irony can be pretty ironic, sometimes"? :-)
In any case, I do not mind that I had to do a somewhat more formal job of the research in an actual library; it took me back to when I was in school. And I have kept my delusions, since when I reported back on what I had found, people were both receptive and encouraging.
Though I realize that the days when the library will work could also be numbered, as libraries fight to stay relevant in the eyes of people who find a Google or a Wikipedia search to be simply easier than using a library.
In the wider sense, I realized this is not just a Microsoft problem. I mean, can any search engine hope to answer generic questions about whether Google is finding the correct result sets? If it were not for issues such as Unicode canonical equivalence that are beyond the reach of just the result set, then I wouldn't really have a reason to criticize except on individual search results. And that would seem just plain silly to most people.
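For anyone who has not run into canonical equivalence before, here is a minimal sketch of why it is more than a matter of individual search results. It uses Python's standard unicodedata module; the sample word and the two helper functions (naive_match and canonical_match) are purely illustrative and are not taken from any search engine's actual implementation.

```python
import unicodedata

# Two spellings of the same word that Unicode treats as canonically equivalent:
# one uses the precomposed U+00E9, the other "e" followed by U+0301
# (COMBINING ACUTE ACCENT).
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"


def naive_match(query: str, text: str) -> bool:
    """Hypothetical matcher that compares raw code points only."""
    return query == text


def canonical_match(query: str, text: str) -> bool:
    """Hypothetical matcher that normalizes both sides to NFC first."""
    return unicodedata.normalize("NFC", query) == unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    print(naive_match(precomposed, decomposed))      # code points differ
    print(canonical_match(precomposed, decomposed))  # equal after normalization
```

A matcher along the lines of the first function would miss every document that happened to use the other form, and nothing in the result set itself would tell you that anything was missing.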
I guess there is a wider truth here that everyone realizes -- popularity makes it difficult to find objectivity.
If you had told me that a month ago, a year ago, or ten years ago, I would have said "Duh!" So why it is such a shock when dressed in other clothing is beyond me....
This post brought to you by "ओ" (U+0913, a.k.a. DEVANAGARI LETTER O)