The search for someone who does Search correctly

by Michael S. Kaplan, published on 2007/02/26 03:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/26/1761051.aspx

Thinking about the issues involved with à ≠ a (unless à = a) made me think back to other posts where I mused about possible improvements in the search experience, such as What about search for kids?, to give an example.

No one from Google or Microsoft or anywhere else seems to be taking any of the suggestions I have made over the past couple of years (hell, they won't even fix bugs like this one so if they have no time to fix bugs then they clearly have no time to take suggestions!).

for a moment, even though only 33⅓% of it or less of those points might actually be true. :-)

Now the fundamental reason why someone who is French might want to be able to look for e and find é while a Swede might be bothered that looking for a would find å has nothing to do with greater sensitivity for language; it has to do with the same issue that drives the fact that You can't ignore diacritics when a language does not give them diacritic weight.

(This same point was made in the comments of the first post, such as the one from Wilhelm.)

In the end, the French sort treats é as an e with a bit of extra weight on it just like someone speaking English might while the Swedes treat å like a whole different letter, just like the Polish will treat ą like a different letter and so on. The way people think about language drives not only how they might expect search to be able to work but also how annoyed they will be if their expectations are not met.
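To make the contrast concrete, here is a minimal Python sketch of a "language world view" applied to matching. It is only an illustration: the `DISTINCT_LETTERS` table, the `fold` and `matches` helpers, and the language codes are all hypothetical names invented for this example, and the letter lists are a rough stand-in for real collation data rather than anything a search engine actually ships.

```python
import unicodedata

# Hypothetical per-language lists of letters that speakers treat as letters
# in their own right, so they must never be folded to a base letter.
# Illustrative only; this is not real collation data.
DISTINCT_LETTERS = {
    "fr": set(),                          # accents only add extra weight
    "sv": set("åäöÅÄÖ"),                  # å, ä, ö are separate letters
    "pl": set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"),      # likewise for Polish
}

def fold(text, lang):
    """Strip combining marks, except on letters the language keeps distinct."""
    keep = DISTINCT_LETTERS.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out)

def matches(query, candidate, lang):
    """Compare two strings under the given language's view of diacritics."""
    return fold(query, lang).casefold() == fold(candidate, lang).casefold()

print(matches("e", "é", "fr"))  # French view: é is an e with extra weight
print(matches("a", "å", "sv"))  # Swedish view: å is a different letter
```

The point of keying the table on the searcher's language rather than the content's language is exactly the separation between content preference and search bias discussed in this post.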

These preferences do not always align with the content language since on the Internet they could be searching for anything. Perhaps a Dane will be less annoyed about a mistake related to ä if the content is not Danish, but they may still be a little annoyed at being mistaken for Germans (or even worse for Americans!), even if the annoyance is subconscious.

Now currently, search engines let people choose preferred languages for content, but there is no notion of preferred language bias for search that is separate.

This is a bad thing for MANY Latin script, Cyrillic script, really any script shared between languages with different ideas on how collation ought to work. And no one seems to be able to handle the notion. Worse, no one seems to even realize the notion exists.

Yet with simple knowledge of the "language world view" that a person has, it would be much easier to help them both find what they want and not be given results that clearly do not match.

Beyond that, and thinking to the questions Laurent raised, are there additional options that could be made available?

For both ideas, it is not a simple matter of wanting or not wanting to conditionally ignore diacritics; no one would disagree that if you do include the diacritic you probably want them preferred. The real question is how you feel about the non-diacritic letters in those cases, and how you feel about both kinds of letters when you search without the diacritics.

Add to all of this the fact that very few people can describe their preferences here, for good measure.

Okay, now that I have some of the ingredients in the pot, I'll let them simmer in your minds for a bit. So if you were the one on the hook to talk to Yahoo or Google or Microsoft or whoever (or if they were reading here), what would you tell them? How would you tell them to try to improve the International search experience?

I am not going to claim that Microsoft is doing better here, because as far as I can tell, EVERYONE sucks at the moment. The search for the people who can do the right thing here for search is still on....

This is a topic I will be coming back to, in other collation-type contexts as well. :-)

This post brought to you by ä (U+00e4, a.k.a. LATIN SMALL LETTER A WITH DIAERESIS)

For the most part Google's search works for me in this area, except where it goes overboard in assuming that because I'm looking for something with an å in it, I might want my webpage UI language to change as well. (Google completely ignores user preferences and will use IP ranges to change your UI language, something that I and others have griped about on their forums.)

But let's say I'm looking for a single word which happens to have a diacritic in its original "standard" form. Using my earlier example, I'll search on "Djurgaten" because a) I can't think how to compose an å on an English language keyboard, or b) hunting through CharMap for the symbol is just immensely inefficient, or even c) I don't know there's an å in the word, because none of my English-language tourist books etc. use it.

So I type in "Djurgaten" and the search results bring up all the EXACT matches, but at the top right, there is a list of near-matches (like "Djurgåten") found by some normalisation heuristic/soundex/lexical lookup, with a checkbox next to each that allows me to include it as a search term, or even to search on all of them ("Djurg?ten" etc.).
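A rough Python sketch of that "exact matches plus a list of near-matches" idea follows. It assumes the simplest possible heuristic, grouping indexed terms by a key with combining marks stripped; the soundex and lexical lookups mentioned above are not attempted. The names `fold_key`, `build_index` and `suggest`, and the tiny term list, are made up for illustration.

```python
import unicodedata
from collections import defaultdict

def fold_key(term):
    """Key used to group variants: combining marks stripped, case folded."""
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()

def build_index(terms):
    """Map each folded key to the set of spellings actually seen."""
    index = defaultdict(set)
    for term in terms:
        index[fold_key(term)].add(unicodedata.normalize("NFC", term))
    return index

def suggest(query, index):
    """Return (exact match or None, sorted near-matches to offer as checkboxes)."""
    query = unicodedata.normalize("NFC", query)
    variants = index.get(fold_key(query), set())
    exact = query if query in variants else None
    near = sorted(v for v in variants if v != query)
    return exact, near

# Hypothetical indexed terms, taken from the examples in this thread.
index = build_index(["Djurgaten", "Djurgåten", "Göd", "God"])
print(suggest("Djurgaten", index))
```

The exact form is searched as typed, and the near-matches are only offered, never silently merged in, which is the behaviour being asked for here.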

NB certain special-purpose search sites do helpful query expansions as a matter of course: I'm thinking of some of the family-tree websites I use. Some do a soundexy thing that gets the inevitable typos sorted, and others use real-world knowledge.

I also designed query-expansion technology based on real-world concepts/thesauri that was initially used in Picture It! nearly a decade ago and then subsequently in Office Clip Art Gallery. If at first you don't find the exact string match, loosen* the query terms in a helpful way so that the end-user can find something of value, even if it simply suggests to them that alternate search values are needed. (*This is done automatically until at least one match is found, or the user can keep hitting "Give me more".) Users are much more forgiving of false positives than of zero/small result sets.
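The loosening loop described above can be sketched in Python along these lines. To be clear about the assumptions: the original technology loosened queries using real-world concepts and thesauri, which is not reproduced here; this sketch swaps in case and diacritic loosening simply because that is the subject of this thread. `STAGES`, `strip_marks` and `loosened_search` are hypothetical names, and matching is on whitespace-separated words only.

```python
import unicodedata

def strip_marks(s):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

# Each stage is (label, key function); later stages are progressively looser.
STAGES = [
    ("exact", lambda s: unicodedata.normalize("NFC", s)),
    ("ignore case", lambda s: unicodedata.normalize("NFC", s).casefold()),
    ("ignore case and diacritics", lambda s: strip_marks(s).casefold()),
]

def loosened_search(query, documents, start=0):
    """Try each stage from `start` on; stop at the first one with a hit.

    Returns (stage number, stage label, matching documents), or
    (None, None, []) if even the loosest stage finds nothing.
    """
    for level in range(start, len(STAGES)):
        label, key = STAGES[level]
        wanted = key(query)
        hits = [d for d in documents if wanted in key(d).split()]
        if hits:
            return level, label, hits
    return None, None, []

docs = ["Hotels in Göd", "god of thunder"]
level, label, hits = loosened_search("God", docs)
print(label, hits)
# "Give me more": rerun starting one stage looser than the one that matched.
print(loosened_search("God", docs, start=level + 1))
```

Returning the stage label along with the hits lets the UI tell the user how far the query had to be loosened, so false positives at least come with an explanation.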

As part of the above search we also a) handled spelling variations and b) spellchecked the search terms with red squiggle feedback, long before Google provided "Do you really mean....?"

Well it's fine in that Google models my usage as a primarily English-language person, which (for a change) Microsoft doesn't.

I can even use Google's "-" syntax (sadly missing in Microsoft's search) to filter out unwanted hits caused by normalisation. A real case (from about 5 minutes ago) is looking for the town of Göd in Hungary. If I just search on Göd, then the results are overwhelmingly about God. However I can use "Göd -God" to get a more useful result set.

If I search on Göd in MSN, then it normalises it with Goed, and the results are actually dominated by hits on the acronymic GOED. I didn't find any accurate hits in the first 5 pages of results although the first contextual ad is for "Hotels in Göd". I have to search on "Göd Hungary" to get what I want.

Some Google experiments for Romanian (using București, the Romanian name for Bucharest):

U+0073 s

Bucuresti 26,500,000 results
"Bucuresti" 26,200,000 results

U+015F ş

Bucureşti 27,800,000 results
"Bucureşti" 6,580,000 results

U+0219 ș

București 28,100,000 results
"București" 11,100 results

Several things become visible:

without quotes the searches offer pretty much the same results
some results seem illogical. How can București find 28.1M answers, and Bucuresti 26.5M? What affects the order? Why does the rarest form (București) find more results than the more popular ones?
there is a huge number of sites that don't use diacritics at all
people started using U+0219. This is amazing, considering that many fonts don't even have the glyph yet, and the whole tool-chain should be Unicode.
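For anyone wondering why the three spellings behave as different strings in quoted searches, a small Python check with the standard `unicodedata` module shows how the two accented forms decompose. The helper names (`describe`, `strip_marks`, `same_after_folding`) are just for this illustration, and nothing here claims to be what Google does internally.

```python
import unicodedata

S_CEDILLA = "\u015f"      # ş, the older substitute form
S_COMMA_BELOW = "\u0219"  # ș, the form with comma below

def describe(ch):
    """Return the character name and the code points of its NFD form."""
    nfd = unicodedata.normalize("NFD", ch)
    return unicodedata.name(ch), [f"U+{ord(c):04X}" for c in nfd]

def strip_marks(text):
    """Remove combining marks after canonical decomposition."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(c for c in nfd if not unicodedata.combining(c))

def same_after_folding(a, b):
    """True if two strings are equal once marks and case are ignored."""
    return strip_marks(a).casefold() == strip_marks(b).casefold()

for ch in ("s", S_CEDILLA, S_COMMA_BELOW):
    print(f"U+{ord(ch):04X}", *describe(ch))

cedilla_form = "Bucure" + S_CEDILLA + "ti"
comma_form = "Bucure" + S_COMMA_BELOW + "ti"
# Canonical normalization keeps the two spellings distinct...
print(unicodedata.normalize("NFC", cedilla_form) ==
      unicodedata.normalize("NFC", comma_form))
# ...but both collapse to the unaccented spelling once marks are stripped.
print(same_after_folding(cedilla_form, "Bucuresti"),
      same_after_folding(comma_form, "Bucuresti"))
```

The two letters decompose to the same base letter with different combining marks (cedilla versus comma below), so Unicode normalization alone never unifies them; only an explicit folding step does, which would fit the unquoted searches above returning much the same results.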

So, at least for Romanian, I would say Google does a good job.