by Michael S. Kaplan, published on 2007/02/26 03:31 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/26/1761051.aspx
Thinking about the issues involved with à ≠ a (unless à = a) made me think back to other posts where I mused about possible improvements in the search experience, such as What about search for kids?
No one from Google or Microsoft or anywhere else seems to be taking any of the suggestions I have made over the past couple of years (hell, they won't even fix bugs like this one so if they have no time to fix bugs then they clearly have no time to take suggestions!).
But let's pretend for just a moment that I am
for a moment, even though only 33⅓% of it or less of those points might actually be true. :-)
Now the fundamental reason why someone who is French might want to be able to look for e and find é while a Swede might be bothered that looking for a would find å has nothing to do with greater sensitivity for language; it has to do with the same issue that drives the fact that You can't ignore diacritics when a language does not give them diacritic weight.
(This same point was made in the comments of the first post, such as the one from Wilhelm.)
In the end, the French sort treats é as an e with a bit of extra weight on it just like someone speaking English might while the Swedes treat å like a whole different letter, just like the Polish will treat ą like a different letter and so on. The way people think about language drives not only how they might expect search to be able to work but also how annoyed they will be if their expectations are not met.
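To make the French/Swedish/Polish contrast concrete, here is a minimal Python sketch of diacritic folding keyed on the searcher's language rather than the content language. The `DISTINCT_LETTERS` table and the `fold` and `matches` helpers are hypothetical names invented for illustration, and the letter lists are rough assumptions, not real collation data from any platform.

```python
import unicodedata

# Hypothetical, illustrative table: letters that speakers of each language
# treat as separate letters rather than as "a base letter plus an accent".
DISTINCT_LETTERS = {
    "fr": set(),                      # accents only add secondary weight
    "sv": set("åäöÅÄÖ"),              # å, ä, ö are letters in their own right
    "pl": set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ"),
}

def fold(text, lang):
    """Strip combining marks, except on letters the language keeps distinct."""
    keep = DISTINCT_LETTERS.get(lang, set())
    out = []
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in decomposed
                               if not unicodedata.combining(c)))
    return "".join(out).casefold()

def matches(query, candidate, lang):
    """True when query and candidate are the same under the language's view."""
    return fold(query, lang) == fold(candidate, lang)

print(matches("e", "é", "fr"))  # French searcher: e finds é
print(matches("a", "å", "sv"))  # Swedish searcher: a does not find å
```

The only input that changes between the two calls is the language of the person searching, which is the "preferred language bias for search" idea, kept separate from the language of the content.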
These preferences do not always align with the content language since on the Internet they could be searching for anything. Perhaps a Dane will be less annoyed about a mistake related to ä if the content is not Danish, but they may still be a little annoyed at being mistaken for Germans (or even worse for Americans!), even if the annoyance is subconscious.
Now currently, search engines let people choose preferred languages for content, but there is no notion of preferred language bias for search that is separate.
This is a bad thing for MANY Latin script and Cyrillic script languages, and really for any script shared between languages with different ideas on how collation ought to work. And no one seems to be able to handle the notion. Worse, no one seems to even realize the notion exists.
Yet with simple knowledge of the "language world view" that a person has it would be much easier to help them both find what they want and not be given results that clearly do not match.
Beyond that, and thinking to the questions Laurent raised, are there additional options that could be made available?
For both ideas, it is not a simple matter of wanting or not wanting to conditionally ignore diacritics; no one would disagree that if you do include the diacritic, you probably want matches that have it preferred. The real question is how you feel about the non-diacritic letters in those cases, and how you feel about both kinds of letters when you search without the diacritics.
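One way to picture "preferred, but not exclusive" is a ranking step rather than a filter. The sketch below is an assumption about how that could look, not a description of what any search engine does; `strip_marks` and `rank` are hypothetical helper names.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks (accents) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def rank(query, candidates):
    """Keep every diacritic-insensitive match, but list exact matches first."""
    q = unicodedata.normalize("NFC", query).casefold()
    folded_q = strip_marks(q)
    hits = [c for c in candidates
            if strip_marks(c).casefold() == folded_q]
    # sorted() is stable, so candidates that tie keep their input order.
    return sorted(
        hits,
        key=lambda c: unicodedata.normalize("NFC", c).casefold() != q,
    )

print(rank("résumé", ["resume", "résumé", "resumes"]))
print(rank("resume", ["résumé", "resume", "resumes"]))
```

Whether the second list should contain the accented form at all is exactly the open question, and the answer plausibly depends on the searcher's language.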
Add to all of this the fact that very few people can describe their preferences here, for good measure.
Okay, now that I have some of the ingredients in the pot, I'll let them simmer in your minds for a bit. So if you were the one on the hook to talk to Yahoo or Google or Microsoft or whoever (or if they were reading here), what would you tell them? How would you tell them to try to improve the International search experience?
I am not going to claim that Microsoft is doing better here, because as far as I can tell, EVERYONE sucks at the moment. The search for the people who can do the right thing here for search is still on....
This is a topic I will be coming back to, in other collation-type contexts as well. :-)
This post brought to you by ä (U+00e4, a.k.a. LATIN SMALL LETTER A WITH DIAERESIS)
# Mike Williams on 26 Feb 2007 6:52 AM:
For the most part Google's search works for me in this area, except where it goes overboard in assuming that because I'm looking for something with an å in it might mean that I might want my webpage UI language to change as well. (Google completely ignores user preferences and will use IP ranges to change your UI language, something that I and others have griped about on their forums).
But let's say I'm looking for a single word which happens to have a diacritic in its original "standard" form. Using my earlier example, I'll search on "Djurgaten" because a) I can't think how to compose a å on an English language keyboard, or b) hunting through CharMap for the symbol is just immensely inefficient, or even c) I don't know there's an å in the word, because none of my English-language tourist books etc use it.
So I type in "Djurgaten" and the search results bring me up all the EXACT matches, but at the top right, it has a list of near-matches (like "Djurgåten") it has found based on some normalisation heuristic/soundex/lexical lookup, with a checkbox next to each that allows me to include that as a search term, or even to search on all of them "Djurg?ten" etc.
NB certain special-purpose search sites do helpful query expansions as a matter of course: I'm thinking of some of the family-tree websites I use. Some do a soundexy thing that gets the inevitable typos sorted, and others use real-world knowledge.
I also designed query-expansion technology based on real-world concepts/thesauri that was initially used in Picture It! nearly a decade ago and then subsequently in Office Clip Art Gallery. If at first you don't find the exact string match, loosen* the query terms in a helpful way so that the end-user can find something of value, even if it simply suggests to them that alternate search values are needed. (*This is done automatically until at least one match is found, or the user can keep hitting "Give me more".) Users are much more forgiving of false positives than of zero/small result sets.
As part of the above search we also a) handled spelling variations and b) spellchecked the search terms with red squiggle feedback, long before Google provided "Do you really mean....?"
# Michael S. Kaplan on 26 Feb 2007 11:01 AM:
For the most part Google's search works for me in this area...
Well, until you have complaints that exactly match the problems I am talking about with the confusion of content language/search language, the bad rules about ignoring diacritics or not, and so on.
So in other words, everything is fine until it isn't. :-)
# Mike Williams on 26 Feb 2007 11:17 AM:
Well it's fine in that Google models my usage as a primarily English-language person, which (for a change) Microsoft doesn't.
I can even use Google's "-" syntax (sadly missing in Microsoft's search) to filter out unwanted hits caused by normalisation. A real case (from about 5 minutes ago) is looking for the town of Göd in Hungary. If I just search on Göd, then the results are overwhelmingly about God. However I can use "Göd -God" to get a more useful result set.
If I search on Göd in MSN, then it normalises it with Goed, and the results are actually dominated by hits on the acronymic GOED. I didn't find any accurate hits in the first 5 pages of results although the first contextual ad is for "Hotels in Göd". I have to search on "Göd Hungary" to get what I want.
# Michael S. Kaplan on 26 Feb 2007 2:07 PM:
Meanwhile, back to the rest of the world....
Any comments from native speakers of other languages? I think there has been entirely too much emphasis on English (not just in prolific commenters, but also in multimillion/billion dollar companies!).
# Michael S. Kaplan on 26 Feb 2007 5:13 PM:
And to get back to the incorrect invocation of deity, most platforms do okay with "Göd" if you change the search language to Hungarian, the problem there being that it is also the UI language, which points back to [one of] the same fundamental problems I am talking about here.... :-)
# Mihai on 26 Feb 2007 5:53 PM:
Actually, on Google "Göd" (with quotes) will give exact matches (same as Göd -God, but in different order, for some reason)
# Mihai on 26 Feb 2007 6:08 PM:
Some Google experiments for Romanian (using București, the Romanian name for Bucharest):
Several things become visible:
So, at least for Romanian, I would say Google does a good job.
# Michael S. Kaplan on 26 Feb 2007 8:42 PM:
Hmmmm... it looks like it just always strips them in this case? Weird....
# Pavanaja U B on 27 Feb 2007 2:10 AM:
Searching for हिंदी and हिन्दी (Hindi in Devanagari script) gives different results in all search engines. Linguistically, both are the same.
# Michael S. Kaplan on 27 Feb 2007 2:26 AM:
U+0939 U+093f U+0928 U+094d U+0926 U+0940
U+0939 U+093f U+0902 U+0926 U+0940
Is ANUSVARA really expected to be equivalent to NA + DA?
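For anyone who wants to check this, a short Python sketch using the standard `unicodedata` module compares the two sequences above under each Unicode normalization form. The constant and function names are made up for illustration.

```python
import unicodedata

# The two spellings, built from the code points listed above.
HINDI_NA_VIRAMA = "\u0939\u093f\u0928\u094d\u0926\u0940"  # हिन्दी
HINDI_ANUSVARA = "\u0939\u093f\u0902\u0926\u0940"         # हिंदी

def equal_under(form):
    """Compare the two spellings after applying one normalization form."""
    return (unicodedata.normalize(form, HINDI_NA_VIRAMA)
            == unicodedata.normalize(form, HINDI_ANUSVARA))

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    print(form, equal_under(form))
```

None of the four forms makes the strings equal, since ANUSVARA has no decomposition to NA + VIRAMA. Any equivalence between the two spellings would therefore have to come from a language-specific folding rule in the search layer, not from Unicode normalization.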
# darcy toberty on 25 Jan 2008 12:31 AM:
lived in yorba linda calif in 1978