Hats off to David Beaver

by Michael S. Kaplan, published on 2005/05/30 02:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/30/423136.aspx


Or should I say, Háts off to David Beaver? :-)

Over on the Language Log, David noticed some interesting issues with Google search in his post entitled PASS THE HÁT.

The basic issue comes up with Google's HYPHEN-MINUS operator. According to their documentation on negative terms, it behaves as follows:

If your search term has more than one meaning (bass, for example, could refer to fishing or music) you can focus your search by putting a minus sign ("-") in front of words related to the meaning you want to avoid.

For example, here's how you'd find pages about bass-heavy lakes, but not bass-heavy music:

bass -music

Note: when you include a negative term in your search, be sure to include a space before the minus sign.

They make it sound simple, don't they? What David noted is that it is not: the terms do not combine in a purely additive and subtractive way. How the operator interacts with diacritics, or even with plain unaccented words, is really quite fascinating.
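To see why diacritics complicate a simple subtractive operator, here is a toy sketch in Python. It says nothing about how Google actually works; the `fold` and `search` functions, the `fold_terms` flag, and the three sample pages are all invented for illustration. It only compares two hypothetical indexes, one that matches terms exactly and one that strips accents before matching, on the query hát -hat.

```python
import unicodedata


def fold(term: str) -> str:
    """Lowercase and strip combining marks, so 'hát' and 'hat' compare equal."""
    decomposed = unicodedata.normalize("NFD", term.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def search(pages, query, fold_terms):
    """Toy search: every plain term must appear, no '-term' may appear.

    fold_terms chooses between accent-insensitive and exact matching.
    This is an assumption for illustration, not Google's algorithm.
    """
    key = fold if fold_terms else str.lower
    wanted, unwanted = set(), set()
    for token in query.split():
        if token.startswith("-") and len(token) > 1:
            unwanted.add(key(token[1:]))
        else:
            wanted.add(key(token))
    hits = []
    for name, text in pages.items():
        words = {key(w) for w in text.split()}
        if wanted <= words and not (unwanted & words):
            hits.append(name)
    return sorted(hits)


# Hypothetical three-page corpus.
PAGES = {
    "a": "pass the hat",
    "b": "pass the hát",
    "c": "hat and hát together",
}

exact = search(PAGES, "hát -hat", fold_terms=False)
folded = search(PAGES, "hát -hat", fold_terms=True)
print(exact)   # ['b']
print(folded)  # []
```

With exact matching the query behaves the way the documentation suggests, keeping only the page that has hát without hat. With accent folding the positive and negative terms collapse into the same key and cancel each other out. Neither toy model returns anything for a query like achete -achete, so whatever produces the results David saw is evidently something else again.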

Knowing what I do about collation and indexes, I could spend a lot of time in this area, trying to reverse engineer both the algorithm and the indexes being used by looking at the results. But it is not quite that interesting -- I have my own collation implementation to think about, after all. :-)

But it is still fascinating to contemplate. My favorite is also the one David seems to like best:

Or achete -achete: infantile as I am, I really like this one since it produces 1,890,000 hits, while Google helpfully suggests the alternative acheter -acheter, which produces no hits, surely a new record of bad performance for a search enhancing feature.

It could add fascinating twists to GoogleFight, at least....

You can try playing with other operators, and wonder why a -a gets over 44,000,000 hits while +a -a gets none (especially since the plus sign is meant to force the inclusion of words that are usually ignored, which a usually is)....

