by Michael S. Kaplan, published on 2006/08/05 10:24 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/08/05/689506.aspx
No, this post is not making the point that Chris Wetherell did and folks picked up on a couple of years ago, like here. Let me explain....
So I found myself IM'ing with Melanie Spiller as I have been doing intermittently since that dinner last month, and thinking about how refreshing (and entertaining!) it is every once in a while to have a whole new source of information. Like she pointed out a bumper sticker she had seen that she thought I might enjoy:
What if the hokey pokey is what it's really all about?
That is freaking hilarious, in my opinion. Perhaps you don't agree, and I won't argue with you since I have no sense of humor, really. But if I was ever going to put a bumper sticker on my car, that would be it. I even wrote it on my white board and Kieran, who happened to be passing by, agreed that it was pretty awesome.
So what does this have to do with Google not getting blogs? :-)
Well, I did that ubiquitous thing that is going to make Google lose their trademark someday just like Xerox might -- I google'd this phrase. And the results were surprising to me:
1 - 12 of about 173? WTF?
Clearly Google is smart enough to recognize that there is a pattern but not smart enough to identify that it might be due to the modern equivalent of pages with frames that happen to share the same frame text -- not smart enough to point out which links have the repeats. The subsidiary info that every blog page might have? It can't call out why they might be the same, why it might have lumped them all together?
This is not proof that Google doesn't get blogs, by the way. If anything it is proof that they do get blogs, at least in the sense that they can see patterns and such. So what am I rambling on about?
Well, I took another phrase, one that appears in the disclaimer text of my own blog:
not for use on unexplained calf pain.
and I google'd it, with the following results:
1 - 1 of about 24,100? Serious WTF?
How did it pick out over 24,000 pages in a blog that has only a bit over 1,200 posts? And how did it decide that the Cantonese IME post is the "most relevant" of the about 24,100 that it had?
Simple. You can see it yourself if you click on the link to repeat the search with the omitted results included and scan around the results a bit. Though of course this post may throw the balance off a bit!
It is counting every page. And every month link in the archives on every page. And every category link on every page. And so on.
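To make the inflation concrete, here is a minimal sketch of the effect, not a description of how Google actually works. The page paths, page text, and both helper functions are made up for illustration; the only assumption is that a line appearing on every page of a site is template text (like a disclaimer) rather than post content.

```python
from collections import Counter

def shared_lines(pages):
    """Return the lines that appear on every page (assumed to be template text)."""
    counts = Counter()
    for text in pages.values():
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in text.splitlines() if line.strip()})
    return {line for line, n in counts.items() if n == len(pages)}

def count_hits(pages, phrase, ignore_shared=False):
    """Count the pages that contain the phrase, optionally skipping template lines."""
    boilerplate = shared_lines(pages) if ignore_shared else set()
    needle = phrase.lower()
    hits = 0
    for text in pages.values():
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        body = [line for line in lines if line not in boilerplate]
        if any(needle in line.lower() for line in body):
            hits += 1
    return hits

# Hypothetical site: three pages only carry the disclaimer, one actually discusses it.
phrase = "not for use on unexplained calf pain"
pages = {
    "/post/1": "Cantonese IME notes\nnot for use on unexplained calf pain.",
    "/post/2": "Some other post\nnot for use on unexplained calf pain.",
    "/archive/2006/08": "August 2006\nnot for use on unexplained calf pain.",
    "/post/3": "Why the disclaimer says not for use on unexplained calf pain\n"
               "not for use on unexplained calf pain.",
}

naive = count_hits(pages, phrase)
filtered = count_hits(pages, phrase, ignore_shared=True)
print(naive, filtered)
```

The naive count treats every page, archive link target, and category page as a separate hit, while the filtered count only keeps the page where the phrase is part of the content itself.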
If you scroll to the end of the results, it is eventually smart enough to see that something is going on, avoids the recursive freaking hole it has dug for itself, and actually stops at around 1,000. Which is still about 995 results past the point where even some of the dimmest children would have realized what was going on, and it could be a bit smarter about how it describes things.
And if you search on actual content like the title of a post or text inside of a post, you see a different problem -- it is actually looking at every RSS link off of every page, too. And indexing all of those as well. Fools like me who actually aggregate full posts are punished the most here, and Google will provide links to each of those pages, too.
We are impressed with Rain Man in particular and with some of the capabilities of the more talented Idiot Savants in general. But we eventually get over that and realize that the first word in that title is Idiot and that what is widely believed to be the most talented searching algorithm could perhaps become a bit smarter. I'd find it much more impressive than throwing half a million servers at the problem, were someone to ask me....
Not to further cast asparagus, but search.msn.com, for example, stops after only ten results across five different domains. Not such an idiot, msn is, huh? :-)
This post brought to you by ෝ (U+0ddd, a.k.a. SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA)
# Johannes Rössel on 11 Oct 2009 1:22 PM:
I noticed stupid search engine behavior before, didn't look much at others but Google is a pretty bad offender here. It has a tendency to direct people to the front page of my blog, or one of the later pages (?page=2) ... those are, by their very nature, pretty dynamic (well, not so much in recent times, but still). Also the front page (or category pages, etc.) tends to aggregate a bunch of keywords belonging to different posts which Google helpfully sees belonging to a single relevant page. So one could actually land on my site with a search like "postscript batch array". Never talked about those things together but separately. Yuck.
I never really found out how to teach them only to consider individual articles or pages while ignoring everything that aggregates more than one of them.
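One commonly suggested approach is to mark aggregate pages with a robots meta tag of "noindex, follow", so crawlers still follow the links to individual articles but are asked not to index the aggregate page itself. Below is a rough sketch of how a blog template might pick that tag. The path prefixes and both function names are hypothetical, and whether to exclude the front page at all is a judgment call.

```python
from urllib.parse import urlsplit, parse_qs

# Hypothetical path prefixes for pages that aggregate several posts.
AGGREGATE_PREFIXES = ("/archive/", "/category/", "/tag/")

def is_aggregate(url):
    """Guess whether a URL points at a page that lists more than one post."""
    parts = urlsplit(url)
    path = parts.path or "/"
    if path == "/":
        return True  # front page (assumption: it only lists recent posts)
    if "page" in parse_qs(parts.query):
        return True  # paginated listing such as ?page=2
    return path.startswith(AGGREGATE_PREFIXES)

def robots_meta(url):
    """Build the robots meta tag a template could emit for this URL."""
    content = "noindex, follow" if is_aggregate(url) else "index, follow"
    return f'<meta name="robots" content="{content}">'

print(robots_meta("http://example.com/?page=2"))
print(robots_meta("http://example.com/posts/postscript-arrays"))
```

This only asks the crawler nicely; it is up to each search engine whether and how quickly it honors the tag.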