Wondering whether ignored THE would be less expensive than explicit THE

by Michael S. Kaplan, published on 2007/10/06 21:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/06/5328496.aspx


Fritz, in response to my "Making the THE silent, and at the same time not doing so, as the answer?", asked via the contact link what would be the effect of Mark Liberman's "Naming Opportunities" over on Language Log.

Hmmm.

I admit I am under-equipped here, and not just because I am not a lawyer, and not just because I have seldom if ever found myself in direct contact with anyone named Fritz. :-)

But I think we're safe. It is not like they are doing anything special with the word THE if they are ignoring it in indexes. In fact, I would make the claim that an ignored THE is either exempt from royalties or at least deserves a smaller royalty amount? :-)

But ignoring that issue, if one indexes only the sentence with the ignored THE, then the sort key would be at least three bytes smaller and maybe more -- over a huge dataset the reduced storage costs could at some point become significant (though if you follow the Glenda Browne suggestion and index it twice the storage costs would be higher)....
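To make the storage arithmetic concrete, here is a rough sketch in Python. It is a toy, not how any real collation engine builds sort keys: `toy_sort_key` and `strip_leading_the` are hypothetical helpers invented for this illustration, and the toy key is just the case-folded UTF-8 bytes plus a terminator, so its byte counts will not match real sort key sizes. It only shows the direction of the effect: dropping THE shrinks the index, and indexing both forms grows it.

```python
def strip_leading_the(title: str) -> str:
    """Drop one leading 'The' (any case) when more text follows it."""
    head, _, rest = title.partition(" ")
    if head.upper() == "THE" and rest.strip():
        return rest.strip()
    return title


def toy_sort_key(text: str) -> bytes:
    """Hypothetical stand-in for a sort key: case-folded UTF-8 bytes plus a 0x00 terminator."""
    return text.casefold().encode("utf-8") + b"\x00"


titles = ["The Rolling Stones", "The Who", "Genesis"]

# Index every title exactly as written (explicit THE).
explicit = sum(len(toy_sort_key(t)) for t in titles)

# Index every title with the leading THE ignored.
ignored = sum(len(toy_sort_key(strip_leading_the(t))) for t in titles)

# Index twice: the explicit form, plus a second entry wherever ignoring THE changes the title.
doubled = explicit
for t in titles:
    stripped = strip_leading_the(t)
    if stripped != t:
        doubled += len(toy_sort_key(stripped))

print("explicit THE:", explicit, "bytes")
print("ignored THE: ", ignored, "bytes")
print("indexed both:", doubled, "bytes")
```

In this toy each ignored THE saves four bytes (three letters and the space that follows); what a real sort key would save depends on the weights the collation actually assigns.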

This post brought to you by ® (U+00ae, a.k.a. REGISTERED SIGN)


Michael Dunn_ on 7 Oct 2007 3:11 PM:

I wonder if any of the "ignore 'the' but don't really ignore it" algorithms have problems when they encounter the band named "The The." :)

