Making the THE silent, and at the same time not doing so, as the answer?

by Michael S. Kaplan, published on 2007/10/05 12:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/05/5295391.aspx


You can just ask Glenda Browne, a professional web indexer who has just received this year's Ig Nobel Prize for literature for her 2001 literary contribution (The definite article: acknowledging 'The' in index entries).

This article about the award probably has the most info, though the buzz on the web is unmistakable:

The conundrum with 'the' is deciding how to index names or titles that include it, she explains. Should 'The Who' be indexed as 'Who, The', or 'The Who'?

The key question is finding a method that would make the index easiest to use, Ms Browne says.

Her deceptively simple solution was published in 2001 in the journal The Indexer.

Index entries for names with 'the' in them should be indexed both with and without the 'the', so to speak, she says.

"I decided, look, people think in different ways, so let's put it in the index in two places."

It was a win-win solution, and a logical conclusion.

"Similar arguments apply to 'a' and 'an', but these are beyond the scope of this article," she notes in her paper.

I've actually been asked several times in the past why NLS API functions like CompareString/CompareStringEx don't ignore "initial THE" or "initial A/AN" in string comparisons. So I immediately felt grateful that my colleague Sergey Malkin pointed out Glenda Browne's honor to me, because there is nothing like having something I get to talk about! :-)

It is easy to point out that CompareString/CompareStringEx aren't just for building indexes, but that is kind of a throwaway answer that ignores something important -- the actual solution in the article still orders one entry as if the THE existed, which implies that the function has to keep acting the way it does. The whole point of the solution is to put the entry in two places, and getting two different answers to the same question simultaneously is a bit beyond what CompareString/CompareStringEx can do (how would you feel if, when you asked whether the answer was CSTR_LESS_THAN or CSTR_GREATER_THAN, you got back something like CSTR_BOTH?).
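To make the "two places" idea concrete, here is a minimal Python sketch. It is not anything in the NLS API: the names index_entries and build_index are made up for illustration, and plain case-folded ordering stands in for a real linguistic comparison. The comparison stays completely ordinary; it is the list builder that emits a second, inverted heading for any title that starts with THE.

```python
def index_entries(title, article="the"):
    """Return the headings for one title: as written, plus an
    inverted form ("Who, The") when it starts with the article."""
    entries = [title]
    first, _, rest = title.partition(" ")
    # Only invert when the article is a whole first word followed by more text.
    if rest and first.casefold() == article:
        entries.append(f"{rest}, {first}")
    return entries


def build_index(titles):
    """Collect every heading and sort with an ordinary comparison."""
    headings = []
    for title in titles:
        headings.extend(index_entries(title))
    # Assumption: case-folded code point order stands in for a locale-aware sort.
    return sorted(headings, key=str.casefold)


for heading in build_index(["The Who", "Jethro Tull", "Pink Floyd"]):
    print(heading)
```

Each individual comparison still gives exactly one answer; "The Who" simply shows up under both T and W because it was posted twice.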

Index building or list building might indeed want to solve this problem, but CompareString/CompareStringEx is not the place (other than perhaps some kind of magical NORM_IGNORE_ARTICLES flag as an option?).

And then of course there is the fact that a problem with THE is the tip of the tentacle here in a function that takes a locale as a parameter with the intent of providing correct results in that locale -- our mythical NORM_IGNORE_ARTICLES would need to support the notion across all languages, not just English, and every last EL/LA/LE and so on would have to be in there, all with the simple rule of being ignored when it is a prefix to a string.
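A rough Python sketch of what that per-locale bookkeeping might look like. Everything here is an assumption for illustration: the ARTICLES table is a tiny, incomplete hand-written word list, sort_key is a made-up helper, and no such flag or table exists in the NLS API.

```python
# Hypothetical, deliberately incomplete article lists keyed by language.
ARTICLES = {
    "en": {"the", "a", "an"},
    "es": {"el", "la", "los", "las"},
    "fr": {"le", "la", "les"},
}


def sort_key(title, locale):
    """Drop a leading article for the given locale, if there is one.

    Only handles an article that is a separate first word; elided
    forms such as French "l'" are not covered by this sketch.
    """
    first, _, rest = title.partition(" ")
    if rest and first.casefold() in ARTICLES.get(locale, set()):
        return rest.casefold()
    return title.casefold()


print(sort_key("La Bamba", "es"))
print(sort_key("La Bamba", "en"))
```

The same string gets a different key depending on the locale passed in, so the table has to be right for every language before the flag can be trusted for any of them.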

And what about languages like Hebrew, where these articles are actually prefixes (like ה for "the")? It is all well and good to say ignore them, but there are words that simply start with that letter too, words that would change meaning if it were to be gratuitously stripped....
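A small Python sketch of why the naive rule breaks. The helper naive_strip_article is hypothetical, and the two sample words are my own picks rather than anything from a real word list: it drops a leading ה with no idea whether that letter is the article or part of the word.

```python
HE = "\u05d4"  # HEBREW LETTER HE


def naive_strip_article(word):
    """Drop a leading HE, blindly assuming it is the definite article."""
    if len(word) > 1 and word.startswith(HE):
        return word[1:]
    return word


# "the book": the leading HE really is the article, so "book" is left over.
print(naive_strip_article("הספר"))

# "mountain": the leading HE is part of the word, so only a lone letter is left.
print(naive_strip_article("הר"))
```

Telling the two cases apart takes a dictionary or morphological analysis, which is a lot to ask of a comparison flag.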

Then consider whether it should also be ignored in the middle of the string as well -- after all, that is what search engines tend to do, right? Should The Dawn of the Dead and Dawn Dead sort together? Everybody knows that THE and OF are so unimportant that we don't even capitalize them in titles unless they start the title, so why not just rename that fictional flag to cover the wider class of words that are so ignorable and call it NORM_IGNORE_IGNORABLES? :-)
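For the curious, here is roughly what that would do, as a Python sketch with a made-up IGNORABLES word list and a hypothetical ignorable_key helper (neither is a real flag or API):

```python
# Hypothetical list of "ignorable" English words, for illustration only.
IGNORABLES = {"the", "of", "a", "an"}


def ignorable_key(title):
    """Build a comparison key with every ignorable word removed."""
    words = title.casefold().split()
    return " ".join(w for w in words if w not in IGNORABLES)


first = ignorable_key("The Dawn of the Dead")
second = ignorable_key("Dawn Dead")
print(first == second)
```

Two different titles collapse to one key, so a comparison built on it would have to report them as equal.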

I find myself feeling a little sad for Ms. Glenda Browne, since as a single article in a publication about indexing it is hardly out of place or even silly, yet in the limelight into which it has been thrust it is reduced to a punchline.

Maybe I am self-consciously aware of my 2000+ punchlines here, which in the context of SiaO are just snapshots into a disturbing mind (my own), but any one of which could be twisted into the same kind of joke, on me. It is why I dread the day that someone finally takes on a Wikipedia article about me, because I want to be so much more of a Raymond Chen than a Michael Everson (by which I mean less of a center of controversy in the face of a biographical piece that becomes more autobiographical if an intent to fix inaccuracies is taken to its logical but absurd conclusion).

And no, that was not an invitation. Truly not.

But you get my point, I hope. I say that we ought to give Glenda Browne a break. And a half, once this one has finished making the rounds....

 

This post brought to you by ה (U+05d4, a.k.a. HEBREW LETTER HE)


John Cowan on 5 Oct 2007 2:57 PM:

Another problem of the same flavor, pointed out to me by Joe Zitt:  Jethro Tull the inventor needs to be indexed as "Tull, Jethro", whereas Jethro Tull the band needs to be indexed as "Jethro Tull".

Michael S. Kaplan on 5 Oct 2007 3:08 PM:

Of course people always assume Ian Anderson is "Jethro" and they would ask the Pink Floyd band members "which one is Pink?" which ended up as a line in Have a Cigar....

Carl on 6 Oct 2007 7:15 PM:

iTunes handles this interestingly. By default, it gives everything a "Sort Name," "Sort Artist," "Sort Album," etc. For artist names, the sort key will move what the locale considers to be articles to the end, so that by default "The Beatles" have a sort artist of "Beatles, The". Of course, these sort fields can be manually changed, so if you want the Beatles to show up under "T" on your iPod, you just need to manually change the sort artist box in the song's properties. The nice thing about having a sort field is that it allows Japanese name sorting to be non-useless. In fact, a determined Japanese user could tag the Beatles as びーとるず and then even have all their Western artists sorted according to the "50 sounds" method. It's nice, though it could be a little better automated by Gracenote.

ReallyEvilCanine on 10 Oct 2007 8:04 AM:

More fun: languages in which the article is a suffix. Bonus points when that language declines those words.

Glenda Browne on 3 Nov 2007 11:24 PM:

Hi Michael,

Thanks for your comments on THE, and for pointing out that it is not just an issue in printed indexes. And let me add that you don't need to worry about me needing a break! Most of the commentary has been along the lines of 'yeh, that's caught me out too'. The spirit of the Igs is to delight in the winners, and to laugh with them.

Cheers,

Glenda.

PS Have quoted you at http://www.webindexing.biz/joomla/index.php?option=com_content&task=view&id=458&Itemid=1



referenced by

2007/10/06 Wondering whether ignored THE would be less expensive than explicit THE
