The "Taking the THE out of the string" thing, aka The importance of ignoring the unimportant

by Michael S. Kaplan, published on 2010/06/12 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/06/12/10023838.aspx


The question was:

We have a scenario where we would like to identify/remove ‘The’ and other articles at the beginning of an Artist name (for example) property value. I figured I’d ping to see if any such “article list” exists already somewhere in windows for different languages, or if anyone knows of other components that already do this. It looks like Windows Media Player does something similar, since the sort order of Artists appears to ignore the preceding ‘the’ (something Windows Explorer does not do).

Now that is a question I have heard before, now and again.

It is something that is not built into the platform, by and large.

You will see it here and there, like in Zune or Media Player. The typical way that it is handled in these kinds of applications is to:

  1. Write a function to adjust the string in a way that removes such values (a process that is unfortunately often called normalizing, to my dismay!);
  2. Use something like LCMapStringW to retrieve sort keys based on this adjusted string;
  3. Do all sorting with this sort key.
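To make the three steps concrete, here is a minimal sketch in Python. It is not how Zune or Media Player actually do it. The `LEADING_ARTICLES` list and both function names are made up for illustration, and `str.casefold()` is only a crude stand-in for a real linguistic sort key of the kind LCMapStringW would return.

```python
# Hypothetical, English-only article list; a real one would be per-language.
LEADING_ARTICLES = ("the", "a", "an")


def strip_leading_article(name: str) -> str:
    """Step 1: drop a single leading article, for sorting purposes only."""
    first, sep, rest = name.partition(" ")
    # Only strip when something is left over, so "The" alone stays "The".
    if sep and rest.strip() and first.casefold() in LEADING_ARTICLES:
        return rest.lstrip()
    return name


def sort_key(name: str) -> str:
    """Step 2: build a sort key from the adjusted string.

    casefold() is a placeholder here; a real application would ask the
    platform for a proper sort key instead.
    """
    return strip_leading_article(name).casefold()


artists = ["The Who", "Anthrax", "The Beatles", "A Perfect Circle",
           "Theory of a Deadman"]

# Step 3: do all sorting with the key, never with the displayed string.
ordered = sorted(artists, key=sort_key)
print(ordered)
```

Note that the displayed string is never modified; only the key used for comparison changes.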

Though some would instead modify the string in step 1 to put articles at the end, thus The Winds of War becomes Winds of War, The. I don't like this way as much myself but others do....
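For those who prefer that variation, a rough sketch might look like the following. Again, the article list and the function name are hypothetical, and it only handles English.

```python
# Hypothetical, English-only article list.
LEADING_ARTICLES = ("the", "a", "an")


def move_article_to_end(title: str) -> str:
    """Rewrite 'The Winds of War' as 'Winds of War, The'."""
    first, sep, rest = title.partition(" ")
    rest = rest.strip()
    # Leave the title alone if it has no leading article or nothing after it.
    if sep and rest and first.casefold() in LEADING_ARTICLES:
        return f"{rest}, {first}"
    return title


print(move_article_to_end("The Winds of War"))
```

Unlike the sort key approach, this one changes what the user sees, which is part of why I like it less.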

Now even within the English language there are many different operations one could consider doing as a part of this adjusting, like sorting the number 3 in the same place as the word three and other such things.
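As one illustration of that kind of extra rule, here is a small sketch that spells out standalone digits before building the key. The `NUMBER_WORDS` table is a deliberately tiny, made-up lookup (a real implementation would need a proper number-to-words routine per language), and `casefold()` is once more just a stand-in for a real sort key.

```python
import re

# Hypothetical, deliberately tiny lookup table for illustration only.
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "10": "ten",
}


def spell_out_small_numbers(text: str) -> str:
    """Replace standalone numbers found in the table with their words."""
    return re.sub(
        r"\b\d+\b",
        lambda m: NUMBER_WORDS.get(m.group(), m.group()),
        text,
    )


def sort_key(text: str) -> str:
    # Placeholder key; a real one would come from the platform.
    return spell_out_small_numbers(text).casefold()


titles = ["Three Kings", "3 Idiots", "Thirteen", "Tango"]
ordered = sorted(titles, key=sort_key)
print(ordered)
```

With this key, "3 Idiots" lands next to "Three Kings" rather than ahead of everything that starts with a letter.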

It will in fact seem eerily familiar to people who are aware of features Exchange exposes in its Microsoft.Exchange.WebServices.Data.CompleteName class and in particular its YomiGivenName and YomiSurname properties (which were designed to allow storage of a string containing the Japanese pronunciation of part of a name in order to allow for the easy indexing/sorting of the underlying data).

Now whether one would choose to store the name separately (note that my earlier suggestion did not do this) largely depends on several factors:

Obviously the trimming of articles is much more for the first of these points while the Japanese pronunciation (possibly suggested by the IME) is much more for the second and third.

In a way the whole Yomi* naming of these properties in Exchange is unfortunate since as a feature it could be useful in many other contexts:

and so on. I don't know if speakers of these other languages would find using a property with a Yomi prefix offensive or not. Anyone know for sure here?

In most cases the documentation doesn't even hint at the meaning, though KB articles are a bit freer on this point, e.g. KB 298057 (OL2002: 'Given Yomi' Appears As a 'Sort items by' Option for Contacts):

Yomi is currently generated only for the Japanese locale. With Unicode support in Outlook 2002, users can put in Japanese names in any locale. Outlook generates the Yomi, and attaches it with the contact. It is not seen until a Japanese contact form is used.

Obviously they decided not to even expose it for other languages, unfortunately.

Would it have been that hard to just call it a Pronunciation property and make it something optional to let anyone use? This would have made a lot more sense for everyone....

Well, except for the original question about sorting in a way that ignores articles - I'm not sure everyone would think of that as a "pronunciation", after all!

Back when I was on the NLS team, back when that team owned the managed chunk in .NET that does globalization support too, I once recommended a Pronunciation property be added to System.String, to be used any time one was sorting a string and the property was filled in (this was after a long conversation about the subject with one of the people in the Japanese subsidiary). But I couldn't interest anyone in making System.String "heavier" with a property so specific in both target market and usage.
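The idea is easy enough to sketch outside of System.String. The `Name` type and its `pronunciation` field below are purely hypothetical (nothing like this exists in .NET), and comparing the raw strings is only a placeholder for real collation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Name:
    text: str
    # Hypothetical optional field: how the text is read aloud.
    pronunciation: Optional[str] = None

    def sort_text(self) -> str:
        """Use the pronunciation for sorting whenever it is filled in."""
        return self.pronunciation if self.pronunciation else self.text


names = [
    Name("山田", "やまだ"),  # Yamada
    Name("青木", "あおき"),  # Aoki
    Name("鈴木", "すずき"),  # Suzuki
]

# Plain string comparison stands in for a real linguistic sort here.
ordered = [n.text for n in sorted(names, key=Name.sort_text)]
print(ordered)
```

The displayed text stays in kanji while the ordering follows the kana reading, which is exactly the job the Yomi properties do in Exchange.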

So as a feature this has largely been relegated to a small number of applications, like Outlook and Exchange.

Just like the "ignore articles" feature only shows up occasionally in a few applications, now that I think of it.

Now I don't know if the "Ignore articles" feature in Media Player or Zune ignores articles and such in other languages - does anyone know if they do?

Although I agree that as a feature it doesn't belong in low-level components like System.String or NLS, it really seems like a potentially compelling feature for ELS (Extended Linguistic Services, which I have mentioned previously), given its (in many cases language-specific) pieces....


oldnewthing on 12 Jun 2010 8:11 AM:

The English language has a lot of "normalization" rules. In addition to ignoring lead articles and spelling out numbers, you also spell out abbreviations ("St." sorts as "Saint"), insert the letter "a" into surnames ("McDonald" sorts as "MacDonald"), ignore lead punctuation, and sometimes have to apply special-case rules, such as sorting "The Former Yugoslav Republic of Macedonia" under T! We're experiencing an odd effect now: Since computers do such a bad job at implementing these traditional rules, people have gotten accustomed to the bad version and think the traditionally correct version is wrong!

Michael S. Kaplan on 12 Jun 2010 8:48 AM:

All of that feels like a compelling ELS feature to me!

Alan Braggins on 12 Jun 2010 9:15 AM:

There was a time when Amazon (or at least Amazon UK) ignored "The" in searches. Even if it was in quotes.

You can imagine how well that worked for The The.

