by Michael S. Kaplan, published on 2005/03/13 06:15 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/13/394822.aspx
Prior posts in this series:
Before you find, or search, you have to *index* (or, Language-specific processing #0)
I coffee, therefore IFilter (or, Language-specific processing #1)
Now note that this is not a "how-to" post since the docs kind of do that. It is also not a sample since the doce have one of those too. But I going to talk about stemmers for a bit -- explain what they are and talk about some of the interesting issues with them....
I will start by saying that I am not a linguist (in case anyone had not yet figured it out yet). But I do have some delusions of linguistic aptitude1.
Anyway, with all of that said, let's get started.
Looking at the IStemmer interface, the first and most obvious question that may come to mind is "what the hell is a stemmer?".
Probably easiest to go right to the docs for starters, as they explain a bit about what stemmers do without saying too much:
Stemmers perform morphological analysis and apply grammatical rules to generate a list of alternative, or inflected, forms for words. Alternative forms often have the same stem or base form. By generating the inflected forms for a word, Indexing Service returns query results that are statistically more relevant to a query. For example, a full-text query for "swim meet" matches documents that contain "swim, swim's, swims, swims', swimming, swam, swum" or "meet, meet's, meets, meets', meeting, met" and combinations of these terms.
Some languages require that inflected terms be generated at both index time and query time for both standard and variant inflections. In this case, stemming happens with the word breaker component, with minimal stemming work in the actual stemmer. For example, the Japanese word breaker performs stemming during both index creation and querying to enable a query to find different inflected forms of the search terms.
So you can think of a stemmer as a sort of "parser" that can discern the base or "stem" form of a word given one of the many possible inflections. Of course when to do the stemming really can be based on the cost of the work in terms of runtime performance versus the need to save space by not building huge indexes -- the size versus speed" issue is a very real one here. In a way it is unfortunate that the text implies that Japanese is done a particular way because the language requires it, rather than the underlying size/speed debate.
Knowledge of rules about the language and its grammar are obvious prerequisites. After all, a generic stemmer like the neutral one that is built in to Windows cannot do too much in the way of morphological analysis without the context of a language to use. If you are building a stemmer then you have to decide whether the language needs stemming done in the word breaker at index time or in the actual IStemmer implementation at runtime for querying (or both, in some cases). The results are often taken for granted in searches (if we look for "swim" we are not surprised to results like "swimming", "swam", and "swum" were also found. That is because a superior stemmer for a language will have both
An interesting question to ask at this point is what one would do with cases where the rules are followed but an exception exists. We routinely see toddlers say things like "I goed to the store" and know that they are saying "I went to the store." They are learning the rules, that's all. So why make the stemmer less smart than the parent who can understand what their child is saying, or for that matter the child who is only guilty of following the rules?
(hint for the answer to the above question -- neither MSN Search nor Google handle 'goed' in this context; they do not stem it to 'go' -- should we wonder about advanced search technology that cannot keep up with the little tykes who are probably already using a computer, at least to play Magic School Bus or whatever? <grin>)
To take a more interesting example, let's take another look at the Jack Winter's How I Met My Wife story from the 23 July 1994 New Yorker (previously discussed in the context of machine translation challenges in Saying all those nouns over and over again...). You can pop over to one of those links to refresh your memory if you like. :-)
Aha, there is now more that we can say about what is happening here than last time. If it was important (for whatever reason) to create an IStemmer implementation that would go this extra step to stem words down past the "legal" word usage level where it could actually find all of these non-standard terms later, then it is possible to imagine doing so. In practical terms, this level of stemming is seldom useful to do intentionally, but it is easy to see how hard it would be to never have it happen at all in any kind of rule based system that understands how a language works.
Let's take a quick look at the lrsample sample that exists in the Platform SDK for Index Server -- basically it is the word breaker and stemmer sample described here.
First, you have to install the Core SDK component of the Platform SDK from here. I am not recommending you install the PSDK just for the one sample, unless you really do want to see a simple one in action. I'll likely be talking more about pieces of this sample on other days). There are also other samples there worth taking a look at in the area of filtering, for example.
The sample's simple stemmer is sitting in the code in (lrsample.CXX) and consists of just a few lines:
const WCHAR * aStems[][ cMaxStemForms ] =
{
{ L"abide", L"abode", L"abided", L"abiding", L"abides" }, // 0
{ L"bat", L"bats", L"batted", L"batting" }, // 1
{ L"batch", L"batches", L"batched", L"batching" }, // 2
{ L"bear", L"bears", L"bore", L"borne", L"born" }, // 3
{ L"begin", L"began", L"begun", L"beginning", L"begins" }, // 4
{ L"dance", L"danced", L"dances", L"dancing" }, // 5
{ L"heave", L"heaved", L"hove", L"heaves", L"heaving" }, // 6
{ L"hero", L"heroes" }, // 7
{ L"keep", L"keeps", L"kept", L"keeping" }, // 8
{ L"misspell", L"misspelled", L"misspelt", L"misspelling", L"misspells" }, // 9
{ L"plead", L"pleaded", L"pled", L"pleading", L"pleads" }, // 10
{ L"run", L"runs", L"ran", L"running" }, // 11
{ L"swim", L"swam", L"swum", L"swimming", L"swims" }, // 12
{ L"underlie", L"underlay", L"underlain", L"underlying", L"underlies" }, // 13
};
This stemmer is indeed almost exclusively looking at exceptional edge cases in English. If you were creating an actual stemmer you would almost certainly want a dictionary of the information kept in some separate data file rather than defined in code like this one. You can even see how one could use no rules and only have extensive lists of the words in a language and yet be effective for normal text while doing nothing useful for things like the Winter piece.
It feels like (and is!) quite a brute force technique, and even developers who are not linguists realize that simple rules like the "-s" suffix meaning plural and the "-ed" suffix meaning past tense ar perhaps more accessible. Would it not make sense to generate such terms automatically rather than requiring each such term be explicitly added? Or are we so afraid of those cases where we"goed to the meeting" that we want to banish them all and have our prescriptive grammarian's hats on when designing a word breaker.
The interesting point here is that if one does have knowledge of a language and a desire to see that language supported, one can add the component to make it happen. And you can do this in Windows today. Microsoft has taken steps here toward getting out of the way, huh? :-)
Of course this only solves the problem for linguists who are developers, developers who are linguists, or teams that contain both and maintain cordial relations. I'm not saying this doesn't ahppen, but when it does there are so many potenial project of interest that developing stemmers may not be the highest item on the list....
In the next post in the series I'll talk a bit about word breaking, and then in future posts after that I'll talk about other issues that exist here. It is a rich area.:-)
1 - Also, I am proud to be the "data owner" of collation of some of the languages that require no linguistic aptitude whatsoever. And I am the one behind the "teaching the computer 100 lies to make the it tell the truth" strategy (more on this another day!).
This post brought to you by "ם" (U+05dd, HEBREW LETTER FINAL MEM)
A letter that often forms part of the suffix used to create the masculine plural form of words (e.g. "ילד" boy becomes "ילדים" boys/children)
# bg on 13 Mar 2005 11:28 AM:
# Michael Kaplan on 13 Mar 2005 11:32 AM:
# Dean Harding on 13 Mar 2005 4:06 PM:
# Michael Kaplan on 13 Mar 2005 4:21 PM:
referenced by
2008/12/05 "If you haven't had sex is six months..." the IM read
2007/07/24 Pluralization(s) can be singularly difficult
2006/11/12 If I wasn't watching her blog, could you really claim I was filtering the list of blogs I read?
2006/08/20 Avrupalılaştıramadıklarımızdanmışsınızcasına
2006/04/08 It may not always end with ի
2005/10/14 Syntax Savvy dictionaries?
2005/03/21 Linguistic and Unicode considerations (or Language-specific Processing #4)
2005/03/14 You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3)
2005/03/13 What about search for kids?