You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3)

by Michael S. Kaplan, published on 2005/03/14 08:22 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/14/395199.aspx


Prior posts in this series:
  
Before you find, or search, you have to *index* (or, Language-specific processing #0) 
   I coffee, therefore IFilter (or, Language-specific processing #1) 
   IStemmer'ed the tide (or, Language-specific processing #2)

Ok, I have talked a little bit about the IFilter interface and the IStemmer interface. I have also talked about the haunting words of premonition in Nadine Kano's book that referred to the IWordBreak interface.

I have been unable to find anyone who knows whether the name was changed or the text was inaccurate, not that it makes much of a difference. There was a guy I used to work with on the Access development team who would dig back into source logs (as far back Access 1.0 if that is what it took) just so he could explain who had introduced a bug. Fascination in the historical aspects of the development process is no more anti-social of a hobby than stamp collecting. If you know what I mean. :-)

Anyway....

The IWordBreaker interface, in the words of docs:

The IWordBreaker interface is a language-specific language resource component. The word breaker parses text and identifies individual words and phrases. The word breaker is used in background processes and must be optimized for both throughput and minimal use of resources.

The important method for our purposes is the BreakText method. It breaks text to identify words and phrases, information that it then provides in the form of WordSink and PhraseSink objects. As with the other methods, you can see how knowledge of a language can enable one to break the text in ways most likely to find good results.

The Platform SDK also has a great topic entitled Implementing a Word Breaker which among other things gives specific examples of what I am talking about in the text:

The main function of the BreakText method is to process text continuously from the text source until all the text is processed, or until the word breaker encounters an error. While in this data processing loop, the BreakText method calls parsing and utility methods that perform specific tasks for that process. For example, the German word breaker may handle compound words, whereas the French word breaker may process diacritics or clitics. The specific functions that the word breaker performs and the strategy that it employs in performing these tasks depend entirely on the requirements for that language.

When breaking text, word breakers identify "alternative" forms for words that may have multiple representations. No semantic relationship is implied between the generated words. In fact, the original word may not be included among the list of alternatives. The alternative forms are saved in the same position in the index as the original word to indicate that they are identical.

When a document is included in the index, each word is assigned an integer value that represents the offset, or the distance of the word from the beginning of a document. The relative distance between words in a query is compared against the offsets stored in the full-text index. The query "Where is Kyle's document" matches any document with "Where" at offset n, "is" at n+1, "Kyle's" at n+2, and "document" at n+3. "Where is Kyle's document filed in the data-base?" is represented as:

Where is Kyle
Kyle's
document filed in the database
data base

In this example, the word breaker stores alternative forms for "Kyle" ("Kyle's") and "database" ("data base") in the index. The word breaker generates and stores alternative words during the index creation process under the following conditions:

  • If an alternative word is likely to appear as a single word in a query
  • If a stemmer is not likely to derive the original word from the alternative

Generating alternative word forms increases the number of ways that queries represent and match a sentence:

  1. Where is Kyle document filed in the database
  2. Where is Kyle's document filed in the database
  3. Where is Kyle document filed in the data base
  4. Where is Kyle's document filed in the data base

It then goes on actually explain how the WordSink and PhraseSink objects are used:

Word breakers use the WordSink and PhraseSink objects to collect and store all words and phrases that they extract from text. A word breaker stores words in a form that is as close as possible to the original word form in the document. The PhraseSink stores phrases at query time. Phrases improve the relevance of query results because longer sequences of words are rarer and provide greater distinction than smaller phrases. When Indexing Service places a phrase in the PhraseSink at query time, it creates an instance of the word breaker to break the phrase into words. Indexing Service then evaluates the phrase by checking whether the words in the phrase occur adjacent to one another in the index. For example, if "ABCD" occurs in the index at positions x, x+1, x+2, and x+3, the phrase match will occur if any adjacent substring of "ABCD" is submitted in a query. This strategy is effective for character-based word breakers that split phrases and long words during index creation and that generate phrases during query time.

Phrases are interesting because they can veer into areaa that are both domain specific (think about common phrases that might be recognized, like "word breaker") and language specific (think about languages with words that are particles or have very little use as independent search terms but which may be incredibly useful in providing context for surrounding words). Also, the Wordsink and PhraseSink objects are interesting in that they let you add alternate word and phrases as they make sense. Thus if they search for 'Micro$oft' you can put in 'Microsoft' and such <grin>, and other more linguisticlaly sensible word and pharse substitutions.

Finally it moves into area that truly involve language issues, and anyone who is using Unisrribe will recognize the concepts, about the actual breaks:

Breaks are the spaces between words. White space, punctuation, formatting, or just the nature of the language itself can cause breaks. There are four different types of breaks that Indexing Service uses: end of word (EOW), end of sentence (EOS), end of paragraph (EOP) and end of chapter (EOC). The EOW break is the default break. After each token, each break indicates a different semantic distance between the words on either side. Words separated by EOW have the tightest semantic link, followed by EOS, EOP, and EOC. Multiple calls to PutBreak are cumulative, and are analogous to inserting null words or sentences.

It is unclear to me at this point how much of the information is truly leveraged and how much is there for potential future growth. But even if it is the latter I think it is a good direction to grow in, if you know what I mean.

Looking in the Platform SDK, there is the lrsample sample for Index Server -- basically it is the word breaker and stemmer sample described here. It has an interesting BreakText method in lrsample.CXX:

//+---------------------------------------------------------------------------
//
//  Member:     CSampleWordBreaker::BreakText
//
//  Synopsis:   Break a block of text into individual words
//
//  Arguments:  [pTextSource]  -- Source of characters to work on
//              [pWordSink]    -- Where to send the words found
//              [pPhraseSink]  -- Where to send the phrases found (not used)
//
//  Returns:    S_OK if successful or an error code
//
//----------------------------------------------------------------------------

HRESULT STDMETHODCALLTYPE CSampleWordBreaker::BreakText(
    TEXT_SOURCE * pTextSource,
    IWordSink *   pWordSink,
    IPhraseSink * pPhraseSink )
{
    // Validate arguments

    if ( 0 == pTextSource )
        return E_INVALIDARG;

    if ( ( 0 == pWordSink ) || ( pTextSource->iCur == pTextSource->iEnd ) )
        return S_OK;

    if ( pTextSource->iCur > pTextSource->iEnd )
        return E_INVALIDARG;

    ULONG cwcProcessed;   // # chars actually processed by Tokenize()
    HRESULT hr = S_OK;

    // Pull text from the text source and tokenize it

    do
    {
        BOOL fFirstTime = TRUE;

        while ( pTextSource->iCur < pTextSource->iEnd )
        {
            ULONG cwc = pTextSource->iEnd - pTextSource->iCur;

            // Process in buckets of cwcAtATime only
                 
            if ( cwc >= CSampleWordBreaker::cwcAtATime )
                cwc = CSampleWordBreaker::cwcAtATime;
            else if ( !fFirstTime )
                break;

            hr = Tokenize( pTextSource, cwc, pWordSink, cwcProcessed );
            if ( FAILED( hr ) )
                return hr;

            pTextSource->iCur += cwcProcessed;
            fFirstTime = FALSE;
        }

        hr = pTextSource->pfnFillTextBuffer( pTextSource );
    } while ( SUCCEEDED( hr ) );

    //
    // If anything failed except for running out of text, report the error.
    // Otherwise, for cases like out of memory, files will not get retried or
    // reported as failures properly.
    //

    if ( ( FAILED( hr ) ) &&
         ( FILTER_E_NO_MORE_VALUES != hr ) &&
         ( FILTER_E_NO_TEXT != hr ) &&
         ( FILTER_E_NO_VALUES != hr ) &&
         ( FILTER_E_NO_MORE_TEXT != hr ) &&
         ( FILTER_E_END_OF_CHUNKS != hr ) &&
         ( FILTER_E_EMBEDDING_UNAVAILABLE != hr ) &&
         ( WBREAK_E_END_OF_TEXT != hr ) )
        return hr;

    ULONG cwc = pTextSource->iEnd - pTextSource->iCur;

    if ( 0 == cwc )
        return S_OK;

    return Tokenize( pTextSource, cwc, pWordSink, cwcProcessed );
} //BreakText

The supporting code behind this method (CSampleWordBreaker::Tokenize, ScanChunk, and IsWordChar in particular) are also of interest here if you want to see a working sample. Also note how the code is careful to avoid re-entrancy, which can be a problem so severe that it is probably the best reason to have a sample in the Platform SDK (so we do not find implementations of IWordBreaker in the wild that bring the process to its knees!).

But if you are doing language-specific work, the ability to do the word breaking is huge if you have the information. It is widely believed that the Thai word breaker in Microsoft Word is a bit more sophisticated than the one in Uniscribe and GDI+ specifically because it has an extended dictionary behind it (I have never seen the code so I can neither confirm nor deny this). Nadine Kano hints at the problems in her book:

Dividing Lines of Text in Thai

Thai editions of Windows come with a fairly sophisticated line-breaking algorithm. If you are writing a Thai-language application, take advantage of what the system provides rather than trying to come up with your own line-breaking code. To give you an idea of what would be involved, try to decipher the following line:

Imaginethatthisisastringtobewordwrappedtheonlywaytodosoinenglishwouldbetoidentifytheindividualwordsandthendeterminethebestplacetobreaktheline

Translation: Imagine that this is a string to be word-wrapped. The only way to do so in English would be to identify the individual words and then determine the best place to break the line.

The line-breaking algorithm provided by the system solves these problems for you.

But if you have better information than what one of Microsoft's word breakers provide then putting in your own word breaker may be just what you (and everyone else) needs!

You may also find a disconnect between components like word breakers for search and Uniscribe at times, and perhaps the word breakers that interest you are the ones that remove this disconnect. Though this is an interesting challenge if you are not the person doing the rendering, or even if you are since they would typically happen at different times. I suspect that even applications like Word cannot do a lot of this sort of thing at the same time given the amount of memory it would take to store the information just in case someone wanted to search.

Luckily, many of the Uniscribe methods like ScriptBreak and ScriptStringBreak can be called independently of the orginal rendering for analysis purposes (more on this another day!).

Next time in the series I'll talk more about the specific language issues that may come up when trying to break and stem streams of text....

 

This post brought to you by "≈" (U+2248, a.k.a. ALMOST EQUAL TO)


# AC on 14 Mar 2005 8:35 AM:

An interesting, if disjointed, post!

While we wre left wondering how hard it truly is to create the perfect word breaker, the post implies that even the ones Microsoft could you some work. Do you know if any of that work is happening?

# Michael Kaplan on 14 Mar 2005 8:51 AM:

I honestly do not know what the plans are here, though I plan to keep my eye on places like http://research.microsoft.com/ in case they come up with something!

But you are right that there is much room for future improvement here. And since the architecture is open someone else can do it today if they understand the rules of the language. It is all about getting out of the way.

# Joseph Petersen on 14 Mar 2005 3:18 PM:

How do you call Uniscribe independent of rendering?

# Michael Kaplan on 14 Mar 2005 3:31 PM:

Excellent question, Joseph! I'll be talking about this soon!

# Chrstoph Eisenmann on 15 Mar 2005 11:04 PM:

Hallo <br> <br>You write: <br> <br>Thai editions of Windows come with a fairly sophisticated line-breaking algorithm. If you are writing a Thai-language application, take advantage of what the system provides rather than trying to come up with your own line-breaking code. To give you an idea of what would be involved, try to decipher the following line: <br> <br>... <br> <br>This is my problem. I urgently would need a method which is breaking a thai and laotian sentance into words. I know. It is very complicated, because just seperating by words is not enough. You have also to consider the type of document. <br> <br>The problem is: In thai and laotian where are combined words which make only sence in a document written by laywers, or doctors for example . Take gotmai (the law) the two words make sense independant got mai, but in laywers language they belong together (should not be broken into pieces) otherwise they should. <br> <br>I need a method which uses a <br> <br>- domain specific (like law, medicine) word dictionary first) <br> <br>- a general dictionary (thai) second. <br> <br>- If still no satifying result. Try to break by formula <br> <br>Like said. This would be the best way. But I would be happy with anything which I could use for work breaking. Even (or better because) I am fluent in those languages, I would prefer something which is ready or can be mitified. <br> <br>Can you point me to a direction where I can find some classes, libraries e.t.c. suitable for vb.net (or C#) <br> <br>Thanks for your nice blog. I really like it. <br> <br>Chris <br> <br>Bangkok/Thailand <br>Vietiane/Lao (PDR) <br>

# Tanveer Badar on 20 Dec 2007 5:11 AM:

"There was a guy I used to work with on the Access development team who would dig back into source logs (as far back Access 1.0 if that is what it took) just so he could explain who had introduced a bug."

That's what I do to get the smelly hat off my head.


referenced by

2007/08/12 Blast from the [email] past about IWordBreaker

2006/11/12 If I wasn't watching her blog, could you really claim I was filtering the list of blogs I read?

2006/08/20 Avrupalılaştıramadıklarımızdanmışsınızcasına

2005/03/21 Linguistic and Unicode considerations (or Language-specific Processing #4)

go to newer or older post, or back to index or month or day