Give me a [word-]break!

by Michael S. Kaplan, published on 2005/02/21 06:51 -05:00, original URI:

Recently in my post 'English only! (or how to misuse NLS APIs)' I gave an example of misusing locale settings that I hinted at when I talked about 'What is my locale? Well, which locale do you mean?'

I thought it might make sense to give a bit of a follow-up on what I have found out since then.

I was in a great meeting a few days ago that included the International Program Manager for the MSN Toolbar Suite. Now she (that IPM) is a true exception to that generalization I made about how usually the IPM is inexperienced, because she is experienced in the area of international stuff. I remember meeting her a few years ago at one of the International Unicode Conferences (maybe in Hong Kong? They all kind of blend together!) and at least once I helped her team (at the time) with some VB6 code that gets the best font to use depending on language settings....

Anyway, I had a chance to find out from her what feature the MSN Toolbar Suite wanted on the machine. The problem boils down to word breakers.

I will talk more about word breakers (and stemmers!) another day.[1]

Now obviously both are critical for search to be maximally effective, as they guide the index creation that makes the search effective. Without specific knowledge about a language, how can one search for the various ways one can express a word?

Interestingly enough, for the particular scenario being handled here, none of the locales I mentioned in this post apply. Windows installs these components based on the "languages your system supports" that I talk about here. In XP and Server 2003 they have been replaced by the "supplemental language support" mentioned here in the middle tab.

Now this actually makes sense since you might have all of the word breakers installed (if e.g. those two CheckBoxes are both clicked) and yet your locale settings may also be pointing to some other locale.

To discover whether or not that support is installed for a given locale, you can do a simple call to the IsValidLocale API with the LCID_INSTALLED flag.
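A minimal sketch of that check, written in Python with ctypes rather than in whatever language the toolbar setup actually uses. The function name `is_locale_installed` and the optional `is_valid_locale` parameter are hypothetical, added only so the logic can be exercised off Windows; the flag values are the ones defined in winnls.h.

```python
import ctypes
import sys

# Flag values from winnls.h.
LCID_INSTALLED = 0x00000001  # locale is supported AND installed
LCID_SUPPORTED = 0x00000002  # locale is supported, installed or not


def is_locale_installed(lcid, is_valid_locale=None):
    """Return True if IsValidLocale reports the LCID as installed.

    is_valid_locale is a hypothetical hook for substituting a stand-in
    callable; by default the real kernel32 export is used.
    """
    if is_valid_locale is None:
        if sys.platform != "win32":
            raise OSError("IsValidLocale is only available on Windows")
        is_valid_locale = ctypes.windll.kernel32.IsValidLocale
    # IsValidLocale(LCID Locale, DWORD dwFlags) returns a nonzero BOOL on success.
    return bool(is_valid_locale(lcid, LCID_INSTALLED))


if __name__ == "__main__" and sys.platform == "win32":
    # 0x0411 is the LCID for Japanese (Japan); 0x0409 is English (United States).
    for lcid in (0x0409, 0x0411):
        print(hex(lcid), is_locale_installed(lcid))
```

Passing LCID_SUPPORTED instead would only say that Windows knows about the locale, not that its support files are actually on the machine, which is the distinction that matters here.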

It is easy to see why they want to be very careful when they report whether search will be affected. There are a huge number of variables to account for here (something else I will talk about another day when I talk about word breakers and stemmers!).

The good news here is that things will actually work well quite often! So you do not have to be afraid of the toolbar install if your settings are outside of "US English"; just ignore the warning and things will work quite well a lot of the time....

1 - For the moment it's enough to give a basic definition. Word breakers are the things that (based on language) will find word boundaries so one has the items for which to search, and stemmers are the things that (again, based on language) will be able to extract "stems" of words so that additional forms of the word can be found (like if I search for "playing" and it can also find "play" and "played", etc.).
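To make that definition concrete, here is a deliberately naive sketch in Python. It is not how the Windows word breakers or stemmers work; the suffix list and the function names `break_words`, `stem` and `build_index` are all made up for illustration, and the rules only make sense for a handful of English words.

```python
import re

# Hypothetical toy rules for English only. Real word breakers and stemmers
# are language-specific components and far more involved than this.
SUFFIXES = ("ing", "ed", "s")


def break_words(text):
    """Toy word breaker: split on anything that is not a letter."""
    return [w.lower() for w in re.findall(r"[^\W\d_]+", text)]


def stem(word):
    """Toy stemmer: strip the first matching suffix, keeping >= 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(text):
    """Map each stem to the positions of the words that reduce to it."""
    index = {}
    for position, word in enumerate(break_words(text)):
        index.setdefault(stem(word), []).append(position)
    return index


if __name__ == "__main__":
    index = build_index("She played while he was playing")
    # A search for "playing" is stemmed the same way before the lookup,
    # so it finds both "played" and "playing".
    print(index.get(stem("playing")))
```

Even this toy shows why the index depends on language knowledge: whitespace and punctuation do not mark word boundaries in every language, and suffix stripping like this would be wrong for most of them.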

This post brought to you by "ڥ" (U+06a5, ARABIC LETTER FEH WITH THREE DOTS BELOW)

# Michael Kaplan on 21 Feb 2005 9:25 PM:

It is amazing -- post bad news and people are all over it; post good news and the silence is resounding.

Well, I think the quiet is nicer, but if I find myself nodding off I will know the best way to wake up.... :-)


referenced by

2005/03/08 Before you find, or search, you have to *index* (or, Language-specific processing #0)
