Blast from the [email] past about IWordBreaker

by Michael S. Kaplan, published on 2007/08/13 02:01 -04:00, original URI:

A while back, Windows Fundamentals developer Ryan Myers emailed me a question. A few days later I could not find the message, and I did not remember who had sent it; I only remembered that it had to do with Index Server.

Anyway, I found the message yesterday (looks like it ended up in the wrong folder by accident), and even though it is way too late to be of use, I thought I'd go ahead and try to cover it a bit.

Sorry for the delay, Ryan. :-(

Hi Michael,

I’ve been helping a friend with a test indexing app inside a game server and came across your blog posts from last year on the IWordBreaker interface from Indexing Server.  I had a quick question that maybe you could help me with.

He’s writing this server to be international, and wants to wordbreak properly; the game client sends an LCID as part of the connection process.  Some Googling pointed me to HKLM\SYSTEM\CurrentControlSet\Control\ContentIndex\Language – as far as I can tell, he can iterate over all subkeys and compare the Locale value to the LCID, and if it matches, convert the WordBreaker CLSID to a GUID and CoCreate away.  However, although this works, it seems a little ugly.

Is there a way to get to these classes without trawling the Registry?  If not, that’s fine; I’m just curious if there’s an official way to do it. :-)


The post Ryan was talking about was You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3).

As it turns out, there really doesn't seem to be a better way, though I can at least point to some official documentation now (I might be wrong, but I am not sure it actually existed back then).
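For anyone who wants to see the registry walk Ryan describes spelled out, here is a minimal sketch in Python. It is an illustration under assumptions, not the official way to do anything: the value name WBreakerClass is my recollection of what the word breaker entry is called and should be checked against the machine, the function names are made up for this example, and the matching logic is kept separate from the registry reading so it can be tried with sample data.

```python
import sys

# Key named in the email; the subkeys are one per language.
LANGUAGE_KEY = r"SYSTEM\CurrentControlSet\Control\ContentIndex\Language"


def find_word_breaker_clsid(languages, lcid, value_name="WBreakerClass"):
    """Return the word breaker CLSID string whose Locale equals lcid.

    languages maps a subkey name to a dict of that subkey's values.
    value_name is an assumption; verify the real value name first.
    Returns None when no language entry matches.
    """
    for values in languages.values():
        if values.get("Locale") == lcid and value_name in values:
            return values[value_name]
    return None


def read_language_entries():
    """Read every language subkey into a dict (Windows only)."""
    import winreg  # standard library, only available on Windows

    entries = {}
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, LANGUAGE_KEY) as root:
        subkey_count, _, _ = winreg.QueryInfoKey(root)
        for i in range(subkey_count):
            name = winreg.EnumKey(root, i)
            with winreg.OpenKey(root, name) as sub:
                _, value_count, _ = winreg.QueryInfoKey(sub)
                values = {}
                for j in range(value_count):
                    value_name, data, _type = winreg.EnumValue(sub, j)
                    values[value_name] = data
                entries[name] = values
    return entries


if __name__ == "__main__" and sys.platform == "win32":
    # 0x0409 is English (United States).
    print(find_word_breaker_clsid(read_language_entries(), 0x0409))
```

The string that comes back would still have to be turned into a GUID (CLSIDFromString) and handed to CoCreateInstance, which is the part this sketch leaves out.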

The topic is entitled Platform SDK: Indexing Service Registry Entries, and it actually points to some methods off of the AdminIndexServer object for programmatically setting and retrieving the entries, if that makes it easier (it was apparently done to make scripting these things easier?).

There is a particular page with details on the Language-Specific Registry Entries which might also be useful (it talks about the Language_Dialect subkey off the key above, which I have never actually seen on any machine).

Of course you'll notice that these are all LCID-bound (well, actually LANGID-bound, but they call it an LCID). In their defense, they may not have been at that presentation I talked about in Your LCID sucks, or they may not have believed it. In its own way this is unfortunate, since there is documentation for anyone to create their own word breakers and stemmers, for any language (though every language has to fit within the LCID limitations).
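To make the LCID versus LANGID distinction concrete, here is a small sketch of how the pieces are packed, following the usual Windows layout (LANGID in the low 16 bits of the LCID, sort ID in the next four bits, and the LANGID itself split into a 10-bit primary language and a 6-bit sublanguage). The function name is made up for the example.

```python
def split_lcid(lcid):
    """Split an LCID into (langid, sort_id, primary_lang, sub_lang)."""
    langid = lcid & 0xFFFF          # same result as the LANGIDFROMLCID macro
    sort_id = (lcid >> 16) & 0xF    # same result as the SORTIDFROMLCID macro
    primary_lang = langid & 0x3FF   # low 10 bits of the LANGID
    sub_lang = langid >> 10         # high 6 bits of the LANGID
    return langid, sort_id, primary_lang, sub_lang


# German (Germany) with the default sort and with the phone book sort:
# different LCIDs, same LANGID.
default_sort = split_lcid(0x00000407)
phone_book_sort = split_lcid(0x00010407)
print(hex(default_sort[0]), hex(phone_book_sort[0]))
```

Anything keyed on the LANGID alone cannot tell those two apart.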

We may have a suggestion or two here to make to the team. More on this another day... :-)


This post brought to you by w (U+0077, a.k.a. LATIN SMALL LETTER W)
