Before you find, or search, you have to *index* (or, Language-specific processing #0)

by Michael S. Kaplan, published on 2005/03/08 02:02 -05:00, original URI:

(I call this post #0 since it is more of an introduction to a topic that I will be returning to on a regular basis over the next few months.)

Back at the end of 2000, I had a meeting with the lead international program manager of SQL Server. One of the architects in the group had written a multi-page email describing the collation support in SQL Server 2000, and the PM wanted to add information about other parts of the SQL Server product, so that there would be a single place where all there is to know about SQL Server's international support could be found.

They estimated it would be about 10-15 pages. I put together an outline and let them know that to cover all of the topics on that outline it would actually be more like forty. They were a little staggered by the outline, but the "cover everything" idea was theirs, not mine. So they accepted my updated number.

Turns out we were both wrong.

The finished white paper International Features in Microsoft SQL Server 2000 came out in April 2001 and clocked in at somewhere between 57 and 65 pages, depending on whether you had the HTML version or the Word .DOC file version (content is the same, they just format pages and margins differently).

They got ripped off. It was a really fun project; I would probably have done it for nothing had I known how much fun it would be. I probably would have paid them for the point when the person in SQL Server marketing wanted to talk about my overuse of the word "unfortunately" when I talked about limitations. :-)

Now a lot of what was there, I already knew. For those topics the white paper was a chance to get it all down in one place. There was also going to be a book by Sams Publishing, independent of this white paper, entitled Internationalization with SQL Server, but the publisher decided the market was not big enough to sustain the book. So they paid off my advance when the book was only about 10% turned in and 50% done. My Acquisitions Editor (Sharon Cox) had left Sams, so I was not at all put out by this. Especially when they paid me off. :-)

Anyway, there were a few topics that were new to me, and one of those was the Microsoft Search service, which sits underneath SQL Server's Full-Text Search, Index Server, and Exchange Full-Text Search. And SharePoint. I had a few amazing conversations with Margaret Li where I learned about the work that the word breakers and stemmers do for the various languages supported by the engine underneath these search technologies. At the end, she pointed out that Nadine Kano first obliquely hinted at the interfaces that one would use to do this work in Developing International Software for Windows 95 and Windows NT:

Line-breaking and word-wrapping algorithms are important to text parsing as well as to text display. The rules for Asian languages, however, are quite different from the rules for Western languages. For example, unlike most Western written languages, Chinese, Japanese, Korean, and Thai do not necessarily indicate the distinction between words by using spaces. The Thai language doesn't even use punctuation. For these languages, software applications cannot conveniently base line breaks and word-wrapping algorithms on a space character or on standard hyphenation rules. They must follow different guidelines.

Because the Win32 full-text search engine for Microsoft WinHelp recognizes that word wrapping is more complex for some languages than for others, it supports the IWordBreak OLE interface. That way, if a third-party developer creates a superior word-wrapping algorithm for any language, the WinHelp engine can take advantage of it through OLE.

The only problem is that the interface was not yet public. Oops!

Margaret did tell me that her team was willing to publish it but that they needed the time to get it done (and there never seemed to be enough time). Luckily someone did find the time, because today you can read all about the interfaces right on MSDN. And in this blog, in this series....

This post will be the first of what will be many articles on this fascinating area that is a cousin of collation and an uncle of search, but which has many interesting features and issues of its own.

The interfaces I will be talking about here are ones that are used by MSN Desktop Search, as I talked about previously in Give me a [word-]break! Imagine for a moment -- perhaps the act of creating such a component might one day allow components like MSN Desktop Search, SQL Server Full-Text Search, or any of the others to index content for you using the rules of your own language.

That is undeniably cool, is it not?

And it definitely falls into both of the categories of opening it all up and getting out of the way. :-)
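
To make the problem in the quoted passage a little more concrete, here is a minimal sketch in Python of why splitting on spaces is not enough to index a language like Japanese, and why a pluggable word breaker helps. This is not the actual word breaker interface and not how any Microsoft search engine does it: the `WordBreaker` type, the function names, and the tiny lexicon are all hypothetical, and the greedy longest-match approach is just a stand-in for the much more sophisticated linguistic work a real word breaker does.

```python
from typing import Callable, Dict, List, Set

# Hypothetical shape of a pluggable word breaker: text in, list of words out.
WordBreaker = Callable[[str], List[str]]


def whitespace_breaker(text: str) -> List[str]:
    """Naive breaker: treats whitespace as the only word boundary."""
    return text.split()


def make_longest_match_breaker(lexicon: Set[str]) -> WordBreaker:
    """Build a greedy longest-match breaker from a (hypothetical) word list.

    At each position it takes the longest lexicon entry that matches,
    falling back to a single character when nothing matches.
    """
    max_len = max((len(word) for word in lexicon), default=1)

    def breaker(text: str) -> List[str]:
        words: List[str] = []
        i = 0
        while i < len(text):
            if text[i].isspace():
                i += 1
                continue
            # Try the longest candidate first, then shrink.
            for size in range(min(max_len, len(text) - i), 0, -1):
                candidate = text[i:i + size]
                if size == 1 or candidate in lexicon:
                    words.append(candidate)
                    i += size
                    break
        return words

    return breaker


def build_index(docs: Dict[str, str], breaker: WordBreaker) -> Dict[str, Set[str]]:
    """Build a toy inverted index (word -> document ids) using the given breaker."""
    index: Dict[str, Set[str]] = {}
    for doc_id, text in docs.items():
        for word in breaker(text):
            index.setdefault(word, set()).add(doc_id)
    return index


if __name__ == "__main__":
    docs = {"doc1": "東京都に住む"}  # Japanese text with no spaces between words
    japanese_breaker = make_longest_match_breaker({"東京", "東京都", "に", "住む"})

    # The whitespace breaker indexes the whole sentence as one "word".
    print(build_index(docs, whitespace_breaker))
    # The lexicon-based breaker indexes the individual words.
    print(build_index(docs, japanese_breaker))
```

The point of the sketch is only the shape of the thing: the indexer does not care which breaker it is handed, so someone who knows a language better can supply a better one.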


This post brought to you by "L" (U+004C, LATIN CAPITAL LETTER L)
Because L is for Language and this letter just couldn't stay away from a cool topic like this one!

# bg on 8 Mar 2005 4:09 AM:

can't find anything about IWordBreak on MSDN, is it called something different now?



# Michael Kaplan on 8 Mar 2005 5:04 AM:

Interesting question, bg. :-)

The name that the book quoted was not actually correct. But I'll be posting more about the actual feature as it currently exists soon....

# Brian on 8 Mar 2005 7:50 AM:

bg: MK probably meant IWordBreaker

I don't know if it was a typo or a name change since he wrote this.

# Michael Kaplan on 8 Mar 2005 7:53 AM:

Yes, I did. It was not a typo -- I quoted Nadine's book directly (check it out via the link!).

I'll talk about some of these classes more later....

# RIO - Randektív Informatikai Oldal on 9 Mar 2005 3:54 AM:

Istanbul has been given a new name:

# Anthony Mills on 11 Mar 2005 9:01 AM:

Typo alert!

In your linked SQL Server article, about halfway down, it says

... if you are using any COM service such as ADO to access the server, you must for its intervention.

That would probably be "you must plan for its intervention," I think ...

A great article, and thanks for linking it!

# Michael Kaplan on 11 Mar 2005 9:01 AM:

Indeed, no one ever brought that one up during the editing. Good catch! :-)

It was a lot of fun to write, and even more fun learning about the different technologies....


referenced by

2005/03/21 Linguistic and Unicode considerations (or Language-specific Processing #4)

2005/03/14 You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3)

2005/03/13 IStemmer'ed the tide (or, Language-specific processing #2)

2005/03/08 I coffee, therefore IFilter (or, Language-specific processing #1)
