by Michael S. Kaplan, published on 2005/03/21 00:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/21/399589.aspx
Prior posts in this series:
Before you find, or search, you have to *index* (or, Language-specific processing #0)
I coffee, therefore IFilter (or, Language-specific processing #1)
IStemmer'ed the tide (or, Language-specific processing #2)
You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3)
Looking at a doc site that many of the IFilter interface, IWordBreaker interface, and IStemmer interface topics link to, there is a page entitled Linguistic and Unicode Considerations that is worth taking a really good look at.
One annoying bit about a few of these pages is that they are in different encodings yet are not marked as such. It looks like this is caused by the content wrapper rather than the content itself, incidentally. The page is marked "charset=windows-1252" even though it is clearly not in several cases (even if the "frame" is, a side effect of the wider decision to not using HTML frames here?). My IE 6.0.3790.0 browser requires me to change the encodings by right clicking. When I had to do this, I put the information below in green. YMMV, but it will probably be about the same distance. :-)
The initial page is a link to a bunch of topics, which I am going to list here, with some thoughts on each....
Surface Form Normalization talks about hyphenation, possessives, diacritics, and clitics (the latter two are the only ones that really hint at best practices in other languages -- I would have loved to see some non-English hyphenation of possessive examples!). The last two topics are UTF-8, the difference between "über" and "über", between "donnés" and "donnés".
Phrase Identification talks about the best time to identify phrases (at query time, according to the topic). No non-English examples, which seemed a little unfortunate to me.
Agglutinative Languages is very much about languages other than English (like Finnish, a language which would be an interesting challenge to do a good word breaker for!).
Numbers and Dates gives a not intuitive to everyone but interesting and sensible recommendation -- to store numbers, dates, and times in a locale-independent manner. What makes this sensible is that at query time the person querying will use their own preferences, which may not match those of the document. Although doing this makes sense, that does not make it easy. This can be quite a challenge in Win32 (which has no locale sensitive parsing APIs) but would be much easier in the .NET Framework (which contains such functionality). Not sure how big managed IStemmers are, though. Might be good to find out? :-)
Compound Words are very common in languages like German and Dutch, and the recommendations in such cases are simple -- index all the component parts of such words. This would definitely require good knowledge of the language in question to do, for obvious reasons.
Compound Phrases mainly talks about such phrases in agglutinative languages like Korean. The topic explains how a good word breaker will have to be able analyze the text different ways and then have techniques to weigh them and properly index the text. The topics actually wants a Korean encoding, and the difference for me was "Á¦ÀÛÀÚÀÇ" versus "제작자의".
Special Characters and Words mainly talks bout how you usually want to ignore symbols and punctuation, though it does explain how there are exceptions that can be blamed on programming languages (e.g. "C++") or on Microsoft (e.g. ".NET") or on both (e.g. "C#"). Obviously there are other types of exceptions, many of which might be domain-specific. A generic word breaker may have less interest than one that caters to a particular domain.
Acronyms and Abbreviations gives some interesting thoughts about how acronyms and abbreviations might be handled (some of the same domain-specific rules may apply).
Capitalization talks about how Indexing Service does not preserve capitalization for the full-text index, a fact that I will tentatively consider depressing for Turkic languages like Azeri and Turkish but might hope for linguistic casing rules that would do the right thing for these scenarios. :-)
Nonbreaking Spaces actually talks about all sorts of characters that might imply continuity of on sort while preserving separation of another, like the underscore in welcome_home_sis.txt, where the indexing recommendation is to index the words both with and without breaks. It also seems a little arbitrary to see this split out from two related topics (Hyphenation and Phrase Identification) since some key phrases may be linked in other ways beyond nonbreaking spaces and underscores, yet the other topics do not hint at this. As a side note, such a rule should not be universally applied to similar seeming but entirely different characters like U+200d (ZERO WIDTH JOINER) and U+200c (ZERO WIDTH NON-JOINER), which do not break semantic content.
Surrogate Pairs implicitly relies on the fact that the MS Search service is intrinsically UTF-16 based, and gives the rules about surrogate pairs (one high surrogate followed by on low surrogate). It does flunk the Unicode terminology test with sentences like "Surrogate pairs extend the character set beyond the Unicode character". The text comes close to implying that such pairs should be indexed separately even though each one is some cases only a single letter -- which clearly implies that they should just be indexed as they would be given the need to index characters and words in a language. This will become more important as larger corpuses of text are produced in many of the historic and other languages that use supplementary characters for their representation.
What I found most interesting about all of these topics was the balance they had to strike between
How did they do? Well, I think they did not really give enough detail to help much with #3 (in these topics or in the other ones in the Platform SDK), and I would have loved more of #1 as a part of the "fascinated by linguistic concepts" fetish I have going. I think both may have been due to not enough time to do more, in which case I would love to see more issues covered, for more languages. Were I more of an expert I would offer to help with that. :-)
Though I may help with some of the Unicode topics -- more would be interesting a better terminology would also be cool. Now, to track down the owner....
This post sponsored by U+200d (ZERO WIDTH JOINER) and U+200c (ZERO WIDTH NON-JOINER)
Two characters that would love more visibility, which is no small feat when you are invisible like they are!
# AC on 21 Mar 2005 8:44 AM:
# Michael Kaplan on 21 Mar 2005 8:55 AM:
# Mike Dimmick on 30 Mar 2005 4:02 AM:
# Michael Kaplan on 30 Mar 2005 5:52 AM:
Yuhong Bao on 30 Jul 2010 2:23 PM:
"Compound Phrases mainly talks about such phrases in agglutinative languages like Korean. The topic explains how a good word breaker will have to be able analyze the text different ways and then have techniques to weigh them and properly index the text. The topics actually wants a Korean encoding, and the difference for me was "Á¦ÀÛÀÚÀÇ" versus "제작자의"."
This one is still there. Have you reported it to the MSDN folks? Luckily the UTF-8 ones are fixed by now.
Michael S. Kaplan on 30 Jul 2010 3:16 PM:
The problems were reported but apparently not fixed. :-(
referenced by
2011/07/08 Not dumb, but dumb quotes! (aka Sorry Mr. Boehner, this one may be our fault)
2008/04/23 That brings new meaning to having "a ç-section" (Ãç§), doesn't it?
2007/10/17 CSI: Unicode?
2007/08/11 Should old aquaintance *not* be forgot, code pages may screw up their names anyhow
2006/12/23 Do not adjust your browser, a.k.a. sometimes two wrongs DO make a right, a.k.a. dumb quotes
2006/06/04 What's the encoding, again?