Linguistic and Unicode considerations (or Language-specific Processing #4)

by Michael S. Kaplan, published on 2005/03/21 00:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/03/21/399589.aspx

Prior posts in this series:
   Before you find, or search, you have to *index* (or, Language-specific processing #0)
   I coffee, therefore IFilter (or, Language-specific processing #1)
   IStemmer'ed the tide (or, Language-specific processing #2)
   You toucha my letters, IWordBreaker you face (or, Language-specific processing, #3)

Looking at a doc site that many of the IFilter interface, IWordBreaker interface, and IStemmer interface topics link to, there is a page entitled Linguistic and Unicode Considerations that is worth taking a really good look at.

One annoying bit about a few of these pages is that they are in different encodings yet are not marked as such. It looks like this is caused by the content wrapper rather than the content itself, incidentally. The page is marked "charset=windows-1252" even though in several cases it clearly is not (even if the "frame" is; a side effect of the wider decision not to use HTML frames here?). My IE 6.0.3790.0 browser requires me to change the encoding by right-clicking. When I had to do this, I put the information below in green. YMMV, but it will probably be about the same distance. :-)

The initial page is a link to a bunch of topics, which I am going to list here, with some thoughts on each....

Surface Form Normalization talks about hyphenation, possessives, diacritics, and clitics (the latter two are the only ones that really hint at best practices in other languages -- I would have loved to see some non-English hyphenation or possessive examples!). The pages for the last two topics are actually in UTF-8; for me that was the difference between "Ã¼ber" and "über", between "donnÃ©s" and "donnés".
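
To make the mismatch concrete, here is a minimal Python sketch (not from the MSDN topic) that reproduces the garbling by decoding UTF-8 bytes as windows-1252, and then undoes it. The function names are made up for illustration.

```python
def misread_utf8_as_cp1252(text: str) -> str:
    """Encode as UTF-8, then decode those bytes with the wrong code page."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the misreading: recover the bytes via windows-1252, decode as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


# misread_utf8_as_cp1252("über")   gives "Ã¼ber"
# misread_utf8_as_cp1252("donnés") gives "donnÃ©s"
# repair("Ã¼ber")                  gives "über"
```

The repair only works when every garbled character maps back to a windows-1252 byte, which is the case for both examples here.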

Phrase Identification talks about the best time to identify phrases (at query time, according to the topic). No non-English examples, which seemed a little unfortunate to me.

Agglutinative Languages is very much about languages other than English (like Finnish, a language which would be an interesting challenge to do a good word breaker for!).

Numbers and Dates gives a recommendation that is not intuitive to everyone but is interesting and sensible -- to store numbers, dates, and times in a locale-independent manner. What makes this sensible is that at query time the person querying will use their own preferences, which may not match those of the document. Although doing this makes sense, that does not make it easy. This can be quite a challenge in Win32 (which has no locale-sensitive parsing APIs) but would be much easier in the .NET Framework (which contains such functionality). Not sure how big managed IStemmers are, though. Might be good to find out? :-)
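
As a rough illustration of what "locale-independent" storage could look like, here is a small Python sketch. The `CONVENTIONS` table and both function names are hypothetical; a real component would take its formats from the platform's locale data rather than a hand-written table, and this is not how Indexing Service itself does it.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical per-locale conventions, hard-coded only for this example.
CONVENTIONS = {
    "en-US": {"date": "%m/%d/%Y", "group": ",", "decimal": "."},
    "de-DE": {"date": "%d.%m.%Y", "group": ".", "decimal": ","},
}


def normalize_date(text: str, locale_name: str) -> str:
    """Parse a locale-formatted date and return it as ISO 8601 (YYYY-MM-DD)."""
    fmt = CONVENTIONS[locale_name]["date"]
    return datetime.strptime(text, fmt).date().isoformat()


def normalize_number(text: str, locale_name: str) -> Decimal:
    """Drop the grouping separator, then map the decimal separator to '.'."""
    conv = CONVENTIONS[locale_name]
    plain = text.replace(conv["group"], "").replace(conv["decimal"], ".")
    return Decimal(plain)


# normalize_date("03/21/2005", "en-US") and normalize_date("21.03.2005", "de-DE")
# both give "2005-03-21", so a query in either format can match the same entry.
```

The same normalization would have to run on the query text, using the preferences of the person querying rather than those of the document.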

Compound Words are very common in languages like German and Dutch, and the recommendations in such cases are simple -- index all the component parts of such words. This would definitely require good knowledge of the language in question to do, for obvious reasons.
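
A toy Python sketch of that recommendation follows. The four-word `LEXICON` is purely illustrative, the function names are invented, and indexing the whole word alongside its parts is an assumption of this sketch; a real decompounder needs a full dictionary plus rules for linking elements.

```python
# Toy lexicon for illustration only.
LEXICON = {"donau", "dampf", "schiff", "fahrt"}


def split_compound(word, lexicon=LEXICON):
    """Return one segmentation of `word` into lexicon entries, or None."""
    word = word.lower()
    if not word:
        return []
    # Try the longest candidate prefix first, then recurse on the remainder.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [head] + rest
    return None


def index_terms(word, lexicon=LEXICON):
    """Index the whole word plus its component parts, when a split exists."""
    whole = word.lower()
    parts = split_compound(whole, lexicon) or []
    return [whole] + [p for p in parts if p != whole]


# index_terms("Donaudampfschiff")
# gives ["donaudampfschiff", "donau", "dampf", "schiff"]
```

Even this tiny version shows why knowledge of the language matters: everything depends on what is in the lexicon.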

Compound Phrases mainly talks about such phrases in agglutinative languages like Korean. The topic explains how a good word breaker will have to be able to analyze the text in different ways and then have techniques to weigh them and properly index the text. The topic actually wants a Korean encoding, and the difference for me was "Á¦ÀÛÀÚÀÇ" versus "제작자의".
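
The Korean case can be undone the same way as any single misreading. The sketch below assumes the page's real encoding is EUC-KR (the text above only says "a Korean encoding") and that the browser showed the bytes as windows-1252; `reinterpret` is a made-up helper name.

```python
def reinterpret(garbled: str, shown_as: str = "cp1252", actual: str = "euc_kr") -> str:
    """Recover the original bytes from the wrongly decoded text, then
    decode them with the encoding the page should have declared."""
    return garbled.encode(shown_as).decode(actual)


# reinterpret("Á¦ÀÛÀÚÀÇ") gives "제작자의"
```

Each Hangul syllable is two bytes in EUC-KR, which is why four syllables show up as eight Latin characters.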

Special Characters and Words mainly talks about how you usually want to ignore symbols and punctuation, though it does explain how there are exceptions that can be blamed on programming languages (e.g. "C++") or on Microsoft (e.g. ".NET") or on both (e.g. "C#"). Obviously there are other types of exceptions, many of which might be domain-specific. A generic word breaker may have less interest than one that caters to a particular domain.
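
One way to picture the exception handling is a tokenizer that checks a list of special terms before falling back to plain word characters. This is only a sketch in Python; `SPECIAL_TERMS` is a hypothetical, domain-specific list and nothing here reflects how any shipping word breaker is built.

```python
import re

# Hypothetical exception list; a domain-specific breaker would supply its own.
SPECIAL_TERMS = ["C++", "C#", ".NET"]

# Longest terms first, so a longer exception wins over a shorter one.
_special = "|".join(
    re.escape(t) for t in sorted(SPECIAL_TERMS, key=len, reverse=True)
)
# A special term must not be glued to a word character on either side;
# anything else that is not a run of word characters is ignored.
TOKEN_RE = re.compile(rf"(?<!\w)(?:{_special})(?!\w)|\w+")


def tokenize(text):
    """Return word tokens, keeping the listed special terms intact."""
    return TOKEN_RE.findall(text)


# tokenize("C# and C++ are .NET languages, period.")
# gives ["C#", "and", "C++", "are", ".NET", "languages", "period"]
```

Because of the lookbehind, "ASP.NET" comes out as "ASP" and "NET" unless it is added to the list itself, which is exactly the kind of domain decision mentioned above.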

Acronyms and Abbreviations gives some interesting thoughts about how acronyms and abbreviations might be handled (some of the same domain-specific rules may apply).

Capitalization talks about how Indexing Service does not preserve capitalization for the full-text index, a fact that I will tentatively consider depressing for Turkic languages like Azeri and Turkish but might hope for linguistic casing rules that would do the right thing for these scenarios. :-)
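
For anyone who has not run into the Turkic casing problem, here is a minimal Python sketch. `turkic_lower` is an invented helper that handles only the dotted and dotless I, under the assumption that the text is known to be Turkish or Azeri.

```python
def turkic_lower(text: str) -> str:
    """Lowercase using the Turkish/Azeri rules for the letter I.

    I (U+0049) lowers to dotless i (U+0131), and
    capital I with dot above (U+0130) lowers to plain i (U+0069).
    Everything else uses the default mapping.
    """
    return text.replace("I", "\u0131").replace("\u0130", "i").lower()


# Default lowercasing turns "DI\u015e" (tooth) into "di\u015f";
# turkic_lower gives "d\u0131\u015f", which is the actual Turkish word.
```

An index that folds case with the default rules would treat those two spellings as the same term, even though Turkish readers see two different letters.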

Nonbreaking Spaces actually talks about all sorts of characters that might imply continuity of one sort while preserving separation of another, like the underscore in welcome_home_sis.txt, where the indexing recommendation is to index the words both with and without breaks. It also seems a little arbitrary to see this split out from two related topics (Hyphenation and Phrase Identification) since some key phrases may be linked in other ways beyond nonbreaking spaces and underscores, yet the other topics do not hint at this. As a side note, such a rule should not be universally applied to similar seeming but entirely different characters like U+200d (ZERO WIDTH JOINER) and U+200c (ZERO WIDTH NON-JOINER), which do not break semantic content.
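
A small Python sketch of the "index both ways" idea, using the file name stem from the example above. The set of soft break characters (underscore and U+00A0) and the function name are assumptions of this sketch; the zero width joiner and non-joiner are deliberately left out of that set.

```python
import re

# Characters treated here as "soft" breaks: underscore and NO-BREAK SPACE.
# U+200C and U+200D are intentionally absent, so they never split a term.
SOFT_BREAKS = re.compile("[_\u00a0]+")


def index_terms(token):
    """Return the unbroken token plus its parts when soft breaks split it."""
    parts = [p for p in SOFT_BREAKS.split(token) if p]
    terms = [token]
    if len(parts) > 1:
        terms.extend(parts)
    return terms


# index_terms("welcome_home_sis")
# gives ["welcome_home_sis", "welcome", "home", "sis"]
# index_terms("ab\u200ccd") gives ["ab\u200ccd"] (ZWNJ does not split)
```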

Surrogate Pairs implicitly relies on the fact that the MS Search service is intrinsically UTF-16 based, and gives the rules about surrogate pairs (one high surrogate followed by one low surrogate). It does flunk the Unicode terminology test with sentences like "Surrogate pairs extend the character set beyond the Unicode character". The text comes close to implying that such pairs should be indexed separately even though each one is in some cases only a single letter -- which clearly implies that they should just be indexed as they would be given the need to index characters and words in a language. This will become more important as larger corpuses of text are produced in many of the historic and other languages that use supplementary characters for their representation.
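
The pairing rule is easy to show in a few lines of Python. This is just a sketch with invented function names: one helper exposes the UTF-16 code units of a string, the other combines each high and low surrogate back into a single code point.

```python
def to_utf16_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def code_points(units):
    """Combine each high surrogate + low surrogate pair into one code point."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)  # BMP code unit, or an unpaired surrogate passed through
            i += 1
    return out


# to_utf16_units("\U00010400") gives [0xD801, 0xDC00]
# code_points([0x41, 0xD801, 0xDC00]) gives [0x41, 0x10400]
```

U+10400 is a single Deseret letter, so a word breaker that split between the two code units would be cutting one letter in half.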

What I found most interesting about all of these topics was the balance they had to strike between

1. hinting at complexities for the sake of people who are either unaware of them in other languages or even their own;
2. not wanting to go too far off topic for people focused on a particular language to whom not all items would apply;
3. recognizing that anyone truly contemplating building a word breaker in a language will understand many of the linguistic issues and thus not need a primer about them but instead more examples of how the concepts are applied.

How did they do? Well, I think they did not really give enough detail to help much with #3 (in these topics or in the other ones in the Platform SDK), and I would have loved more of #1 as a part of the "fascinated by linguistic concepts" fetish I have going. I think both may have been due to not enough time to do more, in which case I would love to see more issues covered, for more languages. Were I more of an expert I would offer to help with that. :-)

Though I may help with some of the Unicode topics -- more would be interesting, and better terminology would also be cool. Now, to track down the owner....

This post sponsored by U+200d (ZERO WIDTH JOINER) and U+200c (ZERO WIDTH NON-JOINER)
Two characters that would love more visibility, which is no small feat when you are invisible like they are!

# AC on 21 Mar 2005 8:44 AM:

Michael, get to it! You should be writing some documentation. And fixing the charset bug in MSDN online while you are there.

# Michael Kaplan on 21 Mar 2005 8:55 AM:

Well, I will try to help, but I am not going to pretend I could write all the docs!

As for the charset bug, I do not know enough about the content delivery mechanism to even guess at solutions....

# Mike Dimmick on 30 Mar 2005 4:02 AM:

Re: MSDN charset bug - my guess is that there's a double-encoding bug in there. I was reading some of the Patterns & Practices Enterprise Library documentation earlier, which referred to a 'façade'. Or would have, if an encoding bug hadn't intervened.

What was actually shown was "faÃƒÂ§ade". The page was marked Windows-1252 in the HTML using the <META> element (Fiddler reveals that no encoding was specified at the HTTP level). Manually switching the encoding to UTF-8 gets us to faÃ§ade. I'm guessing that the source material (in XML?) was in UTF-8, but that the XML->HTML renderer interprets it as Windows-1252 and re-encodes the misinterpreted data as UTF-8. Undoing the UTF-8 transformation one more time gets us 0xC3 0xA7 => 0xE7 = ç.
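
The round trip guessed at here can be reproduced in a few lines of Python. This is only a sketch of that guess, with an invented function name; it says nothing about what the MSDN pipeline actually does.

```python
def double_encode(text):
    """Apply the 'UTF-8 bytes read as Windows-1252' mistake twice."""
    once = text.encode("utf-8").decode("cp1252")    # UTF-8 source misread as Windows-1252
    twice = once.encode("utf-8").decode("cp1252")   # re-encoded as UTF-8, shown as Windows-1252
    return once, twice


# double_encode("façade") gives ("faÃ§ade", "faÃƒÂ§ade")
```

The first element is what appears after manually switching the browser to UTF-8, and the second is what the page showed as served.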

It looks like this particular page (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/dnpag2/html/updaterv2.asp) went wrong when it was decompiled from the offline documentation (implied by Tim Ewald at http://pluralsight.com/blogs/tewald/archive/2004/09/23/2368.aspx). The offline documentation, in Document Explorer 7.0 (HxS) format, is encoded as Windows-1252.

The last couple of times I've reported this to MSDN they've simply patched the topic to use ASCII characters (removed diacritic marks, converted en or em hyphens to regular hyphens, smart quotes to straight quotes, replacement of ® with (R) and © with (C), etc).

I think, however, that little work will now be done to the live MSDN Library site CMS as there's already a replacement which runs msdn2.microsoft.com (and according to Tim Ewald, also TechNet).

Just as a side note - I notice that the typeface for the MSDN logo has changed recently. You can still see the old logo at msdn2.microsoft.com. The new one looks like the same face as the Windows Server System logo - the old one was consistent with Windows XP and Windows 2000.

# Michael Kaplan on 30 Mar 2005 5:52 AM:

No double encoding bug in these cases (everything was fixed with the one conversion). But if memory serves, the original MSDN site used frames, which could of course have charset headers at the subpage level. Now there is only the top level, and if there is no good communication mechanism then the only way to make it work is to always use UTF-8 (which they do not seem to want to do).

Ah well, five steps forward, two steps back. That's my motto!

# Yuhong Bao on 30 Jul 2010 2:23 PM:

"Compound Phrases mainly talks about such phrases in agglutinative languages like Korean. The topic explains how a good word breaker will have to be able analyze the text different ways and then have techniques to weigh them and properly index the text. The topics actually wants a Korean encoding, and the difference for me was "Á¦ÀÛÀÚÀÇ" versus "제작자의"."

This one is still there. Have you reported it to the MSDN folks? Luckily the UTF-8 ones are fixed by now.

# Michael S. Kaplan on 30 Jul 2010 3:16 PM:

The problems were reported but apparently not fixed. :-(


