Microsoft is not uncaron^H^Hing about the issue!

by Michael S. Kaplan, published on 2007/04/11 11:48 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/04/11/2087685.aspx

Sometimes, in order get the best results in collation, one has to use constructs that from a linguistic or a Unicode general category standpoint might seem incorrect.

A good example is times that a character which is not on the list of Mn (Mark, Nonspacing) characters in Unicode is given only diacritic weight, such as ઽ (U+0abd a.k.a. GUJARATI SIGN AVAGRAHA) that I discussed here and which is an Lo (Letter, Other). The expected result in proper linguistic support in collation is achieved, but only by violating the common sense expectations about how the character ought to be classified and used.

Another example came from friend and former colleague Juraj who just emailed me a question the other day:

My friend in Slovakia asked me: "How do I ignore diacritics when doing a query on a column? I tried COLLATE Slovak_CI_AI but it didn't work. When I did for example a query WHERE Column LIKE '%C%', entries with 'Č' were not selected. Then I found a post on a forum saying that this is not a bug, but a feature. I don't understand: how is such collation useful? Can you explain?"

This is obviously not specific only to Slovak language (and not only to letter 'Č'), as I found through another blog post (in Czech):
http://blog.vyvojar.cz/mafalt/archive/2006/10/31/_0C01ED00_m-n_E100_s-mohou-p_5901_ekvapit-collations_3F00_-II_2E00_.aspx

Juraj is right -- both Czech and Slovak consider some of the letters such as U+010c (LATIN CAPITAL LETTER C WITH CARON) to not be treated as a base letter plus a diacritic but instead as a character with a unique alphabetic weight, an issue I discussed previously in this post and this one.

But notice what is happening both the example Juraj gave and in the Czech blog post he referenced -- people are not trying to order data, they are searching within data and are thus concerned with identity. And because of this, they consider the support that has been calibrated around the principle that ORDER and IDENTITY should be treated the same to be returning non-intuitive results.

It actually takes me back to the issues I described in Hungarian is even more complicated than I thought where the fact that collation algorithms such as both Microsoft's and Unicode's do not separate the kind of results produced by a CompareString function and a mythical EqualString function.

In the specific IDENTITY case of search, however, one can make the case that they should sometimes be separate, on a per-language basis -- it is pretty clear, for example, that most Swedes still would not want å (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE) to ever be treated like an ordinary a (U+0061, a.k.a. LATIN SMALL LETTER A).

The difference between the two cases is obvious, of course -- if the letter sorts nowhere near its base (as in the case of Swedish where it sorts after z rather than near a), one would probably not want search to find anything other than the character one asked for. While in the cases such as Czech and Slovak (where the letter sorts after the base character but with a unique alphabetic weight), the "folding" within search is perhaps more expected.

HOWEVER, with that said, a simple IGNORE_NONSPACE result is also not what users would expect. They actually would in all likelihood still (when searching for Č) want the results with the CARON to be preferred, rather than simply treating the characters both with and without the CARON to be treated identically.

This is not a feature that collation in Windows or .NET or SQL Server or the UCA in Unicode currently supports, and in fact the data are laid out in a way that makes it harder to do, requiring multiple passes over the data.

This simple fact suggests possible solutions to the problem, none of which are easy but all of which are tractable. People interested in search may even want to be thinking about a whole new semantic around how to morph the meaning of the word to IGNORE to mean something more like PREFERBUTNOTREQUIRE.

But since someone from Microsoft has pointed the issue out, you clearly can't claim that Microsoft is unCARON about the problem!

And as an aside I think it is hilarious that most spellcheckers recommend uncaring when they see the word uncaron, my favorite "pun-as-a-psedudo-back-formation" of the day!

(this post dedicated to Petra and Oskar, the two people in the world who were special enough to draw away one of the five best collation testers of all time from their job in Redmond; no one will ever accuse Juraj of being unCARON!)

This post brought to you by č (U+010d, a.k.a. LATIN SMALL LETTER C WITH CARON)

no comments

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day