What's a secondary distinction?

by Michael S. Kaplan, published on 2005/12/29 15:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/29/508045.aspx


It was over a year ago that I talked about how They ask me "why is my Korean text in random order?"

It is a pretty important concept in collation to have items collate with multiple levels. What is interesting about this concept is how hard it is to describe to people, yet how easily and intuitively those same people will recognize the results.

The differences are often very subtle rather than the more obvious case of Swedish (which I talked about here). Whether you treat a letter like ā (U+0101, LATIN SMALL LETTER A WITH MACRON) as a LATIN SMALL LETTER A with a diacritic on top of it or as an entirely separate letter right after a is a difference you will only notice if you are sorting several words whose differences include the two letters. Easy to see in a dictionary or a (potentially contrived) word list, to be sure, but not quite as obvious in everyday situations, even when the letters are commonly used.

To give an example, it is simply easier to see the difference between

and

but a simple listing of letters like

would look identical in the two cases. When you are not dealing with large lists such as dictionaries, you may not notice the difference.
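To make the two treatments concrete, here is a minimal Python sketch. It is not how any real collation engine is implemented; the word list is invented for illustration, and the names `key_mark_as_tiebreak`, `key_separate_letter`, and `ALPHABET` are hypothetical. One key treats ā as an a whose macron only breaks ties, the other treats ā as its own letter right after a.

```python
import unicodedata

A_MACRON = "\u0101"  # LATIN SMALL LETTER A WITH MACRON

def split_marks(ch):
    """Split one character into (base letter, combining marks) using NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    return decomposed[0], decomposed[1:]

def key_mark_as_tiebreak(word):
    # Treatment 1: compare base letters first; the marks are only
    # consulted when the base letters of two words are all equal.
    pairs = [split_marks(ch) for ch in unicodedata.normalize("NFC", word)]
    return ([base for base, _ in pairs], [marks for _, marks in pairs])

# Treatment 2 (hypothetical alphabet): a-macron is a letter of its own,
# placed immediately after a.
ALPHABET = "a" + A_MACRON + "bcdefghijklmnopqrstuvwxyz"
RANK = {ch: i for i, ch in enumerate(ALPHABET)}

def key_separate_letter(word):
    return [RANK[ch] for ch in unicodedata.normalize("NFC", word)]

# Invented strings, chosen only so the two orders differ.
words = ["az", A_MACRON + "b", "ab", "ba"]
letters = ["b", A_MACRON, "a"]

words_tiebreak = sorted(words, key=key_mark_as_tiebreak)
words_separate = sorted(words, key=key_separate_letter)
letters_tiebreak = sorted(letters, key=key_mark_as_tiebreak)
letters_separate = sorted(letters, key=key_separate_letter)

print(words_tiebreak)    # macron form sorts next to its plain twin
print(words_separate)    # macron form sorts after every plain-a word
print(letters_tiebreak == letters_separate)
```

With these inputs the word list comes out in two different orders, while the bare list of letters comes out the same either way, which is why a short list rarely exposes the difference.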

Now there is one case where you often would have a large list on a computer, and that is an address book. However, even if you are looking at the list, in most cases you are typing the name, which will shorten the list. In common usage, you may never notice that the computer is not living up to your expectations.

The end result (in cases where the computer does not match the user's expectations) is usually either not noticing at all or a subconscious sense that the computer has it wrong, without an explicit understanding of what might be incorrect. If the differences end up being significant enough, users may eventually try to figure out what's wrong. But in most cases they will just report the bug rather than trying to dig into it. Because no matter how intuitive user expectations are, they're not very easily explained.

In collation terms, the difference between those first two lists that I gave above is whether U+0101 has a primary distinction from the letter a or a secondary one. But if somebody is giving you a list that makes up the ALPHABET for a language, that type of distinction is usually absent. So if somebody tells you that their alphabet is:

aāàáâãäåbcdeēèéêëfghiīìíîïjklmnoōòóôõöpqrstuūùúûüvwxyýÿz

then you would not have enough information to decide how the letters should sort. Because no real information is being given about the primary and secondary distinctions. In many languages that have letters with diacritics, you can't assume that they are all even handled the same way!

In collation, this does become crucial in situations like the one I pointed out in You can't ignore diacritics when a language does not give them diacritic weight, because of a difference between the user's and the computer's expectations, usually because the language settings are incorrect on the computer....
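Here is a rough Python sketch of why "ignore diacritics" depends on which kind of distinction a letter has. It is only an illustration under my own assumptions, not the way Windows or any other collation library works: the set `PRIMARY_LETTERS` and the function names are hypothetical. Ignoring diacritics is modeled as comparing primary keys only, so a letter that has a primary weight of its own has nothing to ignore.

```python
import unicodedata

A_MACRON = "\u0101"  # LATIN SMALL LETTER A WITH MACRON
E_ACUTE = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

def strip_marks(text):
    """Drop combining marks, leaving only the base letters."""
    nfd = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

def primary_key_all_secondary(word):
    # Assumption: every diacritic is a secondary distinction,
    # so the primary key is just the base letters.
    return strip_marks(word)

# Hypothetical language setting: a-macron is a separate letter
# (primary distinction); every other diacritic stays secondary.
PRIMARY_LETTERS = {A_MACRON}

def primary_key_macron_is_letter(word):
    out = []
    for ch in unicodedata.normalize("NFC", word):
        out.append(ch if ch in PRIMARY_LETTERS else strip_marks(ch))
    return "".join(out)

def equal_ignoring_diacritics(x, y, primary_key):
    """Compare at the primary level only, i.e. with secondary weights ignored."""
    return primary_key(x) == primary_key(y)

# Invented strings, used only to show the behavior.
macron_vs_plain_secondary = equal_ignoring_diacritics(
    A_MACRON + "b", "ab", primary_key_all_secondary)
macron_vs_plain_primary = equal_ignoring_diacritics(
    A_MACRON + "b", "ab", primary_key_macron_is_letter)
acute_vs_plain_primary = equal_ignoring_diacritics(
    E_ACUTE + "b", "eb", primary_key_macron_is_letter)

print(macron_vs_plain_secondary, macron_vs_plain_primary, acute_vs_plain_primary)
```

Under the first setting the macron is ignored and the two strings match. Under the second they do not match, even though é still matches e, which also shows that letters with diacritics need not all be handled the same way within one language.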

Perhaps the best reason to make sure that your default user locale is set properly? :-)

 

This post brought to you by "ā" (U+0101, a.k.a. LATIN SMALL LETTER A WITH MACRON)



