Sorting multilingual data

by Michael S. Kaplan, published on 2006/01/01 23:08 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/01/01/508503.aspx


Benski asked me:

Hi Michael,

I found your blog recently (while searching for how FAT32 & NTFS deal with case insensitivity). I've been reading non-stop for a few days now, but still have a lot of archives to go through :) Love it so far.

I've got a question for you. It's something I've been thinking about as I overhaul a software product for "proper" unicode support.

What is the best approach for dealing with strings from multiple languages? Display is simple, but I'm a little unsure about sorting and searching.

Imagine an inventory management system. There are English, German, and Swedish products in the system. If a user with an English locale lists items alphabetically, what should happen with diacritics, since the sorting rules are different for German and Swedish?

If the system were to know which products were Swedish and which were German, does this help out? If I generate sort keys with LCMapString on two strings, but call once with a German locale and the other with a Swedish locale, will the two sort keys be "compatible"? I would think the correct sorting order would be for German diacritical letters to sort near their non-marked cousin, while Swedish diacritics would sort at the end of the alphabet. Is this correct?

Thanks for the kind words! :-)

I sort of hinted at the answer in this post, but did not explicitly state it there. Cathy Wissink and I covered it in a few of the Unicode talks that we did. But I don't think I have covered it here, except a little bit indirectly in this other post.

So I think I will do so now....

The rule is simple -- the person who is looking at the data has a specific expectation, and it is a single collation, the one that they know. They do not expect data in a different language to sort like a user of that other language would.

This is one of the reasons that collation on Windows handles all of Unicode, from the point of view of a given language.

To answer the other question, if you pick up sort keys from two different locales, they are not for most purposes 'compatible' in the sense of being able to be meaningfully compared. So generating indexes in that kind of a way will not produce meaningful indexes....
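To make both points concrete, here is a minimal sketch in Python. It is a toy model, not how Windows or LCMapString actually builds sort keys: the function `toy_sort_key`, the two locale codes, and the weights are all made up for illustration. The "de" rules treat a diacritic as a secondary difference on the base letter, while the "sv" rules treat å, ä and ö as separate letters after z.

```python
import unicodedata

# Hypothetical, greatly simplified weight tables. Real collation data is far richer.
BASE = "abcdefghijklmnopqrstuvwxyz"
SWEDISH_EXTRA = unicodedata.normalize("NFC", "åäö")  # separate letters after z


def toy_sort_key(text, locale):
    """Return a (primary, secondary) key for text under toy 'de' or 'sv' rules."""
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", text).lower():
        if locale == "sv" and ch in SWEDISH_EXTRA:
            # Swedish toy rule: å, ä, ö get their own primary weights after z.
            primary.append(len(BASE) + SWEDISH_EXTRA.index(ch))
            secondary.append(0)
        else:
            # German toy rule (and the default): weigh the base letter first,
            # and record the diacritic only as a secondary difference.
            decomposed = unicodedata.normalize("NFD", ch)
            base = decomposed[0]
            primary.append(BASE.index(base) if base in BASE else 1000 + ord(base))
            secondary.append(1 if len(decomposed) > 1 else 0)
    return (tuple(primary), tuple(secondary))


products = ["Zebra", "Öl", "Ofen", "Apfel"]

# One collation for the whole list, chosen by who is looking at it.
german_view = sorted(products, key=lambda s: toy_sort_key(s, "de"))
swedish_view = sorted(products, key=lambda s: toy_sort_key(s, "sv"))

print(german_view)   # ['Apfel', 'Ofen', 'Öl', 'Zebra']
print(swedish_view)  # ['Apfel', 'Ofen', 'Zebra', 'Öl']

# The same string gets different keys under the two rule sets, so keys
# built with different locales cannot be meaningfully compared.
print(toy_sort_key("Öl", "de") == toy_sort_key("Öl", "sv"))  # False
```

Each view sorts every product with a single set of rules, no matter which language the product name came from. An index that stored the "de" key for some rows and the "sv" key for others would have no consistent order at all.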

 

This post brought to you by "Ö" (U+00D6, a.k.a. LATIN CAPITAL LETTER O WITH DIAERESIS)


# Serge Wautier on 2 Jan 2006 7:35 AM:

Out of curiosity, how are the other scripts weighted compared to the 'local' script ?

And who decides it ? I'd be surprised that the Académie française defines sort rules for greek or japanese characters.

# Michael S. Kaplan on 2 Jan 2006 8:20 AM:

They do not, thank goodness!

The relative ordering between them is something we define, and that definition does not really change in most cases when the LCID does.
