More on sort elements

by Michael S. Kaplan, published on 2005/07/20 04:15 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/20/440842.aspx


Yesterday I contrasted sort elements and text elements. I am now going to leave text elements aside for a bit, because linguistic collation on Windows, at its heart, is an ordering based on sort elements, not text elements.

Every time I look at the text in the Platform SDK for the CompareString function, I cannot help but smile:

If the two strings are different lengths, they are compared up to the length of the shortest one. If they are equivalent at that point, the return value indicates that the longer string is greater.

The truth is that throughout the string, it is the sort elements that are being compared. There are times that one code point actually represents two sort elements (think Æ, a.k.a. U+00c6 a.k.a. LATIN CAPITAL LETTER AE in some languages) or three sort elements (think ﬃ, a.k.a. U+fb03 a.k.a. LATIN SMALL LIGATURE FFI in other languages). There are other times that two code points (think ch in Traditional Spanish) or three code points (think dzs in Hungarian) make up a single sort element. Other times code points have no weight and are ignored entirely, having no sort element at all.

So if each code point can contribute anywhere between 0 and 3 sort elements (with fractional values supported, as when several code points share a single sort element), it is hard to equate string length with any operation beyond knowing when to stop looking. The string length is definitely not a count of relevant elements to consider!
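To make the mismatch between string length and sort element count concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the table `TOY_TABLE` and the function `sort_elements` are hypothetical, the table mixes examples from several languages into one pretend locale, and none of it reflects the real Windows collation data or how CompareString is implemented.

```python
# Toy table, NOT real Windows collation data. It maps a run of code points
# to the list of sort elements it produces in one pretend locale.
TOY_TABLE = {
    "\u00c6": ["A", "E"],        # expansion: one code point, two sort elements
    "\ufb03": ["f", "f", "i"],   # expansion: one code point, three sort elements
    "ch": ["ch"],                # contraction: two code points, one sort element
    "dzs": ["dzs"],              # contraction: three code points, one sort element
    "\u00ad": [],                # soft hyphen, assumed ignorable: no sort element
}
MAX_KEY = max(len(key) for key in TOY_TABLE)


def sort_elements(text):
    """Split text into sort elements, preferring the longest table match."""
    out, i = [], 0
    while i < len(text):
        for size in range(min(MAX_KEY, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in TOY_TABLE:
                out.extend(TOY_TABLE[chunk])
                i += size
                break
        else:
            # Default case: one code point gives exactly one sort element.
            out.append(text[i])
            i += 1
    return out


if __name__ == "__main__":
    for word in ["o\ufb03ce", "chico", "dzsem", "co\u00adop"]:
        print(repr(word), len(word), len(sort_elements(word)))
```

Under these toy rules, "o\ufb03ce" is 4 code points but 6 sort elements, "chico" is 5 code points but 4 sort elements, "dzsem" is 5 code points but 3 sort elements, and "co\u00adop" is 5 code points but 4 sort elements.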

It makes the notion of that sentence from the documentation almost comical. Since CompareString is looking at each string, one sort element at a time, the only length that is meaningful to it is the length in sort elements; it is only when the sort elements are equivalent until one string ends that the issue with the longer string being greater comes into play.
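A rough way to picture "the longer string is greater" in terms of sort elements is the Python sketch below. It is a toy under loud assumptions: `EXPAND`, `IGNORE`, `TOKEN`, `elements` and `compare` are all hypothetical names, the data is invented, an element's "weight" is simply the code point order of its text, and the result is -1/0/1 rather than the CSTR_LESS_THAN, CSTR_EQUAL and CSTR_GREATER_THAN values that CompareString really returns. Expanding before contracting is also a shortcut that real collation data would not take.

```python
import re

# Hypothetical toy data, not the real Windows tables.
EXPAND = {"\u00c6": "AE", "\ufb03": "ffi"}   # one code point, several elements
IGNORE = {"\u00ad"}                          # soft hyphen, assumed weightless
TOKEN = re.compile("dzs|ch|.", re.DOTALL)    # contractions first, then singles


def elements(text):
    """Return the toy sort elements of text as a list of strings."""
    expanded = "".join(EXPAND.get(c, c) for c in text if c not in IGNORE)
    return TOKEN.findall(expanded)


def compare(a, b):
    """Compare element by element; return -1, 0 or 1."""
    ea, eb = elements(a), elements(b)
    for x, y in zip(ea, eb):
        if x != y:
            # Toy weight: plain code point order of the element's text.
            return -1 if x < y else 1
    # Equal until one element list ends: the one with more elements is greater.
    return (len(ea) > len(eb)) - (len(ea) < len(eb))


if __name__ == "__main__":
    print(compare("o\ufb03ce", "offic"))   # fewer code points, more elements
    print(compare("co\u00adop", "coop"))   # different lengths, same elements
```

In this toy, "o\ufb03ce" has fewer code points than "offic" yet compares as greater, because it is the longer string when measured in sort elements, and "co\u00adop" compares equal to "coop" even though the code point lengths differ.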

On the other hand, I would hate to suggest trying to inject the notion of sort elements into the Platform SDK just to have a nicer sentence in the one doc topic.

I guess that is what this blog is for. :-)

Lest you think it is all easy once you add this one "conceptual simplification", I promise to make it seem harder again while talking about the reverse diacritics used in French, the double compressions used in Hungarian, tricks with Jamo and Old Hangeul, the full story on Hiragana and Katakana, the stuff happening in Longhorn, and more.

But it is still a good start. This whole subject ought to be a lot easier, conceptually. Any subject that just about every single person in the world who can read is able to intuitively understand ought to be easier conceptually, even if most of those people cannot explain how it works. Maybe if they have been and plan to keep reading here, they will be able to. :-)

 

This post brought to you by "ﬃ" (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI)


# silverpie on 20 Jul 2005 8:46 AM:

This is actually on the "reverse diacritics" post--the circumflexes have not been abolished from the words you mention there, only from words where no preëxisting word is identical but for the circumflex (in other words, all circumflexes needed to distinguish words remain).


