by Michael S. Kaplan, published on 2005/07/20 04:15 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/20/440842.aspx
Yesterday I contrasted sort elements and text elements. I am now going to leave text elements aside for a bit, because linguistic collation on Windows, at its heart, is an ordering based on sort elements, not text elements.
Every time I look at the text in the Platform SDK for the CompareString function, I cannot help but smile:
If the two strings are different lengths, they are compared up to the length of the shortest one. If they are equivalent at that point, the return value indicates that the longer string is greater.
The truth is that all throughout the string, it is the sort elements that are being compared. There are times that one code point actually represents two sort elements (think Æ, a.k.a. U+00c6 a.k.a. LATIN CAPITAL LETTER AE in some languages) or three sort elements (think ﬃ, a.k.a. U+fb03 a.k.a. LATIN SMALL LIGATURE FFI in other languages). There are other times that two code points (think ch in Traditional Spanish) or three code points (think dzs in Hungarian) make up a single sort element. Other times code points have no weight and they are ignored entirely, having no sort element at all.
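To make the three cases concrete, here is a toy sketch in Python. It is not how Windows does it and it uses none of the real collation data: the tables `EXPANSIONS`, `CONTRACTIONS`, and `IGNORABLE` and the function `sort_elements` are hypothetical names made up for illustration, and ZERO WIDTH JOINER is only an arbitrary stand-in for a code point with no weight.

```python
# Toy tables only. These are NOT the real Windows collation tables;
# every name and entry here is a hypothetical stand-in.
EXPANSIONS = {
    "\u00c6": ["A", "E"],        # LATIN CAPITAL LETTER AE -> two sort elements
    "\ufb03": ["f", "f", "i"],   # LATIN SMALL LIGATURE FFI -> three sort elements
}
CONTRACTIONS = ["dzs", "ch"]      # several code points, one sort element (longest first)
IGNORABLE = {"\u200d"}            # assumed weightless for this sketch


def sort_elements(text):
    """Split text into toy sort elements using the tables above."""
    elements = []
    i = 0
    while i < len(text):
        for seq in CONTRACTIONS:
            if text.startswith(seq, i):
                # Several code points collapse into a single sort element.
                elements.append(seq)
                i += len(seq)
                break
        else:
            ch = text[i]
            if ch in EXPANSIONS:
                # One code point expands into several sort elements.
                elements.extend(EXPANSIONS[ch])
            elif ch not in IGNORABLE:
                # The ordinary case: one code point, one sort element.
                elements.append(ch)
            # Ignorable code points contribute no sort element at all.
            i += 1
    return elements


# Code point count versus sort element count for each case.
for sample in ["\u00c6", "\ufb03", "ch", "dzs", "a\u200db"]:
    print(len(sample), len(sort_elements(sample)))
# prints: 1 2, 1 3, 2 1, 3 1, 3 2 (one pair per line)
```

Even in this tiny model, the number of code points and the number of sort elements disagree in every one of the five samples.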
So if each code point will have between 0 and 3 sort elements (with fractional values supported), it is hard to equate string length with anything beyond knowing when to stop looking. The string length is definitely not a count of relevant elements to consider!
It makes the notion of that sentence from the documentation almost comical. Since CompareString is looking at each string, one sort element at a time, the only length that is meaningful to it is the length in sort elements; it is only when the sort elements are equivalent until one string ends that the issue with the longer string being greater comes into play.
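A rough sketch of that idea in Python, under heavy simplifying assumptions: the function `compare_sort_elements` is a hypothetical name, each sort element is reduced to a single made-up integer weight, and there is only one comparison level (the real thing weighs more than that). It returns -1, 0, or 1 rather than the `CSTR_*` values that CompareString uses.

```python
def compare_sort_elements(a, b):
    """Compare two sequences of toy sort-element weights; return -1, 0 or 1."""
    for x, y in zip(a, b):
        if x != y:
            # The first differing sort element decides the result.
            return -1 if x < y else 1
    # Equal up to the end of the shorter sequence: the one with more
    # sort elements (not more code points) is the greater one.
    return (len(a) > len(b)) - (len(a) < len(b))


# Made-up weights for illustration only: f = 10, i = 20.
F, I = 10, 20

ligature_ffi = [F, F, I]   # U+FB03: one code point, three sort elements
plain_ff = [F, F]          # "ff": two code points, two sort elements

print(compare_sort_elements(ligature_ffi, plain_ff))   # prints 1
print(compare_sort_elements(plain_ff, ligature_ffi))   # prints -1
print(compare_sort_elements(ligature_ffi, [F, F, I]))  # prints 0
```

Counted in code points, "ff" is the longer string, yet the single-code-point ligature compares greater here because it is the longer one in sort elements.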
On the other hand, I would hate to suggest trying to inject the notion of sort elements into the Platform SDK just to have a nicer sentence in the one doc topic.
I guess that is what this blog is for. :-)
Now, lest you think it is all easy once you add this one "conceptual simplification", I promise to make it seem harder again while talking about the reverse diacritics used in French, the double compressions used in Hungarian, tricks with Jamo and Old Hangeul, the full story on Hiragana and Katakana, the stuff happening in Longhorn, and more.
But it is still a good start. This whole subject ought to be a lot easier, conceptually. Any subject that just about every single person in the world who can read is able to intuitively understand ought to be easier conceptually, even if most of those people cannot explain how it works. Maybe if they have been reading here and plan to keep doing so, they will be able to. :-)
This post brought to you by "ﬃ" (U+fb03, a.k.a. LATIN SMALL LIGATURE FFI)