by Michael S. Kaplan, published on 2005/07/19 05:13 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/19/440320.aspx
Text elements have been called that since version 1.0 of the .NET Framework. MSDN defines them as follows:
The .NET Framework defines a text element as a unit of text that is displayed as a single character; that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. The Unicode Standard defines a surrogate pair as a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high-surrogate and the second is a low-surrogate. The Unicode Standard defines a combining character sequence as a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character. For more information on surrogate pairs and combining character sequences, see The Unicode Standard at http://www.unicode.org.
In Unicode, what the .NET Framework refers to as a text element is referred to as either a grapheme cluster or a supplementary character. In Unicode it would not be common to put supplementary characters (surrogate pairs) under the same category as grapheme clusters, since as even the definition above claims:
A surrogate pair can represent a base character or a combining character.
This makes it a little weird in terms of the text element definition if and when such code points are allocated, since people will start getting confused about the fact that a substring with a supplementary character is a text element just as that supplementary character plus a diacritic is also a text element. But this seeming weirdness is just that -- seeming. I mean, if you think about it, U+0063 (LATIN SMALL LETTER C) is a text element, but so is U+0063 followed by U+0301 (COMBINING ACUTE ACCENT). So maybe the definition could be cleaned up a little to get rid of the potential weirdness here. But there is no hurry just yet (not too many complaints to date).
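To make the "c" versus "c plus accent" point concrete, here is a minimal Python sketch. It is not the algorithm the .NET Framework uses; it assumes a simplified rule (a base code point plus any combining marks that follow it), and the helper names `text_elements` and `utf16_units` are made up for illustration.

```python
import unicodedata

def text_elements(s):
    """Split s into approximate text elements: a base code point plus any
    combining marks that follow it. Simplified rule, assumed for illustration."""
    elements = []
    for ch in s:
        # General categories Mn, Mc and Me are the combining marks.
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements

def utf16_units(s):
    """Count UTF-16 code units, which is the unit a .NET string length reports."""
    return len(s.encode("utf-16-le")) // 2

plain = "c"                   # U+0063
accented = "c\u0301"          # U+0063 + COMBINING ACUTE ACCENT
supplementary = "\U00010400"  # a code point outside the BMP (a surrogate pair in UTF-16)

for sample in (plain, accented, supplementary, supplementary + "\u0301"):
    # Each sample is one text element, whatever its code unit count.
    print(len(text_elements(sample)), utf16_units(sample))
```

Under this simplified rule all four samples count as one text element, even though they take one, two, two and three UTF-16 code units respectively.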
There is one crucial part of the definition of a text element that is not covered which probably ought to be, especially in contrast to a sort element, the contender in the blue corner. I'll get to it in a minute.
A sort element is a code point or combination of code points that a user thinks of as a character. Sometimes a sort element is also a text element, other times it is not (a good example of one that is not would be the Traditional Spanish "ch", which sorts as a single unit but is made up of two text elements). Sort elements are the basis of collation support on Windows, and since there is no function or method that specifically enumerates them, they are usually thought of only in terms of the special "exception" or "compression" entries that a given locale has. This is a little unfair, since every letter that is used could be a sort element, just like it could be a text element.
I've talked about both of these items before, but really not done much to give a rigorous way to determine the difference between them other than to say that you can detect one but not the other with parts of the managed globalization APIs. But I'll tie it all together now, and lay out the difference between a text element and a sort element. Ready? Here goes....
A text element is based on the script and is independent of language. It always exists even if it has no specific impact on collation.
A sort element is based on the language and is therefore entirely dependent on language. It really only exists when it is defined for the language.
(and the penny drops!)
This is the reason why I was so unhappy the other day about the fact that FoldString does not take an LCID -- because the ligature expansion that takes place is entirely language dependent in practice when used in collation but is language independent when FoldString is used to do the expansion. Even though they ought to be based entirely on language and not script. Darn.
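The language dependence of sort elements can be sketched in a few lines of Python. Everything here is an assumption for illustration: the `CONTRACTIONS` table, its language labels, and the `sort_elements` helper are hypothetical, and real collation data on Windows is far larger and is not exposed this way. The sketch also treats each code point as a candidate unit and ignores combining marks.

```python
# Hypothetical per-language contraction tables (illustrative only).
CONTRACTIONS = {
    "es-traditional": {"ch"},  # Traditional Spanish treats "ch" as one unit
    "en": set(),               # no contractions defined for English
}

def sort_elements(s, language):
    """Greedy longest-match split of s into sort elements for a language."""
    contractions = CONTRACTIONS.get(language, set())
    longest = max((len(c) for c in contractions), default=1)
    elements = []
    i = 0
    while i < len(s):
        # Try the longest candidate first so "ch" wins over "c".
        for size in range(min(longest, len(s) - i), 0, -1):
            chunk = s[i:i + size]
            if size == 1 or chunk.lower() in contractions:
                elements.append(chunk)
                i += size
                break
    return elements

print(sort_elements("chico", "es-traditional"))  # ['ch', 'i', 'c', 'o']
print(sort_elements("chico", "en"))              # ['c', 'h', 'i', 'c', 'o']
```

The same string yields different sort elements depending on the language passed in, while its text elements would be the same five letters in both cases.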
The two constructs have a few things in common:
And they have a few points of contrast (when a sort element is not also a text element, of course):
Probably enough to distinguish between these two types, for now.... :-)
This post brought to you by "ʤ" (U+02A4, a.k.a. LATIN SMALL LETTER DEZH DIGRAPH)