Sort element vs. text element

by Michael S. Kaplan, published on 2005/07/19 05:13 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/19/440320.aspx

Text elements have been called that since version 1.0 of the .NET Framework. MSDN defines them as follows:

The .NET Framework defines a text element as a unit of text that is displayed as a single character; that is, a grapheme. A text element can be a base character, a surrogate pair, or a combining character sequence. The Unicode Standard defines a surrogate pair as a coded character representation for a single abstract character that consists of a sequence of two code units, where the first unit of the pair is a high-surrogate and the second is a low-surrogate. The Unicode Standard defines a combining character sequence as a combination of a base character and one or more combining characters. A surrogate pair can represent a base character or a combining character. For more information on surrogate pairs and combining character sequences, see The Unicode Standard at http://www.unicode.org.

In Unicode, what the .NET Framework refers to as a text element is referred to as either a grapheme cluster or a supplementary character. In Unicode it would not be common to put supplementary characters (surrogate pairs) under the same category as grapheme clusters, since as even the definition above claims:

A surrogate pair can represent a base character or a combining character.

This makes the definition of a text element a little weird if and when such code points are allocated, since people will start getting confused about the fact that a substring containing a supplementary character is a text element, just as that supplementary character plus a diacritic is also a text element. But this seeming weirdness is just that -- seeming. I mean, if you think about it, U+0063 (LATIN SMALL LETTER C) is a text element, but so is U+0063 followed by U+0301 (COMBINING ACUTE ACCENT). So maybe the definition could be cleaned up a little to get rid of the potential weirdness here. But there is no hurry just yet (not too many complaints to date).
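To make the "c" versus "c plus accent" point concrete, here is a rough sketch in Python. It is not the .NET algorithm: the helper name text_elements is made up for illustration, and it simply assumes that any code point in a Mark general category (Mn, Mc, Me) attaches to the code point before it. Python strings are indexed by code point, so a supplementary character shows up as one item here, even though it takes two UTF-16 code units (a surrogate pair).

```python
import unicodedata

def text_elements(s):
    """Split s into approximate text elements (hypothetical helper).

    Simplification: a text element is a base code point plus any following
    code points whose general category is a Mark (Mn, Mc, Me). This is not
    the full .NET or Unicode grapheme cluster logic.
    """
    elements = []
    for ch in s:
        if elements and unicodedata.category(ch).startswith("M"):
            # Combining character: attach it to the preceding element.
            elements[-1] += ch
        else:
            # Base character: start a new element.
            elements.append(ch)
    return elements

plain = "c"                    # U+0063
accented = "c\u0301"           # U+0063 + U+0301 COMBINING ACUTE ACCENT
supplementary = "\U0001D400"   # a code point outside the BMP

print(text_elements(plain))                      # one element, one code point
print(text_elements(accented))                   # one element, two code points
print(text_elements(supplementary + "\u0301"))   # still one element

# The supplementary character needs two UTF-16 code units (a surrogate pair).
print(len(supplementary.encode("utf-16-le")) // 2)
```

Under these assumptions both "c" and "c" plus the accent come back as a single element, which is exactly the seeming weirdness described above.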

There is one crucial part of the definition of a text element that is not covered which probably ought to be, especially in contrast to a sort element, the contender in the blue corner. I'll get to it in a minute.

A sort element is a code point or combination of code points that a user thinks of as a character. Sometimes a text element is a sort element, other times it is not (a good example of one that is not would be the Traditional Spanish "ch"). Sort elements are the basis of collation support on Windows and since there is no function or method that specifically enumerates them, they are usually thought of only in terms of the special "exception" or "compression" entries that a given locale has. This is a little unfair since every letter that is used could be a sort element, just like it could be a text element.

I've talked about both of these items before, but have not really done much to give a rigorous way to determine the difference between them, other than to say that you can detect one but not the other with parts of the managed globalization APIs. But I'll tie it all together now, and lay out the difference between a text element and a sort element. Ready? Here goes....

A text element is based on the script and is independent of language. It always exists even if it has no specific impact on collation.

A sort element is based on the language and is therefore entirely dependent on language. It really only exists when it is defined for the language.

(and the penny drops!)
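A toy illustration of that language dependence, in Python. Everything here is an assumption for the sake of the example: the SORT_ELEMENTS table, the language labels, and the sort_elements helper are all hypothetical, and real collation data on Windows is much larger and is not exposed this way. The only entry is the Traditional Spanish "ch" mentioned earlier.

```python
# Hypothetical per-language tables of multi-code-point sort elements.
# The labels and contents are illustrative only, not real Windows data.
SORT_ELEMENTS = {
    "es-traditional": ["ch"],
    "en": [],
}

def sort_elements(s, language):
    """Split s into sort elements for the given language (toy sketch)."""
    # Try longer sequences first so they win over their own prefixes.
    sequences = sorted(SORT_ELEMENTS.get(language, []), key=len, reverse=True)
    result = []
    i = 0
    while i < len(s):
        for seq in sequences:
            if s.startswith(seq, i):
                # The language defines this sequence as one sort element.
                result.append(seq)
                i += len(seq)
                break
        else:
            # No special entry: the single code point stands alone.
            result.append(s[i])
            i += 1
    return result

print(sort_elements("chico", "es-traditional"))  # "ch" is kept together
print(sort_elements("chico", "en"))              # "c" and "h" are separate
```

The same string splits differently depending on the language passed in, while a text element split of "chico" would be the same five pieces no matter what.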

This is the reason why I was so unhappy the other day about the fact that FoldString does not take an LCID: the ligature expansion that takes place is entirely language dependent in practice when used in collation, but is language independent when FoldString is used to do the expansion, even though such expansions ought to be based entirely on language and not script. Darn.

The two constructs have a few things in common:

users think of both of them as being "characters" no matter how many code points are used to make them;
in some cases, character sequences are both of them.

And they have a few points of contrast (when a sort element is not also a text element, of course):

Text elements are in many cases subject to normalization, while sort elements never are;
Text elements can be detected by the Unicode character properties of their underlying characters, while sort elements cannot;
Text elements are a defined term in the .NET Framework, while sort elements are currently only defined as a term in Microsoft software on this blog and in presentations that Cathy Wissink and I have done about collation;
There are functions to directly detect text elements, but none to directly detect sort elements.
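The first two points of contrast can be checked quickly with Python's standard unicodedata module. This is only a small sketch using the examples already mentioned in this post: "c" plus U+0301 as a text element, and "ch" as a sort element that is not a text element.

```python
import unicodedata

accented = "c\u0301"   # text element: c + COMBINING ACUTE ACCENT
digraph = "ch"         # sort element in Traditional Spanish, two text elements

# Normalization affects the text element: NFC composes it to U+0107.
composed = unicodedata.normalize("NFC", accented)
print(len(accented), len(composed))

# Normalization leaves the sort element alone in every normalization form.
forms = ["NFC", "NFD", "NFKC", "NFKD"]
print(all(unicodedata.normalize(f, digraph) == digraph for f in forms))

# Character properties reveal the text element: U+0301 is a nonspacing mark.
print(unicodedata.category("\u0301"))

# Nothing in the properties of "c" or "h" hints that "ch" belongs together.
print([unicodedata.category(ch) for ch in digraph])
```

Both letters of "ch" are plain lowercase letters as far as their properties go, so the only way to know they form a sort element is to know the language.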

Probably enough to distinguish between these two types, for now.... :-)

This post brought to you by "ʤ" (U+02A4, a.k.a. LATIN SMALL LETTER DEZH DIGRAPH)


