by Michael S. Kaplan, published on 2005/04/07 02:03 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/07/406060.aspx
The definition of many words depends on the context.
For example, a typographer (or a dictionary) might consider a ligature to be a character consisting of two or more letters combined into one.
But if you work with keyboards on Windows then you might fall into the trap of the "ligature table" and think that a ligature is just any string of two to four UTF-16 code points that you stick on a key to appear sequentially.
If you follow international industrial standards, you may consider Unicode to be a standard that "...provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language."
But if you are dictionary.com or you only work on Microsoft platforms, then to you Unicode may seem to be a "...16-bit character set standard...", with no real thought given to the fact that the code space goes beyond that.
If you deal with fonts or GDI APIs, then you may think of a charset as something that is not exactly a code page; but if you are building an HTML page, then you may think it is exactly a code page.
And all of this leads up to the actual point of this post....
In the suggestion box, Eusebio Rufian-Zilbermann asked a question:
What is a "character" exactly?
During the Windows Security Push, Microsoft introduced a set of new string handling functions that use "character counts" (declared in strsafe.h together with other functions). Did anybody from Globalization Infrastructure (or a previous incarnation, back in 2002) provide feedback into using the term "character count"? The problem I see is that counting "characters" only makes sense for initialized strings and it gets messy when talking about uninitialized items. Specifying that the size of an uninitialized buffer is X number of "characters" is just trouble waiting to happen as soon as surrogates (or "old-style" multibyte) get into the picture. At least the library documentation should be clearer that the destination size is really specified in number of _TCHARs (which is not always the same as number of characters).
At some levels, like the level of an API that accepts a cch or "count of characters", a character is a UTF-16 code point, on Microsoft platforms a WCHAR or wide character. If you are the caller of such an API then your chief concerns will be about buffer sizes and their allocation, so you would probably be on that same level.
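To see the gap between those two counts, here is a small sketch in Python (the string and character names are just illustrative choices, not from the original post). Python's `len` counts Unicode code points, while a Windows `cch` counts UTF-16 code units (WCHARs), which we can simulate by encoding to UTF-16:

```python
# One supplementary-plane character (U+1D11E, MUSICAL SYMBOL G CLEF)
# followed by the ASCII letter "A".
s = "\U0001D11EA"

# Python's len() counts Unicode code points: 2.
code_points = len(s)

# A Windows cch counts UTF-16 code units (WCHARs). U+1D11E is outside
# the BMP, so it takes a surrogate pair -- two WCHARs -- giving 3.
utf16_units = len(s.encode("utf-16-le")) // 2

print(code_points, utf16_units)  # 2 3
```

So a buffer sized for "2 characters" would be one WCHAR too small for this two-code-point string, which is exactly the trap the question above is pointing at.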
But if you are an API that has to move the cursor intelligently, or a user moving the cursor, then a character may well be made up of more than one of these code points. It will be what the .NET Framework calls in its StringInfo and TextElementEnumerator classes a "text element".
This is also what Mark Davis (president of the Unicode Consortium), in an effort to separate the two definitions, has dubbed a "grapheme cluster": a "...particular text element defined in Unicode Standard Annex #29, “Text Boundaries,” consisting of any of the following: an atomic character, a combining character sequence consisting of a base character plus one or more nonspacing marks or enclosing marks, or a sequence of Hangul jamos equivalent to a Hangul syllable."
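A rough sketch of that idea, in Python: the function below lumps a base character together with any combining marks that follow it. This is only an approximation of the UAX #29 rules (it ignores Hangul jamo sequences and other refinements), and the function name and sample string are my own invention for illustration:

```python
import unicodedata

def text_elements(s):
    """Crude approximation of text-element / grapheme-cluster
    segmentation: a base character plus any trailing combining marks.
    The real rules are in Unicode Standard Annex #29."""
    elements = []
    for ch in s:
        if elements and unicodedata.combining(ch):
            elements[-1] += ch   # attach the mark to its base character
        else:
            elements.append(ch)
    return elements

s = "Ze\u0301ro"  # "Zéro" spelled with e + U+0301 COMBINING ACUTE ACCENT
print(len(s))                  # 5 code points
print(len(text_elements(s)))   # 4 text elements
```

The cursor-movement case above is exactly this second count: pressing an arrow key should step over "é" as one unit, even though it is two code points.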
(I must admit that every time I hear the term I cannot help imagining a bunch of Keebler elves coming out of Mark's office thanking him for giving their new cookie line a great name!)
In fact, it is fair to say that any time the question "how long is the string?" is asked, there are two completely different answers, and both of them are nevertheless right. All depending on the context.
This post brought to you by "œ" (U+0153, a.k.a. LATIN SMALL LIGATURE OE)
That was LATIN SMALL LIGATURE OE, rush chairman. He was damn glad to meet you...