UCS-2 to UTF-16, Part 2: A&P of a 'linguistic character'

by Michael S. Kaplan, published on 2008/09/15 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/09/15/8952073.aspx

Previous blogs in this series of blogs on this Blog:

A&P in the title stands for Anatomy and Physiology, since in some alternate universe I went ahead and got a medical degree and made a good friend (a friend who, in that alternate universe, is still alive) proud of me. Ignore it, the deeper meaning of the title, even when it exists, isn't really important. :-)

Now that I've told everyone thinking "Let's update to support UTF-16 instead of UCS-2" that they need to just back the hell off a few steps (that was the previous blog), I thought it might be good to go a little deeper so you can see that even though you may have been completely and totally wrong, there is a good basis for thinking the way you were, and you can use that knowledge to feel better about future steps. :-)

In theory, there is very little difference between the general case of linguistic character as I defined it last time and the specific case that got everyone freaking out about UCS-2 (surrogate pairs).

In practice, all linguistic characters fall into one of ~~two~~three categories:

1. A Surrogate Pair (two code units, a high and a low surrogate), neither of which is itself a character, linguistic or otherwise. The cheese may stand alone, but surrogate code units didn't teach it how, if you know what I mean;
2. A Grapheme Cluster (to use Unicode's term) aka Text Element (to use Microsoft's in the .NET Framework), made of two or more code units, at least some though not all of which can be independently thought of as being linguistic characters themselves;
3. A Sort Element (to use my term, via this blog) aka Compression (to use Microsoft's term) aka Contraction (to use Unicode's), made up of two or more code units, all of which can be independently thought of as being linguistic characters themselves.

To show an example of each:

1. 𐎀, aka UGARITIC LETTER ALPA, aka U+10380, aka U+d800 U+df80 -- this one is four bytes in UTF-8, two code units in UTF-16, and one code unit in UTF-32 -- interestingly, always four bytes!
2. Ṹ, aka the fully decomposed form of U+1e78 (LATIN CAPITAL LETTER U WITH TILDE AND ACUTE), aka U+0055 U+0303 U+0301 -- this one is five bytes in UTF-8, three code units or six bytes in UTF-16, and three code units or 12 bytes in UTF-32;
3. dzs, a sequence of letters that collates together in Hungarian, aka U+0064 U+007a U+0073 -- this one is three bytes in UTF-8, three code units or six bytes in UTF-16, and three code units or 12 bytes in UTF-32.
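If you want to check those numbers yourself, here is a minimal Python sketch. The helper names (encoded_sizes, surrogate_pair) are my own invention for illustration and not from any library; the surrogate arithmetic is the standard UTF-16 formula.

```python
import unicodedata

def encoded_sizes(text):
    """Return (UTF-8 bytes, UTF-16 code units, UTF-32 code units) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per code unit
    utf32_units = len(text.encode("utf-32-le")) // 4  # 4 bytes per code unit
    return utf8_bytes, utf16_units, utf32_units

def surrogate_pair(code_point):
    """Split a supplementary code point (>= U+10000) into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

alpa = "\U00010380"                                      # UGARITIC LETTER ALPA
u_tilde_acute = unicodedata.normalize("NFD", "\u1e78")   # U+0055 U+0303 U+0301
dzs = "dzs"                                              # U+0064 U+007A U+0073

print(encoded_sizes(alpa))           # (4, 2, 1)
print(encoded_sizes(u_tilde_acute))  # (5, 3, 3)
print(encoded_sizes(dzs))            # (3, 3, 3)
print([hex(u) for u in surrogate_pair(0x10380)])  # ['0xd800', '0xdf80']
```

Multiply the UTF-16 counts by two and the UTF-32 counts by four to get the byte sizes quoted in the list.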

Now one can argue at length about the relative consequences of truncating any of these sequences of code units. You might even make an argument that truncation is most serious in the first case and then gets less and less serious as you go down the list.

Truncation in this case is a superset of any operation that splits apart the component pieces before a user's eyes, including cursor movement through the string, deletion of a single "character" via the delete key, cutting off the end to fit in a buffer, or whatever. Anything that would show a lack of respect for a linguistic character's boundaries. Everyone gets involved here -- fonts, keyboards, you name it...

From one point of view you might be right if that is your argument.

But as long as we are choosing to call them linguistic characters I am going to channel that Spock-with-a-beard version of me that managed to avoid the scandal with the Dean's daughter and got a PhD in linguistics, and claim that they each have the potential to have unique meaning to a user who took the time and effort to put them into data.

In my opinion, you get no points for vicious truncation just because it doesn't look as bad.

In which case anyone with an eager willingness to truncate should consider themselves to be a bloodthirsty linguistic character murderer. Sentence suspended by me since there really isn't a competent court with the authority to punish for this crime. :-)

Because if you are working on or using a computer program displaying or storing or in any way using data then you have a right to not have someone change the meaning of that data in the name of expediency.

And truncating a linguistic character has the potential to do just that.
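To make that concrete, here is a small Python sketch of what a naive cut does to each of the three examples. The helpers utf16_units and ends_with_lone_high_surrogate are hypothetical names of mine, and the cut positions are picked by hand purely to illustrate the damage.

```python
import unicodedata

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def ends_with_lone_high_surrogate(units):
    """True if the final code unit is a high surrogate with no low surrogate after it."""
    return bool(units) and 0xD800 <= units[-1] <= 0xDBFF

# Category 1: cutting after one code unit leaves half of a surrogate pair.
alpa_cut = utf16_units("\U00010380")[:1]
print([hex(u) for u in alpa_cut], ends_with_lone_high_surrogate(alpa_cut))
# ['0xd800'] True

# Category 2: dropping the last combining mark yields a different letter.
decomposed = unicodedata.normalize("NFD", "\u1e78")  # U+0055 U+0303 U+0301
cut = unicodedata.normalize("NFC", decomposed[:2])
print(unicodedata.name(cut))  # LATIN CAPITAL LETTER U WITH TILDE

# Category 3: every piece is still a valid letter, just not the one that was typed.
print("dzs"[:2])  # dz
```

Only the first cut produces something ill-formed; the other two produce perfectly valid text that simply says something else.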

Okay, now that I have been all crazy about this, I'll point out that only the first two of these three categories have any supported way for a program looking for safe truncation points to detect them.
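As a rough sketch of what that detection can look like, the Python below backs a proposed cut point up until it no longer splits a surrogate pair or separates a base character from the marks that follow it. The function name safe_cut is hypothetical, and treating "any code point in a Mark category" as the test for category two is my own simplifying assumption: it is much cruder than the full Unicode grapheme cluster rules (it does nothing for decomposed Hangul, for example), and it makes no attempt at category three.

```python
import unicodedata

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]

def safe_cut(units, limit):
    """Largest index <= limit that does not split a surrogate pair or cut
    in front of a combining mark (simplified approximation, see above)."""
    cut = min(limit, len(units))
    while 0 < cut < len(units):
        nxt = units[cut]
        if 0xDC00 <= nxt <= 0xDFFF:
            cut -= 1  # next unit is a low surrogate: we are mid-pair
            continue
        cp = nxt
        if (0xD800 <= nxt <= 0xDBFF and cut + 1 < len(units)
                and 0xDC00 <= units[cut + 1] <= 0xDFFF):
            # Reassemble the supplementary code point that starts here.
            cp = 0x10000 + ((nxt - 0xD800) << 10) + (units[cut + 1] - 0xDC00)
        if unicodedata.category(chr(cp)).startswith("M"):
            cut -= 1  # next code point is a mark attached to what precedes it
            continue
        break
    return cut

# "A", then UGARITIC LETTER ALPA, then decomposed U with tilde and acute.
units = utf16_units("A\U00010380U\u0303\u0301")
print(safe_cut(units, 2))  # 1 (index 2 would split the surrogate pair)
print(safe_cut(units, 5))  # 3 (indexes 4 and 5 would strip combining marks)
print(safe_cut(units, 6))  # 6 (the whole string fits)
```

Note that for "dzs" this function happily returns whatever limit you hand it, which is exactly the problem with the third category.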

Which means if I made you feel guilty, you can take some solace in the fact that just about everyone is going to be doing it some of the time....

But it is worth keeping that fact in mind while you carefully do your best to avoid problems with the categories that you can easily help with.

Okay, that's it for now. Next time I'll talk about those various operations and how to go about them....

This blog brought to you by Ṹ (U+1e78, aka LATIN CAPITAL LETTER U WITH TILDE AND ACUTE)

# Mihai on 15 Sep 2008 1:29 PM:

And let's not forget the IVS (Ideographic Variation Sequences) :-)

# Michael S. Kaplan on 15 Sep 2008 1:35 PM:

No worries there -- for our present purposes, they fall into Category #2. :-)

# John Cowan on 15 Sep 2008 2:17 PM:

"[...] at least some though not all of which can be independently thought of as being linguistic characters themselves" isn't strictly true. Decomposed Korean syllables are grapheme clusters, but each component is a linguistic character.

# Michael S. Kaplan on 15 Sep 2008 2:54 PM:

Ah yes, that is true. Though most users would consider the net effect of truncation to be just as destructive to meaning....

# John Cowan on 15 Sep 2008 10:23 PM:

Well, the same is true of almost any sort of trunca

# Michael S. Kaplan on 16 Sep 2008 12:18 AM:

Exactly my point. :-)


referenced by

2009/06/29 UCS-2 to UTF-16, Part 11: Turning it up to Eleven!

2009/06/10 UCS-2 to UTF-16, Part 10: Variation[ Selector] on a theme...

2008/12/16 UCS-2 to UTF-16, Part 9: The torrents of breaking CharNext/CharPrev

2008/12/09 UCS-2 to UTF-16, Part 8: It's the end of the string as we know it (and I feel ellipses)

2008/12/04 UCS-2 to UTF-16, Part 7: If it makes the SQL Server columns too small then it made the Oracle columns either too smallER or too smallEST

2008/11/24 UCS-2 to UTF-16, Part 6: An exercise left for whoever needs some exercise

2008/10/15 UCS-2 to UTF-16, Part 5: What's on the Next Level?

2008/10/06 UCS-2 to UTF-16, Part 4: Talking about the ask

2008/09/18 UCS-2 to UTF-16, Part 3: It starts with cursor movement (where MS simultaneously gets better and worse)
