UCS-2 to UTF-16, Part 1: Getting the obvious out of the way

by Michael S. Kaplan, published on 2008/09/08 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/09/08/8931641.aspx

Previous blogs in this series of blogs on this Blog:

All of my regular readers, or even cursory followers, know that by the end of this series I'll be talking you into all kinds of corner cases.

Here in the beginning, I am going to start with the obvious.

There is one reasonably simple flaw in the whole desire to "move from UCS-2 to UTF-16", one that really should come out now.

The premise is complete and utter crap.

Yes, that's right.

The premise is complete and utter crap.

For the vast majority of all operations, for the great bulk of possible things that computer programs do with text, support of Unicode is all one needs.

The issue is largely that people stare into the abyss as they think beyond the BMP of Unicode, forgetting, in their rush to believe there is a huge work item to consider, that this is not a battle; UCS-2 vs. UTF-16 is not quite Kramer vs. Kramer.

For most of what happens, the project was done before it started and you were freaking out for nothing.

Asking the question (as was done, for example, back in 2005) "Is SQL Server really supporting UTF-16?" really misses the point that for most operations, it is.

All that UTF-16 adds to the whole situation is a single example of a problem that exists in UCS-2 and actually in UTF-8 and UTF-32, as well as in UTF-16.

The problem is the question Raymond Chen raised 20 months ago in his blog post "What('s) a character!".

It is that most times a character is a storage character, and that occasionally it is a linguistic character. Forget about the "base character combined with a buttload of diacritics" scenario; there are plenty of valid scenarios too. But none of them are the default case, or the most common scenarios.
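
To make the storage character vs. linguistic character split concrete, here is a minimal Python sketch (the helper name storage_units and the two sample strings are my own illustration, not anything from the posts mentioned above). It counts how many code units each encoding form needs for a single linguistic character, and shows that "one character, several storage units" is not unique to UTF-16:

```python
def storage_units(text: str) -> dict:
    """Count the code units each encoding form needs to store text."""
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit code units
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit code units
    }

# One linguistic character made of two BMP code points:
# "e" followed by U+0301 COMBINING ACUTE ACCENT.
e_acute = "e\u0301"

# One linguistic character outside the BMP:
# U+1D11E MUSICAL SYMBOL G CLEF, a surrogate pair in UTF-16.
g_clef = "\U0001D11E"

print(storage_units(e_acute))  # {'utf-8': 3, 'utf-16': 2, 'utf-32': 2}
print(storage_units(g_clef))   # {'utf-8': 4, 'utf-16': 2, 'utf-32': 1}
```

Note that the combining sequence takes more than one unit even in UTF-32, and it takes two 16-bit units whether you call the encoding UCS-2 or UTF-16.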

To be honest, after considering situations like "Vowel DISharmony?, aka The case of the missing dot", anyone who thinks that the biggest problem here is a split surrogate pair (which would show up as that square box, something like an unknown character) and not the linguistic characters potentially stripped of their actual meaning and validity likely needs a vacation.

They're important, sure. But they are not world stoppers (or if they are, then they were even back when the application claimed to support Unicode yet clipped diacritics just as indiscriminately and probably more often).

Thus the first and most important point is that we need to upgrade the question into something more meaningful, something that covers what we are actually trying to accomplish.

Next time I'll start talking about the cases where one ought to care, and how to make sure it is a productive kind of caring.
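
As a rough illustration of why the two failures are siblings, here is a small Python sketch of a naive "cut after N 16-bit units" operation. The helpers utf16_units and describe_cut are hypothetical names of my own, and the check only looks for a nonzero combining class, which is a simplification of what real text segmentation would do:

```python
import unicodedata

def utf16_units(text: str) -> list[int]:
    """Return text as a list of 16-bit UTF-16 code unit values."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def describe_cut(text: str, max_units: int) -> str:
    """Report what a naive cut after max_units code units would break."""
    units = utf16_units(text)
    if max_units <= 0 or max_units >= len(units):
        return "clean cut"
    last_kept, first_dropped = units[max_units - 1], units[max_units]
    # High surrogate kept, low surrogate dropped: the pair is split.
    if 0xD800 <= last_kept <= 0xDBFF and 0xDC00 <= first_dropped <= 0xDFFF:
        return "split surrogate pair"
    # Simplified check: the first dropped unit is a combining mark,
    # so the base character it belonged to has lost it.
    if not 0xD800 <= first_dropped <= 0xDFFF and unicodedata.combining(chr(first_dropped)):
        return "clipped combining mark"
    return "clean cut"

print(describe_cut("\U0001D11E", 1))  # split surrogate pair
print(describe_cut("e\u0301", 1))     # clipped combining mark
print(describe_cut("ab", 1))          # clean cut
```

The second case involves nothing outside the BMP, so code that only ever claimed UCS-2 support could hit it just as easily.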

This post brought to you by ı (U+0131, a.k.a. LATIN SMALL LETTER DOTLESS I)

referenced by

2009/06/29 UCS-2 to UTF-16, Part 11: Turning it up to Eleven!

2009/06/10 UCS-2 to UTF-16, Part 10: Variation[ Selector] on a theme...

2008/12/16 UCS-2 to UTF-16, Part 9: The torrents of breaking CharNext/CharPrev

2008/12/09 UCS-2 to UTF-16, Part 8: It's the end of the string as we know it (and I feel ellipses)

2008/12/04 UCS-2 to UTF-16, Part 7: If it makes the SQL Server columns too small then it made the Oracle columns either too smallER or too smallEST

2008/11/24 UCS-2 to UTF-16, Part 6: An exercise left for whoever needs some exercise

2008/10/15 UCS-2 to UTF-16, Part 5: What's on the Next Level?

2008/10/06 UCS-2 to UTF-16, Part 4: Talking about the ask

2008/09/18 UCS-2 to UTF-16, Part 3: It starts with cursor movement (where MS simultaneously gets better and worse)

2008/09/15 UCS-2 to UTF-16, Part 2: A&P of a 'linguistic character'