UCS-2 to UTF-16, Part 11: Turning it up to Eleven!

by Michael S. Kaplan, published on 2009/06/29 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2009/06/29/9800913.aspx



Back when I was in school oh those many years ago, I remember learning all kinds of rules about writing essays and position papers and really anything meant to convince people of a point of view.

You know, all that "Step 1: Tell 'em what you're gonna tell 'em, Step 2: Tell 'em, Step 3: Tell 'em what you told 'em" and so on. Basically narrowing the world to the specific point you want to make, making the point, and then expanding that point to make its connection to the world clear. You get the point.

More recently, as I look at academic papers and books from people with advanced degrees, it is clear that they do something slightly different a lot of the time.

Rather than ending on a strong note that reinforces prior themes, they end with the things that are not yet explored, the things not fully done yet.

I originally looked at this as a sign of weakness -- why end with your weakest or least impressively thought out arguments, with the items that are not there yet?

But over time I have reconsidered this view; there is a certain strength in making it clear that there is more out there. Making it obvious that grownup problems can't always be wrapped up and delivered with a bow on them.

And that is where this last part of the whole UCS-2 to UTF-16 series will try to go.

Over many successive parts I have discussed, or dare I say it proven, that the people who think that UCS-2 to UTF-16 is a shorthand are 100% correct, and the people who think it is just about surrogate pairs are dead wrong.

UCS-2 to UTF-16 is about moving from Unicode code units to what the user will think of as a CHARACTER.

And about how to plan out software behavior in a way that lets ordinary users who wouldn't know Unicode from UNICEF see the behavior they expect based on what they know of their actual language, rather than their understanding of the limitations in computers over the last several decades.
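To make the gap between code units and what a user calls a CHARACTER concrete, here is a minimal Python sketch. The function names are made up for illustration, and the last function is only a rough stand-in for real text segmentation: it assumes that any code point in a Unicode Mark category attaches to the character before it, which falls well short of the full grapheme cluster rules in UAX #29.

```python
import unicodedata


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    # utf-16-le is used so that no byte order mark is counted.
    return len(text.encode("utf-16-le")) // 2


def code_points(text: str) -> int:
    """Number of Unicode code points (what Python's len() counts)."""
    return len(text)


def approx_user_characters(text: str) -> int:
    """Rough count of what a user might call characters.

    ASSUMPTION: a code point whose general category starts with "M"
    (Mn, Mc, Me) is treated as part of the preceding character. This is
    a simplification, not an implementation of UAX #29.
    """
    count = 0
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        # A mark with nothing before it still counts as one character.
        if not is_mark or count == 0:
            count += 1
    return count


if __name__ == "__main__":
    samples = {
        "plain A": "A",
        "n + combining diaeresis": "n\u0308",
        "musical G clef (supplementary)": "\U0001D11E",
    }
    for label, text in samples.items():
        print(label, utf16_code_units(text), code_points(text),
              approx_user_characters(text))
```

The supplementary character takes two code units but is one code point, while the letter with a combining mark is two code units and two code points; a user would call each of them one character. Surrogate pairs only explain the first of those two cases.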

But as I went through the series, I probably spent as much time pointing out failures in software to support this notion as I did successes. In software all over the place.

When you get down to it, we are still quite a long way away from this ideal that the average user would find most empowering; we still rely on people to conform to the limitations of our machines rather than causing those machines to conform to the understandings of the users.

So in a way I have already been talking about the places where Microsoft and all of the other software companies are weak and unfinished and not fully implemented or sometimes even understood!

So here, in Part 11 of the series, I will take this catalog of misunderstandings/bugs/failures and turn it up to eleven (to use the Spinal Tap expression) and go even further....

Beyond characters there are of course words, and phrases, and clauses, and sentences, and paragraphs, and pages.

And as I pointed out in blogs like The Bidi Algorithm's own SEP Field it is clear that once you get into issues more complicated than characters, we tend toward sucking just as badly.

Or maybe worse -- in the case of characters it is failure to live up to Unicode's definition; in the case of more complex operations like bidirectional text, a 100% conformant implementation will fall way short of typical native user expectations in many of even the simplest cases. We claim we are conformant, they say it requires higher level protocols to support reality, and thus we prove ourselves to be unable to reach the lofty goals of higher protocols.

We're too busy stuck in muck because we're following the standard and the standard considers it to be too much to handle.

How do we get past this and break the stalemate, exactly?

Unicode doesn't seem to be interested -- they regularly fiddle with UAX #9 to fix bizarre corner cases while never even attempting to tackle the easy cases like the ones I've been railing about all this time. The ones even a child can understand, like

C:\NAME ‎(BIG)‎\שם ‏(גדול)‏\NAME ‎(BIG)‎\שם ‏(גדול)‏

and such.
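For readers who have not seen what "placing control characters" looks like, here is a small hypothetical sketch in Python. It is not what Windows or any shipping product does, and the helper names are invented for illustration. It assumes one possible policy: surround every path separator with a LEFT-TO-RIGHT MARK (U+200E), so that the separator sits between two strong left-to-right characters and should resolve as left-to-right under the bidi algorithm. Whether the resulting display matches what a native user expects is exactly the open question.

```python
LRM = "\u200E"  # LEFT-TO-RIGHT MARK, an invisible strong left-to-right character
RLM = "\u200F"  # RIGHT-TO-LEFT MARK, its right-to-left counterpart


def mark_path_separators(path: str, sep: str = "\\") -> str:
    """Hypothetical policy: wrap each separator in LRMs.

    ASSUMPTION: the path is meant to read left to right as a whole, with
    only the individual names being right-to-left.
    """
    return path.replace(sep, LRM + sep + LRM)


def reveal_marks(text: str) -> str:
    """Make the invisible marks visible so the result can be inspected."""
    return text.replace(LRM, "<LRM>").replace(RLM, "<RLM>")


if __name__ == "__main__":
    # Escapes are used so the source itself contains no bidi surprises.
    sample = "C:\\NAME\\\u05E9\u05DD"
    marked = mark_path_separators(sample)
    print(reveal_marks(marked))
```

Note that the marks are extra code points stored in the string, so anything that compares, sorts, or measures the path has to know about them too. That bookkeeping is part of what makes a "higher level protocol" more work than it sounds.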

As I mentioned in a prior blog:

No one wants to do too much beyond Unicode even though plain Unicode alone (without making use of higher level protocols to place control characters) is insufficient for handling these cases....

Note that this is also one of the reasons RTL IDN is so complicated and looks so broken most of the time.

It all amounts to A place where everyone blows, equally.

So maybe Microsoft and the companies that claim to care about the end-to-end user experience should just choose to rise above this, to be high level protocols.

Because claiming that we are done with our core support and can now add more advanced features, when we can't even handle characters and sentences, is a little bit obnoxious of us, to say the least (especially if we aren't even trying to get better!).

Now that this series is officially done, I'll maybe try in some future blogs to give some of my thoughts about what it might mean to be a higher level protocol....


# ΤΖΩΤΖΙΟΥ on 29 Jun 2009 6:24 PM:

…and of course you meant: Spın̈al Tap. Tsk, tsk. Gotcha :)


referenced by

2010/11/24 UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16
