Should considering UTF-16 be harmful be considered harmful?

by Michael S. Kaplan, published on 2012/04/27 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/04/27/10298345.aspx

Like many of the people I know, I find myself looking over at Stack Overflow and related sites periodically, sites like programmers.stackexchange.com.

I'm usually pleased to pop in.

I'll admit it is seldom relevant to me these days, since I don't do so much dev work, and the work I do is usually just in my particular area.

But it still can be interesting.

Once in a while, it is even in my area!

For example, there is a question about code page conversion that touches on my "Pseudo Form V Normalization" issues here, and as far as I can tell the problem was solved independently (the article cites a few of my blogs but does not seem to notice the Form V ones)....

And like the other day, when I saw yet another pingback to my blogs "BACKSPACE vs. DELETE" and "I think MaxLength needs protection to assure safer text", the latter of which also includes a comment by regular reader Yuhong Bao pointing to the Stack Overflow article that this blog today is kind of about:

Should UTF-16 be considered harmful?

At the time, my comment to Yuhong Bao's link was:

I find that article to be rather naive, alarmist, and biased, myself.

This roughly mirrors my current feelings on the subject. :-)

One can argue about how complicated UTF-8 is given its crazy character boundaries.

Or about how huge UTF-32 is, with empty space in every character, and how it routinely fools people who should know better into thinking that it fixes the problems of UTF-16.

Or, as with that blog, about whether UTF-16 is harmful.
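
To make those trade-offs concrete, here is a minimal Python sketch. It is not from the Stack Overflow article or from my series; the sample string and the helper names `code_units` and `utf8_boundaries` are assumptions chosen purely for illustration. The sample holds two user-perceived characters: an "e" followed by a combining acute accent, and one character outside the Basic Multilingual Plane.

```python
# Illustrative sample (an assumption, not from the post):
# "e" + U+0301 COMBINING ACUTE ACCENT, then U+1F600 (outside the BMP).
text = "e\u0301\U0001F600"


def code_units(s: str) -> dict:
    """Count the code units needed for s in each Unicode encoding form."""
    return {
        "utf-8": len(s.encode("utf-8")),            # 8-bit code units
        "utf-16": len(s.encode("utf-16-le")) // 2,  # 16-bit code units
        "utf-32": len(s.encode("utf-32-le")) // 4,  # 32-bit code units
    }


def utf8_boundaries(s: str) -> list:
    """Byte offsets in the UTF-8 form where a code point begins.

    Continuation bytes have the bit pattern 10xxxxxx; every other
    byte starts a new code point.
    """
    data = s.encode("utf-8")
    return [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]


print(code_units(text))       # {'utf-8': 7, 'utf-16': 4, 'utf-32': 3}
print(utf8_boundaries(text))  # [0, 1, 3]
```

Two perceived characters come out as 7, 4, and 3 code units respectively. UTF-8 has uneven boundaries, UTF-16 needs a surrogate pair for the second character, and UTF-32 still does not give one unit per perceived character, because the combining accent is its own code point.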

I find my UCS-2 to UTF-16 series, with its mix of bug reports, best practices, and aspirational suggestions, to be a much more reasonable approach to improving your code.

It was almost like one of those trains that you could get off of at any stop -- you never have to ride it to the end if it takes you as far as you wanted to go.

Now contrast that with "Should UTF-16 be considered harmful?", which is not really built to be helpful, even as it catalogs various problems.

By no means is it Stack Overflow at its best....

Now there is a ton of useful content there, too.

Maybe if this article didn't keep sending me pingbacks to remind me it's there, I wouldn't feel the need to comment. :-)

Yuhong Bao on 27 Apr 2012 5:57 PM:

Recently the WHATWG Encoding Living Specification classed UTF-16 as legacy:

mail.apps.ietf.org/.../msg02043.html

WndSks on 27 Apr 2012 8:24 PM:

Do you have an actual account so you can respond/help?

There is nothing like getting an answer from "the source"; here are two of them:

stackoverflow.com/.../raymond-chen

stackoverflow.com/.../larry-osterman

John Cowan on 28 Apr 2012 12:34 PM:

UTF-16 is legacy on the Web (less than 0.1% of all pages). Internally, it is anything but.

Pavel Radzivilovsky on 2 May 2012 7:23 AM:

Dear Michael,

I really suggest you invest some time in reading www dot utf8everywhere dot org. I hope this has the potential to convince you. Following the discussion which you mentioned, started by none other than the author of Boost.Locale, we compiled all arguments and counter-arguments and addressed them in this document.

I'd really appreciate your opinion on that.

Thanks.

Michael S. Kaplan on 2 May 2012 8:16 AM:

Interesting -- and unrealistic....

Joshua on 3 May 2012 1:52 PM:

That page raises something I hadn't noticed before. No way to write *portable* programs that are Unicode aware.

Michael S. Kaplan on 3 May 2012 2:30 PM:

Again, interesting yet unrealistic to think it matters to even 0.05% of developers in the real world.

B. Bill on 3 May 2012 3:12 PM:

Interesting yet unrealistic to think your UCS-2 to UTF-16 matters to even 0.05% of developers in the real world. No one, and I repeat it, *NO ONE*, except those who write Unicode algorithms, or text rendering engines, should care about encodings. The only way to do this is to standardize on *one* encoding. And the more you resist, the more harm you do to the world.

Michael S. Kaplan on 3 May 2012 3:45 PM:

I wish you luck in your aspirations, but no way will we ever have just one encoding form or scheme.

Life is about dealing with things as they are; deprecating thousands of functions and dozens of programming languages affecting hundreds of millions of people is never gonna happen.

pavel on 27 May 2012 1:12 PM:

Michael,

since utf8everywhere is in the air and has 200 visitors per day on average, I suggest you take some time to address the claims more seriously. Maybe changing the title of your post can also help :)

Thanks
