UCS-2 vs. UTF-16 (not quite Kramer vs. Kramer)

by Michael S. Kaplan, published on 2005/05/11 17:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/11/416552.aspx


Rasqual asked, in the suggestion box:

Windows associates the idea of Unicode with 'Wide char', that is a 2-byte long character (currently).

A comment on Raymond Chen's blog stated that Windows 2000 uses the UCS-2 representation of Unicode
and Windows XP and higher use UTF-16 (both little-endian).

Can you put up an explanation of whether a UCS-2 byte stream may be considered valid UTF-16?

The point is: can I use UTF-16 generically to handle "wide char" text or are there some caveats? Would a
call to, say, CreateFileW with a filename containing surrogates fail on Windows 2000?

Is it unsafe to assume 1 WCHAR == 1 Unicode character?

Note that this has no practical application, just things I'm wondering about.

Well, in an absolutely technical sense, at the file system level -- or even at the level where CreateFileW works -- Windows is neither. The OS simply takes an array of WCHAR values with a maximum size and a null WCHAR at the end, with a small number of illegal WCHAR values representing double quotes and such. There are all sorts of obnoxious things you can put in there -- illegal Thai sequences, unpaired surrogate code units, undefined code points -- and they will simply work. The only processing that happens is that an uppercase table is consulted to provide case insensitivity.
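To make the "unpaired surrogate code units" case concrete, here is a minimal sketch in Python of the check that a strict UTF-16 consumer would perform and that the file system level does not. The function name is made up for illustration; it is not anything Windows exposes.

```python
def find_unpaired_surrogates(units):
    """Return the indexes of surrogate code units that are not part of a
    valid high+low pair. `units` is a sequence of 16-bit integers.
    (Hypothetical helper, for illustration only.)"""
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: only valid if a low surrogate follows it.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both code units
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no high surrogate in front of it.
            bad.append(i)
        i += 1
    return bad
```

An empty result means the array is well-formed UTF-16; anything else is the kind of string that still "simply works" as a file name but will not look good higher up.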

However, when you move up to the level of displaying the list of files in a directory -- the Windows Shell (which is indeed where Raymond Chen works!) -- suddenly we start becoming conformant to all sorts of different standards and practices. And all of those misbehaving strings suddenly don't look very good (a fact that you do not notice when it's just a small array of WCHAR values). And here is where the issue of surrogate pairs gets interesting....

Now when Windows 2000 first shipped, there were not any actual defined supplementary characters (other than the Plane 14 language tags that no one liked or the Plane 15 and 16 private use characters that no one used).

Because of this more than anything else, Windows 2000 is not really "surrogate-enabled" by default. But there is nothing to stop surrogate code units from being used legally with valid pairings of high and low surrogates to represent supplementary characters. So people (if they said anything) would tend to just say it supports "UCS-2" as a shorthand for saying that it was "surrogate neutral." It had no knowledge or understanding of what these code points were, but it was not actively destructive. But usually it would not come up....

By the time of Windows XP, the landscape had changed a bit.

The OpenType spec extensions to support supplementary characters were mature and people were making use of them. Although there were not yet fonts shipping in the operating system, there were fonts out there, some available in Microsoft products and others from third parties. And anytime Uniscribe was turned on, the extra work to make sure that surrogate pairs got treated as one character (showing just one NULL GLYPH if no font was available), partially supported in Windows 2000, became more fully supported.
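As a rough illustration of what "treated as one character" means, and of why assuming 1 WCHAR == 1 Unicode character is unsafe, the sketch below counts code points in an array of UTF-16 code units. It is written in Python with made-up helper names, and it simply counts a lone surrogate as one character, which is an assumption of this sketch rather than a description of what Uniscribe does.

```python
import struct


def utf16_units(text):
    """Encode a Python string as a list of little-endian UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_code_points(units):
    """Count characters, treating each valid surrogate pair as one.
    A lone surrogate is counted as one character (assumption of this sketch)."""
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


units = utf16_units("A\U00010480")   # "A" followed by OSMANYA LETTER ALEF
print(len(units))                    # number of 16-bit code units
print(count_code_points(units))      # number of code points
```

The two numbers differ as soon as a supplementary character is present, which is exactly the gap between "surrogate neutral" and "surrogate enabled."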

At some point, a switch was flipped and everyone started talking about all of the work that had been done. But how do you describe infrastructure when you do not have fonts to actually display the characters? The only way it could be described was that we now support UTF-16, whereas before it was just UCS-2. And the whole distinction between the two platforms was made, kind of after the fact.

 

This post brought to you by "𐒀" (U+10480, a.k.a. OSMANYA LETTER ALEF)
(or U+d801 U+dc80 for people who prefer to work in surrogate pairs!)
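
For anyone who wants to check that arithmetic, the mapping between a supplementary code point and its surrogate pair is the standard UTF-16 calculation, sketched here in Python with illustrative function names:

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) surrogate pair using the standard UTF-16 arithmetic."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a valid (high, low) surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x10480)
print("U+%04x U+%04x" % (high, low))       # U+d801 U+dc80
```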


# MGrier on 11 May 2005 6:20 PM:

Well actually if you're a /good/ UCS-2 citizen you should have rejected any 2-byte sequences in the surrogate pair range.

Maybe it's fortunate that everyone wasn't good and thus tends to just be a pipe for byte-pairs that we can redefine to USHORT, UCS-2 or UTF-16 as we see fit...

# Michael S. Kaplan on 11 May 2005 7:07 PM:

Well, that point is one others would disagree with (it would make an application not forward compatible).

# Qflash on 15 May 2005 2:21 AM:

RePost:
http://www.yeyan.cn/SoftwareEngineering/UTF16UCS2.aspx

# Ben Bryant on 17 May 2005 12:51 PM:

Good information, but I am not sure if the simplest part of the question was answered, and I think I can handle it: yes, "a UCS-2 byte stream may be considered valid UTF-16". i.e. if it was valid UCS-2, it surely is valid UTF-16. It gets complicated when you're talking about whether parts of the platform are really dealing with UCS-2 or UTF-16.

# Michael S. Kaplan on 17 May 2005 5:04 PM:

Yes -- what was valid UCS-2 is valid UTF-16.

Also, since each version has some rules about allowing future version code to work in it, one could say that any valid UTF-16 is valid UCS-2, also. :-)

referenced by

2008/09/08 UCS-2 to UTF-16, Part 1: Getting the obvious out of the way

2006/01/17 They don't make Null Glyphs like they used to!

2005/05/12 Thinking beyond the BMP of Unicode
