UTF-8 and GB18030 are both 'NT' code pages, they just aren't 'ANSI' code pages

by Michael S. Kaplan, published on 2007/01/03 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/03/1392379.aspx


Michael Entin asks in the Suggestion Box:

Hi Michael.

I want to revisit UTF-8 discussion.

In several posts you wrote that it is impossible to support UTF-8 as NT code page, since there is a lot of legacy code that assumes maximum of 2 bytes per char. So it is impossible to fix all this code to support UTF-8.

I don't quite understand how then does Windows support GB 18030 encoding? It appears it is a very similar encoding, where a character can be encoded by up to 4 bytes.

What are the differences between these two encodings? How come Windows can support one, but not the other?

I believe he is referring to this post and/or this post and/or the comments in this one....

And it is still true that UTF-8 (code page 65001) cannot be an ACP ("ANSI" code page) for a locale.

But from a technical standpoint, neither can GB-18030 (code page 54936) -- for pretty much the same reason.
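
To make the "same reason" a bit more concrete, here is a minimal Python sketch (nothing from Windows itself) that measures the longest encoded form of any single character in a string. The helper name `max_bytes_per_char` is made up for illustration, and using Python's built-in `gbk` codec as a stand-in for a double-byte code page is an assumption of this sketch.

```python
def max_bytes_per_char(text: str, encoding: str) -> int:
    """Return the longest encoded length, in bytes, of any single character in text."""
    return max(len(ch.encode(encoding)) for ch in text)


# 'a' (ASCII), U+4E2D (a common CJK ideograph), U+20000 (a supplementary-plane ideograph)
sample = "a\u4e2d\U00020000"

for name in ("utf-8", "gb18030"):
    print(name, max_bytes_per_char(sample, name))

# A double-byte code page never needs more than two bytes per character,
# but it also cannot represent U+20000 at all, so it is left out of this string.
print("gbk", max_bytes_per_char("a\u4e2d", "gbk"))
```

Both UTF-8 and GB18030 come out at four bytes for this sample, while the double-byte code page tops out at two, which is the assumption the question says the legacy code is built on.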

The GB-18030 question is a bit more interesting since I am pretty sure there was an official request that we change the default system code page of the zh-CN locale to GB-18030, but unfortunately the answer was the same.

These code pages are present so that people can convert data a user might run across out of and into them; they are not for the legacy ("ANSI") support in the Win32 API which, since The Unicode train is leaving the station, is not being added to or updated. So they work great in MultiByteToWideChar and WideCharToMultiByte, but the core OS is not going to be updated to work internally off of either one.
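
As a rough illustration of that conversion-only role, here is a Python sketch that goes from GB18030 bytes to a Unicode string and back out to UTF-8. It uses Python's built-in codecs rather than the Win32 functions, and the helper name `transcode` is hypothetical; the two steps only loosely correspond to what MultiByteToWideChar and WideCharToMultiByte do.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from the src encoding, then re-encode them in the dst encoding."""
    text = data.decode(src)   # bytes -> Unicode string (the "to wide" step)
    return text.encode(dst)   # Unicode string -> bytes (the "to multibyte" step)


original = "Lao letter: \u0edc"
gb_bytes = original.encode("gb18030")
utf8_bytes = transcode(gb_bytes, "gb18030", "utf-8")

# Both encodings can represent this text, so the round trip is lossless.
assert utf8_bytes.decode("utf-8") == original
assert transcode(utf8_bytes, "utf-8", "gb18030") == gb_bytes
```

Converting at the boundary like this is a very different thing from having every "ANSI" entry point in the OS understand the encoding natively.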

Now the job would not be entirely impossible, though I suspect it is fairly improbable. I say this as someone who has written a Unicode Layer for Win9x systems, and who was once asked by another company to write a UTF-8 Encoding Layer for NT (or UELNT, I guess?). This would require a serious and non-trivial development effort, whether one is inside or outside of Microsoft, and there simply isn't a specific reason or benefit to doing it that would outweigh the cost.

Now if I ever retired, that UELNT project might be something interesting to take a shot at if someone really wanted to fund it. But I would probably have to run out of other stuff to do first, and that doesn't seem likely to happen any time soon. :-)

 

This post brought to you by ໜ (U+0edc, a.k.a. LAO KO LA)


# Adam on 3 Jan 2007 4:51 AM:

"[UTF-8 code pages] are not for the legacy ("ANSI") support in the Win32 API which, since The Unicode train is leaving the station"

Hmmmm....... it seems weird when I hear a lot of MS people and most of my cow-orkers talk about "Unicode"; they seem to think that "Unicode", "UCS-2" and "UTF-16" are all the same thing, and always use the word "Unicode" to describe all three.

Most strangely, they keep claiming that UTF-8 is *not* Unicode! I hear sentences like "Ah, no, that file is in UTF-8, you need to convert it to Unicode to [whatever]" told to customers, and it *really* puts my teeth on edge.

# Michael S. Kaplan on 3 Jan 2007 4:59 AM:

Well, you can blame the people who did the Notepad "file types" list on that one. :-)

# Stephane on 10 Jan 2009 12:41 PM:

You say that UELNT is a serious, non-trivial development effort.

Unless I'm missing something, isn't it just writing wrapper functions that would do a MultiByteToWideChar(CP_UTF8,...) on the text parameters on input, call the .....A form of the function, and do a WideCharToMultiByte on the output parameters?

Isn't that already what most .....W functions do?

# Joshua on 27 Jun 2011 12:02 PM:

I ran across a case where I really wanted this: so I could declare SQL Server's VARCHAR to be UTF-8 (using NVARCHAR would have caused us to exceed the 8000-byte row limit too easily).

Our use case involved embedding short strings of non-English characters (such as names) into moderate-length strings of English text (say 50-1000 characters). The maximum was 5500.


referenced by

2011/06/22 There and Back Again (aka ACP --> UTF-8 --> ACP)

2007/07/11 GB18030 isn't an ACP, either
