The CRT and UTF-8

by Michael S. Kaplan, published on 2007/05/11 07:49 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2007/05/11/2547703.aspx


Sebastian asks:

Our runtime system is written in C and relies on standard C routines like mblen()/mbtowc() to handle multi-byte characters according to setlocale() settings...

I understand the 65001 codepage is not well supported in VC++ 7.1 and it has been de-supported in VC++ 8.

Could you please confirm this is definitive or are there any plans to support 65001 again? This is a critical question for us because we would need to handle utf-8 ourselves if MSCRT does not.

We would have the option to review the code and support the "Windows Unicode" UTF-16 wide-char, but that requires a major and risky modification.

Would be fantastic if you could talk about this issue with other VC++ gurus...

Thank you very much for reading.
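For context, here is a minimal sketch of the kind of loop the question describes: walking a `char *` string one multibyte character at a time with `mblen()`, whose behavior depends on the current `setlocale()` setting. The function name `count_mb_chars` is hypothetical and not taken from Sebastian's runtime.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: counts the characters in a multibyte string
   according to the current LC_CTYPE locale.
   Returns -1 if mblen() reports an invalid or incomplete sequence. */
long count_mb_chars(const char *s)
{
    assert(s != NULL);
    size_t left = strlen(s);
    long count = 0;

    (void)mblen(NULL, 0);             /* reset any shift state */
    while (left > 0) {
        int len = mblen(s, left);     /* bytes in the next character */
        if (len <= 0)
            return -1;                /* invalid for this locale */
        s += len;
        left -= (size_t)len;
        count++;
    }
    return count;
}
```

Code like this is only as good as the locale underneath it: if the runtime cannot be put into a UTF-8 locale, the loop above cannot see UTF-8 characters as single units.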

"De-supported" and "poorly supported" are interesting terms here. Euphemisms, really.

You could replace poorly supported with something less charitable, since there really have been problems in the past any time you used anything in UTF-8 taking up more than two bytes.

And you could replace de-supported with something more charitable since the problems were recognized and the code was made more robust.

But that is all just semantics, really. Everyone knows what used to work and what does not work now....

Visiting every single multibyte function and verifying that it can handle longer sequences rather than just one or two bytes would be a huge effort, so it shouldn't be too surprising that no one is eager to go down that road. Especially when all it really gains people is the ability to try to support Unicode without having to do their own code review to make sure their code does not have the same problems (an issue which usually turns out to be there).

Of course when there is a problem (more often than not), Microsoft gets blamed....

I feel that, like any big company, Microsoft makes enough mistakes that it is unfair to blame it for the things that aren't its fault. But lots of people like to do that anyway. :-(

For the record, I am unaware of plans to start robustly supporting UTF-8 in all of the multibyte C runtime functions -- as far as I know the place where UTF-8 is supported (and supported well) is in conversion to and from UTF-16.

I'm not on that team, of course. I just know how much work would be involved. As can anyone -- the source for the CRT ships with Visual Studio so the work required there is hardly a secret (and anyone on the beta will be able to tell if that kind of change was happening for the next version).

But to be honest, given that every project I have ever seen that claimed to support UTF-8 failed in the 3/4 byte cases (some even failed in the 2-byte cases!), I feel a lot more comfortable that the right kind of work is going to happen if people convert their project to UTF-16 and that way have the opportunity to make sure that they handle everything correctly. It would continue to be what I would recommend to get the right support of Unicode in any application and I consider it neither risky nor time-consuming -- because adequate support for the world is a fairly compelling business case. :-)
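To make the 2-, 3- and 4-byte cases concrete, here is a minimal sketch of a single-character UTF-8 decoder that accepts only well-formed sequences. It is an illustration, not code from the CRT or from any project discussed here, and the name `utf8_decode` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence from s (at most n bytes).
   Stores the code point in *cp and returns the bytes consumed (1..4),
   or 0 if the sequence is truncated or not well-formed. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    uint32_t c = s[0];
    uint32_t min;                 /* smallest value allowed for this length */
    size_t len;

    if (c < 0x80) { *cp = c; return 1; }
    else if ((c & 0xE0) == 0xC0) { len = 2; min = 0x80;    c &= 0x1F; }
    else if ((c & 0xF0) == 0xE0) { len = 3; min = 0x800;   c &= 0x0F; }
    else if ((c & 0xF8) == 0xF0) { len = 4; min = 0x10000; c &= 0x07; }
    else return 0;                /* stray continuation byte or 5/6-byte lead */

    if (n < len)
        return 0;                 /* truncated sequence */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;             /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3Fu);
    }
    if (c < min)
        return 0;                 /* overlong encoding */
    if (c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;                 /* out of range, or a surrogate */

    *cp = c;
    return len;
}
```

The checks after the loop are the ones that tend to be missing in practice: a decoder that only handles lead bytes up to `0xDF` passes every Latin-1 test and fails on the first Euro sign.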

I continue to look for useful samples that I can convert here and am happy to answer questions that come up. So why not move your application to Unicode, today?

 

This post brought to you by ๚ (U+0e5a, a.k.a. THAI CHARACTER ANGKHANKHU)


# Mihai on Friday, May 11, 2007 12:38 PM:

I can see where the "Unicode with UTF-8" idea comes from: the Unix/Linux world.

It works. Now, I have some problems with UTF-8 in the belly of an application, and in the long run I think a wide encoding (utf-16/utf-32) is a better idea.

But thing is, many applications need zero changes to work with UTF-8 in that scenario (especially if they don't do any text manipulation).

# Michael S. Kaplan on Friday, May 11, 2007 12:41 PM:

The number of apps that have a CRT dependency that have NO text manipulation yet need UTF-8 support is very small, of course....

# Nick Lamb on Saturday, May 12, 2007 5:38 AM:

Mihai, UTF-8 comes from Ken Thompson & Rob Pike when they were working on Plan 9, which also followed your thinking and had a "wide character" type called a Rune which the designers thought might assist Unicode programming. UTF-8 was presented to Unix conferences (because the Plan 9 team felt that their operating system was a natural successor to UNIX), and given Unicode standard encoding status.

Runes went nowhere, UTF-8 became the de facto character encoding of Unicode locales on UNIX and Unix-like systems, and from there conquered the visible world.

Note that after UTF-8 was invented Plan 9 was converted to UTF-8 from using a Windows-style 16-bit 'wide character' as its internal character type. If you were right to be squeamish about UTF-8 internals this would have been a mistake, but the Plan 9 team and many other successful Unicode projects found that it was actually a better way to go, because it makes you actually do the work up-front. The alternative is... well, when I look at most software that chose to use a 16-bit wchar_t representation and search for their mention of Unicode I tend to find an asterisk in the feature list like this*

* Only the Basic Multilingual Plane is supported in this release. Sorting supported only for European languages. Using other Unicode characters may have undefined effects. Some other restrictions apply, please see documentation for details. Sold by weight, contents may have settled.

I like to think of this as an example of how programmers get tired easily despite being sat down, usually in an air-conditioned room. The programmer choosing UTF-8 starts with a program that works, but only for ASCII. By the time he's implemented enough standard compliance to have German working well and go home, support for Linear B and Japanese just falls out naturally. But after all that work of adding Ws and Ls everywhere, and replacing one function with another, and just trying to get the program to compile again, the UTF-16 programmer is tired and just wants to go home. So no actual Unicode support gets delivered, he's just converted his program to a 16-bit character type which happens to be enough to sort-of work for some locales.

# Dean Harding on Saturday, May 12, 2007 9:52 AM:

So the programmer gets tired and does a half-assed job converting a project from 8-bit characters to 16-bit characters, but apparently not the other way around?

Support for sorting and characters outside the BMP is a platform issue anyway, where does an *application* have to handle them differently?

# Michael S. Kaplan on Saturday, May 12, 2007 12:03 PM:

Hi Nick,

To be perfectly honest, that has *never* been my experience. Beyond the obvious points that Dean raises, most of the "Unicode via UTF-8" applications I have been asked to review had only piss-poor support of Unicode as the developers were happy that it handled 8859-1 and they could claim Unicode support -- but often tripped as soon as you got to three byte characters and died on the vine trying to handle the four-byte ones (except for the worse ones that detoured into the illegal six-byte ones because they never did embrace UTF-8 other than as an accident of architecture).

Maybe it is just that you know lots of non-lazy UTF-8 developers and lazy UTF-16 developers, while for me the converse has been true?

# Nick Lamb on Saturday, May 12, 2007 9:35 PM:

"So the programmer gets tired and does a half-assed job converting a project from 8-bit characters to 16-bit characters, but apparently not the other way around?"

Dean, do you think these operations are symmetrical? Computers are byte oriented. Languages that leave this sort of low-level decision up to the programmer are almost inevitably also byte oriented.

"Maybe it is just that you know lots of non-lazy UTF-8 developers and lazy UTF-16 developers, while for me the converse has been true?"

Michael, as I understand it you mostly (only?) look at software for a platform that, as this post explains, doesn't really support UTF-8, in fact it's had such embarrassing UTF-8 related bugs (ignoring all the equal opportunity Unicode bugs) that it would be easy to believe that it not only doesn't have any non-lazy UTF-8 developers, it doesn't have any developers familiar with UTF-8 at all. Hence, presumably, the decision not to support 3rd parties who'd like to use it.

I didn't think it was really fair on you to concentrate on that platform, I was actually thinking of several popular portable widget toolkits and database engines although I confess that Sun's Java crossed my mind too. It's funny, an unrelated Google search brought up Markus' Unicode TN12. There, proudly presented as an example for his argument is Trolltech's Qt. That's funny because it was the example I was going to choose to make the opposite argument.

Markus claims that a popular cross platform widget toolkit library called Qt supports Unicode through UTF-16. Pretty straightforward, if he didn't have such an example it would tear a big hole in his argument. Except that when he wrote that, Trolltech didn't have a version of Qt which did more than UCS-2. Despite years of bug reports from developers and end users, it took until the June 2005 release of an incompatible new major version for them to have a workable Unicode solution. Major applications began porting soon after, and are just starting to appear in the last six months. Meanwhile, the obvious alternative to Qt, GTK+ which Markus doesn't list because it uses UTF-8, had much better Unicode support.

Now I don't think Markus did this maliciously. He maybe hasn't ever used Qt, perhaps he just did a web search, or asked some friends, and found that Qt was an example of Unix software with Unicode support that had a 16-bit character type. He could have done the research to find out that it was a terrible example, but he's a busy guy. Still it's funny because if you look beyond the surface of UTN12 what you find completely undermines his argument.

# Michael S. Kaplan on Saturday, May 12, 2007 9:58 PM:

Actually, I have done application reviews across many different platforms over the last ten years. The average application is NOT cross-platform whether it claims to support Unicode or not, and most of the apps which claim to support Unicode via UTF-8 are limited to 2 or 3 bytes per character versions....

# Rosyna on Sunday, May 13, 2007 4:30 AM:

Yet another argument for abstract string types (*wink*)...

# sebflaesch on Friday, June 01, 2007 7:26 AM:

Sorry to jump in so late...

I want to thank you guys for taking my problem into consideration. I really appreciate it.

Just to give you a bit more background about our case: You can compare our product to a Java compiler+VM, but in the Informix 4gl specific market. As with Java, you can develop/compile on UNIX and deploy on any other platform. Our VM can connect to Informix IDS, and I wrote the db drivers to support Oracle, DB2, PostgreSQL, MySQL, Sybase and SQL Server.

Most of our customers come from the Unix world, but more and more tend to move to Microsoft with SQL Server or Oracle.

Back to the UTF-8 / UTF-16 discussion:

Java / QT / Windows / SQL Server use UTF-16/UCS-2, ok. That's a fact and I don't want to argue against UTF-16 as I don't have the skills to do that. But I know the constraints we have: There is a lot of legacy Informix 4gl source code out there we must support, written in a single-byte charset such as ISO-8859-1, using Informix servers with the same encoding and thus requiring NO charset conversion between the database client and the runtime system. This works fast with (char *) string buffers.

There are other constraints coming from the Informix 4gl world, like the ability to write language extensions in C, requiring to support any sort of character encoding on the VM side.

We do also now have our own database server, storing data in UTF-8 (implemented with the UCI library), and I doubt it would be elegant to use 2 different encodings in our VM and in our DB server: Components of the same product line should use the same technos.

So to me the best choice for our VM is to handle strings with (char*) instead of (WCHAR/wchar_t *), and thus the de facto UNICODE encoding we would support is UTF-8, not UTF-16.

So for now the plan is to implement UTF-8 support by hand, and wrap any libc function call like fopen() to the WideChar equivalent _wfopen(), by doing the conversion from UTF-8 to WideChar. Same conversions take place in the database driver using SQLPrepareW() and SQL_C_WCHAR ODBC stuff.
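A minimal sketch of such a wrapper, assuming Win32's `MultiByteToWideChar()` with `CP_UTF8` and the CRT's `_wfopen()`. The name `utf8_fopen` and the fixed buffer sizes are hypothetical choices for illustration. On other platforms the sketch simply falls through to `fopen()`.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Hypothetical wrapper: opens a file whose name is given in UTF-8.
   Returns NULL on conversion failure or if the file cannot be opened. */
FILE *utf8_fopen(const char *path_utf8, const char *mode)
{
    assert(path_utf8 != NULL && mode != NULL);
#ifdef _WIN32
    wchar_t wpath[MAX_PATH];   /* assumption: path fits in MAX_PATH */
    wchar_t wmode[16];

    /* -1 means "convert up to and including the terminating NUL";
       MB_ERR_INVALID_CHARS makes malformed UTF-8 fail instead of
       being silently replaced. */
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            path_utf8, -1, wpath, MAX_PATH) == 0)
        return NULL;
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            mode, -1, wmode, 16) == 0)
        return NULL;
    return _wfopen(wpath, wmode);
#else
    return fopen(path_utf8, mode);
#endif
}
```

The rest of the runtime can then keep passing `char *` strings around, with the conversion confined to the boundary where the operating system is called.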

Thanks a lot for reading.

Seb

# sebflaesch on Monday, June 04, 2007 4:57 AM:

Note that the MSDN lib should be reviewed to remove all references to the UTF-8 / 65001 codepage:

=====================================================================

GetOEMCP

...

Note: The ANSI code pages can be different on different computers, or can be changed for a single computer, leading to data corruption. For the most consistent results, applications should use Unicode, such as UTF-8 (code page 65001) or UTF-16, instead of a specific code page.

...

=====================================================================

http://msdn2.microsoft.com/en-us/library/ms776309.aspx

I found other references to 65001 in SQL Server docs and so.

# Michael S. Kaplan on Monday, June 04, 2007 7:41 AM:

There is nothing wrong with that reference -- I am not sure I understand why it would have to change, exactly?

# sebflaesch on Monday, June 04, 2007 10:37 AM:

So the 65001 codepage is still supported by Microsoft functions like GetACP(), but setlocale() does not... Right?

Then somewhere in the CRT Library Reference, I should find what code page I can use in setlocale()... this is not very clear to me.

It gets even more confusing as you can read that setlocale(LC_ALL,"") defaults to the system-default ANSI Code Page...

Can the system-default ANSI Code Page be 65001? And if yes, what happens with setlocale() dependent functions such as mblen()?

# Michael S. Kaplan on Monday, June 04, 2007 1:20 PM:

Um, huh????

GetACP does not support UTF-8.

# Michael S. Kaplan on Monday, June 04, 2007 2:02 PM:

I think I see the problem here -- the docs say STOP USING CODE PAGES, USE UNICODE INSTEAD and because it mentions both UTF-8 and UTF-16, you take that to mean that it is saying GetACP and GetOEMCP could return these values.

They cannot.

# sebflaesch on Wednesday, June 06, 2007 10:12 AM:

Ok but when I read:

"... applications should use Unicode, such as UTF-8 (code page 65001) or UTF-16, instead of a specific code page."

It tells me about a code page 65001, not about "USING UNICODE INSTEAD".

Maybe the doc should just say:

"... applications should use the native UNICODE encoding for Windows systems, which is UTF-16, instead of a specific code page."

# Michael S. Kaplan on Wednesday, June 06, 2007 11:32 AM:

Except there are definite uses for UTF-8 -- it is the default encoding form of Unicode for the web, for example. The use of code pages for file formats and even transmission formats still happens, and that is something that is lossy (in fact, getting the ACP or the OEMCP and using it for such operations is guaranteed to be lossy for most of Unicode).

Thus someone asking the system for a code page via GetACP() or GetOEMCP() is well advised to consider UTF-16 *or* UTF-8, depending on what they plan to do with the code page. An application that uses either or both of them properly when required will be better able to handle the full scope of Unicode....

# joehtg on Thursday, December 27, 2007 9:40 AM:

Thanks for clarifying the status of UTF-8 support in CRT.

I fully agree with you that UTF-16 is the way to go.

I have migrated the sources I am maintaining to UTF-16 with the help of ICU (IBM Opensource International Components for Unicode) on Unix and Windows.

But sometimes you have to cooperate with the outside world.

I have never heard of a UTF-16 telnet session, and I am used to working with Oracle-sqlplus on the command line on all platforms with UTF-8 encoding.

When I tried to display the Euro sign with sqlplus on the windows command line,

I came across two of the bugs mentioned above:

1) chcp 65001 makes cmd ignore all subsequent batch commands silently (.cmd, .bat)

2) MSVCRT6,7,8: fwrite stdout does not work if the first character in a line is above 127.

If the first byte in a line is below 128, the remaining UTF-8 characters are displayed correctly in Lucida Console.

  const char Euro[] = { 0xe2, 0x82, 0xac, '\n' }; // Euro sign in UTF-8 encoding

  SetConsoleOutputCP(CP_UTF8);

  // setvbuf(stdout, outbuf, _IOLBF, sizeof(outbuf));

  rc = fwrite(Euro, sizeof(Euro), 1, stdout );

For some strange reason, unbuffered fwrite makes two _write calls for these 4 bytes, first call one byte, next call remaining 3 bytes. This separation breaks UTF-8 output to Console.

With setvbuf(), the code displays the Euro sign.
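A self-contained sketch of the buffered variant follows. The function name `write_euro_buffered` and the static `outbuf` are hypothetical, and the explicit `fflush()` is an addition that is not in the snippet above: it is there because a runtime may treat `_IOLBF` as full buffering. For the console case described above, `out` would be `stdout`.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _WIN32
#include <windows.h>
#endif

static char outbuf[BUFSIZ];   /* must outlive the stream that uses it */

/* Hypothetical helper: writes the UTF-8 Euro sign plus newline to `out`
   through a buffer, so the bytes of the character reach the device in a
   single write. Must be called before any other I/O on `out`, because
   setvbuf() is only valid on a stream that has not been used yet.
   Returns 0 on success, -1 on failure. */
int write_euro_buffered(FILE *out)
{
    static const unsigned char euro[] = { 0xe2, 0x82, 0xac, '\n' };

    assert(out != NULL);
#ifdef _WIN32
    SetConsoleOutputCP(CP_UTF8);      /* console interprets output as UTF-8 */
#endif
    if (setvbuf(out, outbuf, _IOLBF, sizeof outbuf) != 0)
        return -1;
    if (fwrite(euro, sizeof euro, 1, out) != 1)
        return -1;
    /* Flush explicitly rather than relying on line buffering. */
    return fflush(out) == 0 ? 0 : -1;
}
```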

I have to admit that I am a lazy programmer too, while our customers are requesting support for the languages spoken in the European Union Real Soon Now, I believe that I will be retired before the first bug related to a surrogate pair shows up.

# Arjan van Bentem on Friday, August 15, 2008 8:35 AM:

> chcp 65001 makes cmd ignore all subsequent batch

> commands silently (.cmd, .bat)

A workaround:

chcp 65001 && your_command_here ...

Also note that one should not use "raster fonts" (instead: use Lucida Console) within  the cmd.exe window.

# Yuhong Bao on Tuesday, October 21, 2008 11:09 PM:

"* Only the Basic Multilingual Plane is supported in this release. "

Another possibility here is that the code was originally designed for UCS-2 and thus does not handle surrogates properly.
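For readers unfamiliar with the distinction: UCS-2 stores every character in one 16-bit unit, while UTF-16 represents code points above U+FFFF as a pair of surrogate units. A minimal sketch of the encoding step, with the hypothetical name `utf16_encode`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written,
   or 0 if cp is not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(out != NULL);
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* out of range or surrogate */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;                   /* BMP: one unit, same as UCS-2 */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high (lead) surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low (trail) surrogate */
    return 2;
}
```

Code written for UCS-2 assumes the second branch never happens, which is why it tends to split, truncate, or miscount characters outside the Basic Multilingual Plane.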

# yuhong2 on Friday, March 05, 2010 9:15 PM:

"most of the "Unicode via UTF-8" applications I have been asked to review had only piss-poor support of Unicode as the developers were happy that it handled 8859-1"

Or DBCS!

