Who broke the UTF-8 support?

by Michael S. Kaplan, published on 2006/03/13 03:21 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/13/550191.aspx


George asked me the other day via email:

When I make the following call in Visual C++ (Visual Studio 2003), it succeeds:

char * plocale = setlocale( LC_ALL, ".65001" );

But when I try this in 8.0 (Visual Studio 2005), plocale is NULL.

Who broke the UTF-8 locale support? And when will it be fixed?

A very similar question was actually asked on one of the internal Microsoft aliases late last year. It was answered by none other than VC++ guru Martyn Lovell:

VC7 didn’t support UTF8 (or 7) correctly. But it didn’t error, it just silently did the wrong thing.

We tried to find time to do the work in VC8 to support these codepages, but we didn’t have time.

So now, at least, we explicitly error so that you know that we don’t work right in these codepages

Martyn

So George -- saying that it used to work might be an overstatement.

But look on the bright side -- if it stops failing in the next version, it's a good indication that things are now expected to work properly.... :-)
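For code that has to build under both versions, the practical consequence is that the return value of setlocale needs checking. Here is a minimal sketch of that check, where try_set_locale is a hypothetical helper name and not part of the CRT:

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: returns 1 if the runtime accepted the requested
   locale, 0 if setlocale() returned NULL. On failure the C standard
   leaves the current locale unchanged, so the caller keeps whatever
   was in effect before the call. */
int try_set_locale(const char *name)
{
    const char *result = setlocale(LC_ALL, name);

    if (result == NULL) {
        fprintf(stderr, "setlocale(LC_ALL, \"%s\") failed\n", name);
        return 0;
    }
    return 1;
}
```

A caller could try try_set_locale(".65001") first and, when it returns 0, carry on in the locale it already had instead of assuming UTF-8 behavior.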


Ben Bryant on 13 Mar 2006 7:20 AM:

Good to see this post. It leads me to wonder what effect the person was expecting setlocale 65001 would actually have if it did work (it certainly wouldn't affect the code page used in Windows APIs or messages).

Multibyte character sets are a fuzzy area in Visual C++. Most people don't know there are two families of multibyte functions (in addition to the Win32 ANSI APIs): one follows standard C and setlocale, and the other is Microsoft-specific, starts from the default system locale, and can be changed on the fly. I've struggled to understand this because MSDN completely ignores the distinction, so I wrote about it here:

http://codesnipers.com/?q=node/46

"locale dependent" functions: start out in "C" locale; are controlled by setlocale; use header stdlib.h; examples are mblen, isleadbyte, _mbstrlen, mbtowc, wctomb.

default system locale functions: start out in GetACP locale; are controlled by _setmbcp; use header mbstring.h; examples are _mbclen, _isleadbyte, _mbslen.
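
A rough sketch of the distinction, assuming the two descriptions above are accurate. count_locale_chars and count_mbcp_chars are hypothetical helper names. The first is built on the standard mblen and so follows setlocale; the second compiles only against the Microsoft CRT and follows _setmbcp instead.

```c
#include <stdlib.h>
#include <string.h>
#ifdef _MSC_VER
#include <mbstring.h>
#endif

/* Hypothetical helper: counts characters with the setlocale-controlled
   family. Returns -1 if the bytes are not valid in the current locale. */
long count_locale_chars(const char *s)
{
    size_t remaining = strlen(s);
    long count = 0;

    (void)mblen(NULL, 0);               /* reset any shift state */
    while (remaining > 0) {
        int n = mblen(s, remaining);    /* bytes in the next character */
        if (n <= 0) {
            return -1;                  /* invalid or incomplete sequence */
        }
        s += n;
        remaining -= (size_t)n;
        ++count;
    }
    return count;
}

#ifdef _MSC_VER
/* Microsoft-only counterpart: uses the multibyte code page selected
   with _setmbcp, regardless of what setlocale was given. */
long count_mbcp_chars(const char *s)
{
    return (long)_mbslen((const unsigned char *)s);
}
#endif
```

For a plain ASCII string the two should agree; they can differ once setlocale and _setmbcp point at different code pages.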

Ariel on 17 Jun 2009 4:01 AM:

Just saw this post.

I just found a piece of code that receives a string in a given encoding and breaks it down into pieces. The nice thing is that it makes sure the string doesn't break in the middle of a character.

The whole logic relies on _mbstrlen_l().

But now I find that _mbstrlen_l() doesn't support UTF-8, because _create_locale(LC_CTYPE, ".65001") returns NULL.

Frustrating!

Michael S. Kaplan on 18 Jun 2009 2:29 PM:

This is misleading -- even when creating UTF-8 locales "worked", _mbstrlen_l() didn't work with them -- this function only ever worked with CJK double-byte code pages. Perhaps code that doesn't expect it to work in places where it won't would be preferable?

Now that is what I find most frustrating! :-)
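
One way to avoid depending on a UTF-8 locale at all, when the input is already known to be well-formed UTF-8, is to look at the bytes directly: continuation bytes always have the form 10xxxxxx. The sketch below illustrates this under that assumption and is not the code Ariel describes; utf8_safe_cut is a hypothetical name.

```c
#include <stddef.h>

/* Hypothetical helper: returns the largest length <= max_bytes at which
   the UTF-8 buffer s (len bytes) can be cut without splitting a
   multi-byte sequence. Assumes s is well-formed UTF-8. */
size_t utf8_safe_cut(const char *s, size_t len, size_t max_bytes)
{
    size_t cut = max_bytes < len ? max_bytes : len;

    /* If the byte just after the cut is a continuation byte (10xxxxxx),
       the cut lands inside a sequence, so back up to its lead byte. */
    while (cut > 0 && cut < len &&
           ((unsigned char)s[cut] & 0xC0) == 0x80) {
        --cut;
    }
    return cut;
}
```

Calling this repeatedly on the remainder of the buffer yields pieces that each end on a character boundary, with no locale involved. It does not validate the input, so malformed UTF-8 would need a separate check.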

Sorin Ionuț Sbârnea on 29 Apr 2010 4:40 AM:

I would like to know if there is a way of using UTF-8 in your source code that would compile and run with both Microsoft and non-Microsoft compilers (like gcc). There is an interesting question at http://stackoverflow.com/questions/688760/how-to-create-a-utf-8-string-literal-in-visual-c-2008 but I do not like the current accepted answer: no solution. Is this still true, or did MSVC 2008 or 2010 introduce some changes here?

Michael S. Kaplan on 29 Apr 2010 9:14 AM:

Not really related ssbarnea, except tangentially. Perhaps you wanted to put something in the Suggestion Box? :-)

Sorin Ionuț Sbârnea on 6 May 2010 3:09 PM:

Thanks Michael, but I'm sure that the suggestion box is already full of other stuff.

I would really like to see if you could come up with a simple Unicode "¡qʃɹoʍ oʃʃǝɥ" application that will compile and run on Windows, OS X and Linux.

Michael S. Kaplan on 6 May 2010 5:28 PM:

It is not full at all; there is one item in it at the moment.

I have a lot of other topics, and that list is one of my later to-do lists; I do not track random comments in unrelated blogs. So if you are okay with me potentially never getting to it then I suppose here is fine too....

