Those chars aren't the UTF-8 you're looking for. Move along...

by Michael S. Kaplan, published on 2010/09/01, original URI:

The question:


Can anyone please explain why I lose my non-English character in a string when trying to convert a std::string to Unicode using the CP_UTF8 code page, but not when I use the ANSI code pages CP_ACP or CP_THREAD_ACP? I always thought std::string is  utf_8 encoded!

Thanks in advance,

std::string str("tést");
// First call: ask for the required buffer size in WCHARs, including the terminator.
int size = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
if( size == 0 )
{
    printf("Failed to get the multi byte size [%lu]\n", GetLastError());
    return false;
}
printf("MultiByteToWideChar returned the size of: %d \n", size);
*wstr = new WCHAR[size];
// Second call: perform the conversion into the newly allocated buffer.
if( MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, *wstr, size) == 0 )
{
    printf("Failed to convert multi byte to wide char [%lu]\n", GetLastError());
    delete[] *wstr;
    *wstr = NULL;
    return false;
}

The answer is kind of in there, just a little bit.

Can you find it? :-)

It is right here:

I always thought std::string is  utf_8 encoded!

Because it isn't.

If you want to store Unicode text, let's look to the basic_string docs:


The data type of a single character to be stored in the string. The Standard C++ Library provides two specializations of this template class, with the type definitions string, for elements of type char, and wstring, for elements of type wchar_t.
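To make that concrete, here is a minimal sketch of what the two specializations actually store. The function name show_code_units is hypothetical, and the UTF-8 bytes are written out as explicit escapes so the result does not depend on how the source file was saved. The point is only that size() counts elements of the given type (code units), and nothing in either class records which encoding those elements are in.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// The two specializations differ only in their element type.
static_assert(std::is_same<std::string::value_type, char>::value,
              "std::string stores char elements");
static_assert(std::is_same<std::wstring::value_type, wchar_t>::value,
              "std::wstring stores wchar_t elements");

// Hypothetical demo function (not from the original post).
void show_code_units() {
    // "tést" spelled out as UTF-8 bytes: U+00E9 takes two bytes (C3 A9).
    const std::string narrow = "t\xC3\xA9st";
    // The same text as wide characters: U+00E9 fits in one wchar_t.
    const std::wstring wide = L"t\u00E9st";

    assert(narrow.size() == 5);  // five char elements, four characters
    assert(wide.size() == 4);    // four wchar_t elements
}
```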

Now the fact that you can store any encoding such as UTF-8 in a regular char/CHAR does not mean that higher level code like STL won't "helpfully" make assumptions about what *it* thinks the encoding is. As a hint, its assumption is neither "left as is, do what you want" nor UTF-8....
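One way to see why the CP_UTF8 call goes wrong is to look at the bytes. The sketch below assumes (this is a guess about the asker's setup, not something stated in the question) that the source file was saved in a Windows-1252 style ANSI code page, so the literal "tést" compiled to the bytes 74 E9 73 74. The helper is_valid_utf8 is a hypothetical function written for this illustration; it is not a Windows or standard library API. It simply checks whether a std::string holds well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if `s` contains only well-formed UTF-8.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point being assembled
        unsigned long min_cp;  // smallest code point allowed for this length
        if (lead < 0x80)                { extra = 0; cp = lead;        min_cp = 0; }
        else if ((lead & 0xE0) == 0xC0) { extra = 1; cp = lead & 0x1F; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { extra = 2; cp = lead & 0x0F; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { extra = 3; cp = lead & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (extra > n - i - 1) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(s[i + k]);
            if ((c & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp > 0x10FFFF) return false;                 // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate code point
        i += extra + 1;
    }
    return true;
}
```

Under that assumption, is_valid_utf8("t\xE9st") returns false, because E9 announces a three-byte sequence and the following 's' is not a continuation byte, while is_valid_utf8("t\xC3\xA9st") returns true. The same std::string type holds both byte sequences equally happily, which is why the code page passed to MultiByteToWideChar has to match the bytes that are really there.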

Now if you ask me, UTF-8 could potentially be a useful thing to implement as another specialization, for people who want to work extensively with UTF-8 directly. But that is a separate issue....

