by Michael S. Kaplan, published on 2010/09/01, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/01/10056685.aspx
The question:
Hello,
Can anyone please explain why I lose the non-English characters in a string when I convert a std::string to Unicode using the CP_UTF8 code page, but not when I use the ANSI code pages CP_ACP or CP_THREAD_ACP? I always thought std::string is utf_8 encoded!
Thanks in advance,
Yassine
std::string str("tést");
// First call: ask how many WCHARs are needed (the count includes the terminating null).
int size = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, NULL, 0);
if (size == 0)
{
    // GetLastError() returns a DWORD, so print it with %lu.
    printf("Failed to get the multi byte size [%lu]\n", GetLastError());
    return false;
}
printf("MultiByteToWideChar returned the size of: %d\n", size);
*wstr = new WCHAR[size];
// Second call: perform the conversion into the newly allocated buffer.
if (MultiByteToWideChar(CP_UTF8, 0, str.c_str(), -1, *wstr, size) == 0)
{
    printf("Failed to convert multi byte to wide char [%lu]\n", GetLastError());
    delete[] *wstr;
    *wstr = NULL;
    return false;
}
The answer is kind of in there, just a little bit.
Can you find it? :-)
It is right here:
I always thought std::string is utf_8 encoded!
Because it isn't.
If you want to store Unicode text, take a look at the basic_string docs:
CharType
The data type of a single character to be stored in the string. The Standard C++ Library provides two specializations of this template class, with the type definitions string, for elements of type char, and wstring, for elements of type wchar_t.
Now the fact that you can store any encoding, such as UTF-8, in a regular char/CHAR does not mean that higher-level code like the STL won't "helpfully" make assumptions about what *it* thinks the encoding is. As a hint, that assumption is neither "left as is, do what you want" nor UTF-8....
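To make the byte-level point concrete, here is a minimal sketch (not part of the original question) of a portable check for well-formed UTF-8. The function name is_valid_utf8 is made up for illustration. It rests on an assumption the question does not confirm: that the source file was saved in an ANSI code page such as Windows-1252, so the é in "tést" sits in the std::string as the single byte 0xE9. That byte sequence is not well-formed UTF-8, which would explain why a CP_UTF8 conversion loses the character while CP_ACP keeps it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns true if s holds only well-formed UTF-8.
// std::string itself never performs this check; it just stores chars.
bool is_valid_utf8(const std::string& s) {
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        std::size_t extra;     // continuation bytes expected after the lead byte
        unsigned long cp;      // code point being assembled
        unsigned long min_cp;  // smallest code point legal for this length

        if (c < 0x80)                { extra = 0; cp = c;        min_cp = 0; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min_cp = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min_cp = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence truncated at end of string

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char cc = static_cast<unsigned char>(s[i + k]);
            if ((cc & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min_cp) return false;                    // overlong encoding
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate

        i += extra + 1;
    }
    return true;
}
```

Under the stated assumption, is_valid_utf8("t\xE9st") returns false, because 0xE9 looks like the lead byte of a three-byte sequence and the bytes that follow are not continuation bytes. The same word encoded as UTF-8, is_valid_utf8("t\xC3\xA9st"), returns true.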
Now if you ask me, UTF-8 could potentially be a useful thing to implement as another specialization, for people who want to work extensively with UTF-8 directly. But that is a separate issue....