Unicode not being the default is slower and leads to bugs; maybe it ought to change?

by Michael S. Kaplan, published on 2008/03/24 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/03/24/8331966.aspx


Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)! 

Meanwhile over in the microsoft.public.dotnet.internationalization newsgroup, don rau asked:

I am trying to load a Korean string resource from a DLL.

I use LoadStringW after setting Thread.CurrentThread.CurrentCulture to Korean. 

The string I'm trying to receive should be comprised of the following 5 characters:
0xc6a9 0xc9c0 0x002f 0xd488 0xc9c8

however what I get is:
0x00a9 0x00c6 0x00c0 0x00c9 0x002f

Note the relationship between what I expected and what I received.

Anyone have ideas on what I'm doing wrong?

You might be able to see the pattern and know what is going on already. You might be reminded of Behind 'How to Break Windows Notepad', but that is not what is happening here, since there is no encoding guessing going on....

They wanted "용지/품질" and they got "©ÆÀÉ/" instead!

Regular reader Mihai had the answer for don rau:

Thread.CurrentThread.CurrentCulture should not affect the result.
Most likely you declared that LoadStringW uses ANSI strings.
Take a look at [MarshalAs(UnmanagedType.LPWStr)]
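Mihai's suggestion might look something like the sketch below. The NativeMethods class and the Load helper are hypothetical names chosen for illustration, the 256-character buffer is an arbitrary assumption, and the module handle and resource ID are left to the caller:

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class NativeMethods
{
    // No CharSet is named, so the declaration as a whole falls back to the
    // default (CharSet.Ansi). MarshalAs overrides that for the buffer only.
    [DllImport("user32.dll", EntryPoint = "LoadStringW", ExactSpelling = true)]
    public static extern int LoadString(
        IntPtr hInstance,
        uint uID,
        [MarshalAs(UnmanagedType.LPWStr)] StringBuilder lpBuffer,
        int nBufferMax);

    // Hypothetical helper: returns null when the resource is not found.
    public static string Load(IntPtr hModule, uint id)
    {
        var buffer = new StringBuilder(256);   // assumed to be large enough
        int length = LoadString(hModule, id, buffer, buffer.Capacity);
        return length > 0 ? buffer.ToString() : null;
    }
}
```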

That is one way to do it; the other way is to use something like the following p/invoke signature:

[DllImport("user32.dll", CharSet=CharSet.Unicode, EntryPoint="LoadStringW", ExactSpelling=true)]
static extern int LoadString(IntPtr hInstance, uint uID, StringBuilder lpBuffer, int nBufferMax);

Then you can let the .NET Framework do all the marshaling for you. (I always prefer explicit entry points and exact spellings, to avoid the real problems caused by the weird guessing that .NET does as part of its effort at Putting the *backward* in backward compatibility, and by its choice of a terrible CharSet default....)
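To see why an ANSI declaration yields exactly the five values don rau reported, the following sketch replays the mix-up in managed code, with no p/invoke involved. It assumes a Western European ANSI code page. ISO-8859-1 stands in for code page 1252 here because the two agree on the bytes involved. The class and method names are made up for illustration.

```csharp
using System;
using System.Text;

static class AnsiMisreadDemo
{
    // Simulates UTF-16 output from a W function being read back as a
    // NUL-terminated single-byte ("ANSI") string.
    public static string ReadUtf16AsAnsi(string text)
    {
        byte[] raw = Encoding.Unicode.GetBytes(text);  // UTF-16LE bytes
        int end = Array.IndexOf(raw, (byte)0);         // ANSI reader stops at the first zero byte
        if (end < 0) end = raw.Length;
        // Assumption: ISO-8859-1 as a stand-in for the system ANSI code page.
        return Encoding.GetEncoding("iso-8859-1").GetString(raw, 0, end);
    }

    static void Main()
    {
        string wanted = "\uC6A9\uC9C0/\uD488\uC9C8";   // the five expected characters
        foreach (char c in ReadUtf16AsAnsi(wanted))
            Console.Write("0x{0:x4} ", (int)c);
        Console.WriteLine();
        // prints: 0x00a9 0x00c6 0x00c0 0x00c9 0x002f
    }
}
```

Each UTF-16 code unit is split into two single-byte characters, low byte first. The result stops at the slash because the high byte of U+002F is zero, which a single-byte reader takes as the end of the string.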

Which leads me to wonder, actually. If Visual Studio can default new C++ projects to Unicode (as it now does, ref: The Unicode train is leaving the station), why couldn't they change the default for new projects to use CharSet.Unicode and all of the related behaviors in a new version of .NET?

All they would need to do is add a new project level property for the "Default CharSet value to use" -- if it is not specified (as would be the case for all pre-existing projects), then use CharSet.Ansi. Then have all new projects in the next version set that property to CharSet.Unicode by default. And that's all you need to move everyone over to using Unicode.
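For comparison, something close to this can already be opted into by hand. The sketch below uses the module-level DefaultCharSetAttribute from System.Runtime.InteropServices. It is not the project property proposed above, and this sketch does not assume any project template turns it on. The class name is hypothetical.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

// Supplies the CharSet for every DllImport in this module that does not
// name one itself.
[module: DefaultCharSet(CharSet.Unicode)]

static class UnicodeByDefault
{
    // No CharSet here: the module-level default above is used instead of
    // the usual CharSet.Ansi fallback.
    [DllImport("user32.dll", EntryPoint = "LoadStringW", ExactSpelling = true)]
    public static extern int LoadString(
        IntPtr hInstance, uint uID, StringBuilder lpBuffer, int nBufferMax);
}
```

Modules without the attribute keep the old behavior, which is the same compatibility story the proposal relies on.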

Kim, are you reading this? Maybe you could help push this where it needs to go over there, I'll buy lunch if you will! :-)

 

This blog brought to you by U (U+0055, aka LATIN CAPITAL LETTER U)


# Mike Dimmick on 24 Mar 2008 8:53 PM:

My god! Something that Compact Framework got right!

(CharSet.Ansi is not in .NET Compact Framework's vocabulary. Something about there not being any ANSI version of the APIs to call might have had something to do with this choice...)

# Michael S. Kaplan on 24 Mar 2008 10:45 PM:

I suspect that may be the reason. :-)

