It doesn't really support UTF-8

by Michael S. Kaplan, published on 2007/08/16 17:29 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/08/16/4421520.aspx


Aaron asks (via the contact link):

Michael,

Issue that I hope you can help with. Looks like the default on MACOSX is to do samba information using UTF-8-MAC, which is apparently decomposed normal form, rather than the normal UTF-8.

NetServerGetInfo doesn't seem to understand this, so assumes the characters are standard UTF-8, and thus the default "comment" from a Macintosh NetBios is wrong - it's supposed to be "Aaron's computer" (where the apostrophe is a smart quote), but instead, it's "AaronÕs Computer" (0xC3 0x95).

Are there any built-in Windows functions (not .NET) to convert between the UTF-8 that Windows uses and the UTF-8 that Macs use?

Thanks,
Aaron

FYI to everyone, the diagnosis above is wrong. I'll get to that in a minute.... 

Now, the normalization functionality makes it easy to move between Unicode normalization forms D and C, though the Windows function that does this (NormalizeString) requires UTF-16.

So you would have to convert the string to UTF-16 first via MultiByteToWideChar, use normalization, then convert it back to UTF-8 via WideCharToMultiByte when you are done.
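To make that round trip concrete, here is a minimal sketch in Python rather than Win32. It is an analogy only, not the Windows code path: `unicodedata.normalize` stands in for NormalizeString, and the `decode`/`encode` calls stand in for MultiByteToWideChar and WideCharToMultiByte. The helper name `utf8_to_nfc` is made up for illustration.

```python
import unicodedata


def utf8_to_nfc(data: bytes) -> bytes:
    """Convert UTF-8 bytes in any normalization form to composed (NFC) UTF-8."""
    # Step 1: leave UTF-8 behind and work with a real Unicode string
    # (the Win32 analogue would be MultiByteToWideChar with CP_UTF8).
    text = data.decode("utf-8")
    # Step 2: normalize to Form C
    # (the Win32 analogue would be NormalizeString).
    composed = unicodedata.normalize("NFC", text)
    # Step 3: go back to UTF-8
    # (the Win32 analogue would be WideCharToMultiByte with CP_UTF8).
    return composed.encode("utf-8")


# "e" followed by U+0301 COMBINING ACUTE ACCENT is the decomposed spelling of é.
decomposed = "e\u0301".encode("utf-8")
composed = utf8_to_nfc(decomposed)
print(decomposed, "->", composed)
```

Pure ASCII input passes through unchanged, since it is already in every normalization form.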

LUCKILY for Aaron, he does not have to do any of that, because the truth is that NetServerGetInfo does not use UTF-8, decomposed or otherwise. And the string in question is not decomposed anyway -- this is just a case of taking a UTF-8 string and pretending it is a string in the default system code page!

This will work great for ASCII, but anything beyond it will likely be wrong.
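A small Python sketch of what that kind of misreading looks like in general. It assumes code page 1252 as the default system code page (yours may differ), and `misread_as_cp1252` is a hypothetical helper, not anything Windows exposes.

```python
def misread_as_cp1252(utf8_bytes: bytes) -> str:
    """Pretend UTF-8 bytes are code page 1252 text (the mistake described above)."""
    return utf8_bytes.decode("cp1252")


# ASCII bytes mean the same thing in UTF-8 and in cp1252, so nothing breaks.
ascii_only = "Aaron's Computer".encode("utf-8")

# U+2019 RIGHT SINGLE QUOTATION MARK takes three bytes in UTF-8 (E2 80 99),
# and each of those bytes becomes its own cp1252 character when misread.
with_smart_quote = "Aaron\u2019s Computer".encode("utf-8")

print(misread_as_cp1252(ascii_only))
print(misread_as_cp1252(with_smart_quote))
```

The ASCII string survives untouched, while the smart quote turns into three unrelated characters.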

However, Aaron's problem is easily solved via a call to MultiByteToWideChar to convert the UTF-8 data to UTF-16, followed by a NetServerGetInfoW call rather than a NetServerGetInfoA one, and everything will work! :-) 

 

This post brought to you by Õ (U+00d5, a.k.a. LATIN CAPITAL LETTER O WITH TILDE)


Aaron on 16 Aug 2007 7:36 PM:

Thanks for the quick response - i'll give that a shot. What's interesting here is that the Shell doesn't do this - Macintosh comments tend to show up as squares (or the Õ) in the explorer UI, depending on where you look - but not the smart quote, as expected.

Michael S. Kaplan on 16 Aug 2007 7:59 PM:

That actually sounds like a Shell bug! :-)

Michael S. Kaplan on 16 Aug 2007 8:25 PM:

Well, actually it probably is something underneath the Shell that is making the mistake....

Dean Harding on 16 Aug 2007 9:06 PM:

Actually, I think the problem here is that the right-single-quote in the MacRoman codepage is 0xD5[1] but in Windows codepage 1252, 0xD5 is Õ.

Now, NetServerGetInfo is a Unicode-only function, but from what I understand, the NetBIOS network protocol is byte-oriented and so the data is *always* converted using the default codepage anyway[2]

So it looks like there's really nothing that can be done. You can possibly fix the problem in your own app by somehow "detecting" that the server is a Mac and doing an internal MacRoman->Win1252 conversion, but it'll always look wrong in Windows Explorer for example. But I think the "real" solution is to just stick with pure ASCII and live with the limitation :-(

[1] See: http://www.alanwood.net/demos/charsetdiffs.html

[2] Larry Osterman somewhat describes the problem here: http://blogs.msdn.com/larryosterman/archive/2007/07/11/how-do-i-compare-two-different-netbios-names.aspx

Aaron on 17 Aug 2007 10:40 AM:

Thanks Dean - your explanation makes a lot more sense. I'd agree with living with pure ASCII as a solution, but it's the default for Macs to put the 0xD5 into the comment field. So all Macintoshes will always exhibit this behavior, which is quite frustrating.

Netbios name seems to be correct, as Apple isn't using ASCII only there, so we are fortunate at that level. The issue is only in the comment field.

Is there any MultiByteToWideChar or other built-in API support to convert from MacRoman to CP_ACP (or CP_UTF8)?

Michael S. Kaplan on 17 Aug 2007 10:59 AM:

Agree++ that it needs to be handled, and Windows should do better interpreting here.

NLS makes available code page 10000 for Mac Roman, so we *could* convert this sucker. Someone just has to do the work....

Michael S. Kaplan on 17 Aug 2007 11:01 AM:

Aaron -- you need to think UTF-16 on Windows, not UTF-8. :-)

Yes, there is easy conversion from cp10000 to UTF-16.
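A rough sketch of that conversion, in Python rather than Win32. The assumptions: Python's `mac_roman` codec is used as a stand-in for Windows code page 10000, the comment bytes are a made-up sample with 0xD5 where the smart quote goes, and `mac_roman_to_utf16` is a hypothetical helper name. On Windows the equivalent step would presumably be MultiByteToWideChar with 10000 as the code page argument.

```python
# Hypothetical raw comment bytes as a Mac might send them: 0xD5 is the
# right single quotation mark in Mac Roman.
COMMENT_BYTES = b"Aaron\xd5s Computer"


def mac_roman_to_utf16(data: bytes) -> bytes:
    """Interpret bytes as Mac Roman and re-encode them as UTF-16 (little endian)."""
    return data.decode("mac_roman").encode("utf-16-le")


# The same byte read two ways: Mac Roman gives U+2019, cp1252 gives U+00D5.
as_mac_roman = COMMENT_BYTES.decode("mac_roman")
as_cp1252 = COMMENT_BYTES.decode("cp1252")

print(as_mac_roman)
print(as_cp1252)
print(mac_roman_to_utf16(b"\xd5").hex())
```

This only helps if you already know the bytes came from a Mac; nothing in the byte 0xD5 itself says which code page it belongs to.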

Mihai on 17 Aug 2007 1:08 PM:

"So you would have to convert the string to UTF-16 first via MultiByteToWideChar, use normalization, then convert it back to UTF-8 via WideCharToMultiByte when you are done."

This does not work for the same reason why case conversion on Windows file system cannot be handled by normal Windows API. You cannot reproduce the file system behavior "outside" the file system.

- The tables used by Windows to do normalization are different from the tables used by Mac OS X to do normalization.

- It also seems like Mac OS X does not use the standard normalization anyway ("HFS Plus uses a variant of Normal Form D": http://developer.apple.com/qa/qa2001/qa1173.html).

- And the normalization tables changed between the various Mac OS X versions

Michael S. Kaplan on 17 Aug 2007 1:19 PM:

We're talking about the comment field, not the name. I think cp10000 to Unicode to display it should work just fine!

Mihai on 18 Aug 2007 12:35 AM:

"We're talking about the comment field, not the name"

Oups, sorry, my bad :-(

