It isn't Unicode, it's Double Secret Unicode!

by Michael S. Kaplan, published on 2005/10/28 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/28/486019.aspx

Visual Basic ≤ 6.0 has Unicode strings. But as I point out in Chapter 6 of my book, it converts out of Unicode to the default system code page any time you do just about anything to those strings.

Lots of people try to avoid this by calling the StrConv function with the vbUnicode parameter first, figuring they convert the string to Unicode, VB converts it back out, and all is well.

Ugh.

Remember how I said that VB strings were already Unicode strings?

Well, calling StrConv to convert a Unicode string to Unicode is not a no-op; it actually converts the string to Double Secret Unicode, an encoding not found in nature.

So your string:

My name is Michael.

which is laid out as follows:

004d 0079 0020 006e 0061 006d 0065 0020 0069 0073 0020 004d 0069 0063 0068 0061 0065 006c 002e

will be converted by VB (which thinks it is not Unicode since you are converting it) to:

0000 004d 0000 0079 0000 0020 0000 006e 0000 0061 0000 006d 0000 0065 0000 0020 0000 0069 0000 0073 0000 0020 0000 004d 0000 0069 0000 0063 0000 0068 0000 0061 0000 0065 0000 006c 0000 002e

and this "Double Secret Unicode" is basically meaningless.
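
A minimal sketch for seeing this in the Immediate window, assuming a standard module in a VB6 project. The procedure name ShowDoubleSecretUnicode is made up for illustration. It only checks the length and the number of null characters, which is enough to see the doubling:

```vb
' Hypothetical demo: StrConv(..., vbUnicode) applied to a string that is
' already Unicode doubles its length and fills it with null characters.
Public Sub ShowDoubleSecretUnicode()
    Dim original As String
    Dim doubled As String
    Dim i As Long
    Dim nullCount As Long

    original = "My name is Michael."
    doubled = StrConv(original, vbUnicode)

    ' Count the U+0000 characters introduced by the extra conversion.
    For i = 1 To Len(doubled)
        If AscW(Mid$(doubled, i, 1)) = 0 Then nullCount = nullCount + 1
    Next i

    Debug.Print Len(original)   ' expected: 19
    Debug.Print Len(doubled)    ' expected: 38
    Debug.Print nullCount       ' expected: 19
End Sub
```

The expected values assume a plain ASCII input like the one above; other text can behave differently depending on the default system code page.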

Luckily when VB converts it out of Unicode you will get what you are looking for. So maybe you converted twice for no reason. No need to worry, right?

Wrong.

If you look at my posts "The new compiler error C4819" and "How does it detect invalid characters?" then you'll start to get a glimmer of where this method gets into trouble.

The trouble is that for most strings in Unicode you do not have those things that look like NULL values. And any time you are on a machine whose default system code page does not contain all of the characters in question (and as those posts indicated, this is not unheard of), the conversion back from Double Secret Unicode will not perfectly round trip to the original string. And once you get into Chinese, Japanese, and Korean, the exacting requirements of particular lead bytes and trail bytes can cause you to lose ideographs in all kinds of unexpected ways.

In summary, you will lose information.
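
One way to check a particular string on a particular machine is sketched below. The helper name SurvivesRoundTrip is hypothetical, and the explicit StrConv(..., vbFromUnicode) call is an assumption standing in for the conversion VB performs on its own when a String goes out to a Declare. The result depends on the default system code page, so no output is promised here:

```vb
' Hypothetical helper: does a string survive the trip into
' "Double Secret Unicode" and back on this machine?
Public Function SurvivesRoundTrip(ByVal s As String) As Boolean
    Dim doubled As String
    Dim restored As String

    doubled = StrConv(s, vbUnicode)             ' the extra conversion
    restored = StrConv(doubled, vbFromUnicode)  ' stand-in for VB converting back out

    ' Binary comparison, so any substituted or dropped character counts as a loss.
    SurvivesRoundTrip = (StrComp(s, restored, vbBinaryCompare) = 0)
End Function
```

Calling it as Debug.Print SurvivesRoundTrip(someString) with text from several scripts, on machines with different system code pages, is a quick way to see where the losses start.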

The way around this is really quite simple. Any time you are going to pass a string to an external function, declare it ByVal As Long rather than ByVal As String, and then pass StrPtr(<your string>) to that Long. For example:

Public Declare Function CompareStringW Lib "kernel32" ( _
    ByVal Locale As Long, _
    ByVal dwCmpFlags As Long, _
    ByVal lpString1 As Long, _
    ByVal cchCount1 As Long, _
    ByVal lpString2 As Long, _
    ByVal cchCount2 As Long) As Long

Public Enum CmpFlags
    STRINGSORT = &H1000&
    IGNORECASE = &H1&
    IGNORENONSPACE = &H2&
    IGNORESYMBOLS = &H4&
    IGNOREKANATYPE = &H10000
    IGNOREWIDTH = &H20000
End Enum

Public Enum CSTR_
    LESS_THAN = 1 ' string 1 less than string 2
    EQUAL = 2 ' string 1 equal to string 2
    GREATER_THAN = 3 ' string 1 greater than string 2
End Enum

Public Function CompareString(ByVal lcid As Long, ByVal flags As CmpFlags, ByVal st1 As String, ByVal st2 As String) As CSTR_
    CompareString = CompareStringW(lcid, flags, StrPtr(st1), Len(st1), StrPtr(st2), Len(st2))
End Function

And if you use this technique, the call will be faster (you avoid two extraneous conversions) and you will never lose data from conversion mistakes.
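
A call to the wrapper might look like the sketch below, assuming the declarations above live in the same standard module. The procedure name DemoCompare is made up for illustration, and the LOCALE_USER_DEFAULT constant (&H400) is declared locally because the original code does not define it:

```vb
' Hypothetical usage of the CompareString wrapper defined above.
Public Sub DemoCompare()
    Const LOCALE_USER_DEFAULT As Long = &H400&
    Dim result As CSTR_

    ' Case-insensitive comparison; no ANSI conversion happens on the way in.
    result = CompareString(LOCALE_USER_DEFAULT, IGNORECASE, "michael", "MICHAEL")

    Select Case result
        Case LESS_THAN:    Debug.Print "string 1 sorts first"
        Case EQUAL:        Debug.Print "strings are equal"
        Case GREATER_THAN: Debug.Print "string 2 sorts first"
        Case Else:         Debug.Print "CompareStringW failed" ' it returns 0 on failure
    End Select
End Sub
```

With the IGNORECASE flag, these two strings should compare as equal.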

Best of all, you avoid the evil Double Secret Unicode!

This post brought to you by "٭" (U+066d, a.k.a. ARABIC FIVE POINTED STAR)

# Gabe on 28 Oct 2005 3:13 AM:

This Double Secret Unicode looks a bit too similar to UTF-32. And that Arabic 5-pointed star looks like it has 8 points. I definitely think there's some sort of conspiracy at Microsoft.

# Vorn on 28 Oct 2005 3:15 AM:

Linked image shows what I see on my Mac for your sponsoring character. this one's particularly silly.

Vorn

# Michael S. Kaplan on 28 Oct 2005 3:26 AM:

Gabe -- it is not UTF-32, trust me. It will shred anything not English....

# Nick Lamb on 28 Oct 2005 9:33 AM:

Well, U+066d is cross-referenced to U+002a (asterisk) and it's arguable that in most cases asterisk is a suitable fallback character if your system has no glyph at all for U+066d... it's certainly better than the more or less arbitrary "fallback" glyphs Apple provide.

So, Vorn, do you have any (probably Arabic) fonts with a five pointed star on your Mac? If not this result is arguably correct.

On the other hand some systems seem to get this stuff right more than others. By default the renderer should prefer U+066d from an Arabic font over U+002a from the programmer's / user's first choice font, not least because it's likely to be surrounded by other Arabic for which no such substitute is possible and a latin asterisk will spoil the flow. Fileformat.info gets this wrong (they obviously have an Arabic font, but it isn't used for this character). The Pango renderer I'm using gets it right in my web browser and other apps, but the renderer used in xterm doesn't do fallback (and of course there's no fixed pitch Arabic font), so you get the infamous dotted box. Does IE get it right on Windows, with Arabic fonts installed but not set as first choice?

# Mike on 28 Oct 2005 9:53 AM:

Funny, I was re-reading your VB book (page 166-167) about that exact topic so I could reply in the newsgroups - thanks for that expanded answer.

# Vorn on 28 Oct 2005 3:26 PM:

I have 11 fonts installed with U+066d installed, plus three font variants, for a total of 14. Three of them have eight-pointed stars (Al Bayan, Al Bayan Bold, DecoType Naskh), six of them have six-pointed stars (Baghdad, Geeza Pro, Geeza Pro Bold, Geezah, KufiStandardGK, Nadeem), and five have five-pointed stars (STFangsong, STHeiti, STHeiti Light, STKaiti, STSong). I appear to get my version of the arabic five-pointed star from Baghdad.

Vorn

# robert on 29 Oct 2005 4:12 PM:

I tried StrPtr and pass the original VB string pointer to a C program, but c program doesn't recognize it as LPWSTR, instead, just treat it as one wchar. No idea why? The vb string should be terminated by null, then the VB string char array should be a perfect LPWSTR, but why it's not recognized by C?

# Michael S. Kaplan on 29 Oct 2005 4:34 PM:

Well, it would actually be a WCHAR array, and if it is English then it will look like one byte (char) followed by a NULL. This trick is for when you want a Unicode string.... :-)

# robert on 30 Oct 2005 11:23 AM:

Just the update for my post regarding strptr and LPWSTR. finally I figured out that VC6 did work with the pointer passed by strptr, only thing is the IDE itself will not show the string content when you debug (it only show one char instead of whole string for the pointer content). If we check the memory, all string content are there and the code itself will treat the content as string and all string functions work fine.

My issue actually is when you do post in wininet. If you post the unicode string to ASP.net page, looks like the post data will not be recognized, even I set the charset in post as unicode, and set asp.net request contentencoding as unicode.

Now the issue may be beyond the original discussion, but I would like to share what I learned from the issue: sometimes we have to convert unicode back to ANSI --- but it will be the MBCS using utf-8 encoding. Many cases utf-8 encoded ANSI (I mean single byte string) instead of unicode is what we really need. The concept of MBCS may be important for those who are using unicode for multi-language development.

# Michael S. Kaplan on 30 Oct 2005 2:32 PM:

UTF-8 is indeed important in many cases, but it is never ANSI (only the 'misnamed' ANSI code pages in Windows are that!).

