by Michael S. Kaplan, published on 2005/10/28 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/28/486019.aspx
Visual Basic ≤ 6.0 has Unicode strings. But as I point out in Chapter 6 of my book, it converts out of Unicode to the default system code page any time you do just about anything to those strings.
Lots of people try to avoid this by calling the StrConv function with the vbUnicode parameter first, figuring they convert the string to Unicode, VB converts it back out, and all is well.
Remember how I said that VB strings were already Unicode strings?
Well, calling StrConv to convert a Unicode string to Unicode is not a no-op; it actually converts the string to Double Secret Unicode, an encoding not found in nature.
So your string:
My name is Michael.
which is laid out as follows:
004d 0079 0020 006e 0061 006d 0065 0020 0069 0073 0020 004d 0069 0063 0068 0061 0065 006c 002e
will be converted by VB (which thinks it is not Unicode since you are converting it) to:
004d 0000 0079 0000 0020 0000 006e 0000 0061 0000 006d 0000 0065 0000 0020 0000 0069 0000 0073 0000 0020 0000 004d 0000 0069 0000 0063 0000 0068 0000 0061 0000 0065 0000 006c 0000 002e 0000
and this "Double Secret Unicode" is basically meaningless.
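If you want to see the doubling for yourself, here is a minimal sketch you could paste into a standard module. The procedure name ShowDoubleSecretUnicode is made up for illustration; only StrConv, Len, and Debug.Print are real VB calls.

```vb
' Hypothetical demo procedure; the name is not from the original post.
Public Sub ShowDoubleSecretUnicode()
    Dim original As String
    Dim doubled As String

    original = "My name is Michael."

    ' StrConv treats the bytes of the existing Unicode string as if they
    ' were code page text and widens each byte to a 16-bit code unit.
    doubled = StrConv(original, vbUnicode)

    Debug.Print Len(original)   ' 19 characters
    Debug.Print Len(doubled)    ' 38 characters: twice as many code units
End Sub
```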
Luckily when VB converts it out of Unicode you will get what you are looking for. So maybe you converted twice for no reason. No need to worry, right?
If you look to my posts The new compiler error C4819 and How does it detect invalid characters? then you'll start to get a glimmer of where this method gets into trouble.
Most strings in Unicode do not have those things that look like NULL values, though. So any time you are on a machine whose default system code page does not contain all of the characters in question (and as those posts indicated, this is not unheard of), the conversion back from Double Secret Unicode will not perfectly round-trip to the original string. And once you get into Chinese, Japanese, and Korean, the exacting requirements of particular lead bytes and trail bytes can cause you to lose ideographs in all kinds of unexpected ways.
In summary, you will lose information.
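One way to check whether a particular string survives is to simulate the round trip explicitly. The sketch below assumes that the conversion VB applies to a ByVal As String argument behaves like StrConv with vbFromUnicode; the function name SurvivesRoundTrip is hypothetical, and the answer depends on the default system code page of the machine you run it on.

```vb
' Hypothetical helper; the name is not from the original post.
Public Function SurvivesRoundTrip(ByVal s As String) As Boolean
    Dim doubled As String
    Dim restored As String

    ' Step 1: what people do by hand before the call.
    doubled = StrConv(s, vbUnicode)

    ' Step 2: stand-in for the conversion VB applies to a String argument
    ' of a Declare'd function (assumed equivalent to vbFromUnicode).
    restored = StrConv(doubled, vbFromUnicode)

    ' Binary comparison so that no code unit difference is ignored.
    SurvivesRoundTrip = (StrComp(s, restored, vbBinaryCompare) = 0)
End Function
```

Running it over strings your application actually handles, on machines with different system code pages, is a cheap way to find out which ones come back intact.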
The way around this is really quite simple. Any time you are going to pass a string to the Unicode ("W") version of an external function, declare the parameter ByVal As Long rather than ByVal As String, and then pass StrPtr(<your string>) for that Long. VB never touches the string data, so nothing gets converted. For example:
Public Declare Function CompareStringW Lib "kernel32" ( _
    ByVal Locale As Long, _
    ByVal dwCmpFlags As Long, _
    ByVal lpString1 As Long, _
    ByVal cchCount1 As Long, _
    ByVal lpString2 As Long, _
    ByVal cchCount2 As Long) As Long
Public Enum CmpFlags
    STRINGSORT = &H1000&
    IGNORECASE = &H1&
    IGNORENONSPACE = &H2&
    IGNORESYMBOLS = &H4&
    IGNOREKANATYPE = &H10000
    IGNOREWIDTH = &H20000
End Enum
Public Enum CSTR_
    LESS_THAN = 1 ' string 1 less than string 2
    EQUAL = 2 ' string 1 equal to string 2
    GREATER_THAN = 3 ' string 1 greater than string 2
End Enum
Public Function CompareString(ByVal lcid As Long, ByVal flags As CmpFlags, ByVal st1 As String, ByVal st2 As String) As CSTR_
    ' StrPtr hands the API a pointer to the string's own Unicode data, so no conversion happens.
    CompareString = CompareStringW(lcid, flags, StrPtr(st1), Len(st1), StrPtr(st2), Len(st2))
End Function
And if you use this technique, the call will be faster (you avoid two extraneous conversions) and you will never lose data from conversion mistakes.
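Calling the wrapper could look something like the sketch below. It assumes the declarations above live in the same standard module; DemoCompare is a made-up name, and LOCALE_USER_DEFAULT is the Win32 constant for the current user's locale, declared here because VB does not define it.

```vb
' LOCALE_USER_DEFAULT is the Win32 value for the current user's locale.
Private Const LOCALE_USER_DEFAULT As Long = &H400&

' Hypothetical demo procedure; relies on the CompareString wrapper above.
Public Sub DemoCompare()
    Dim result As CSTR_

    result = CompareString(LOCALE_USER_DEFAULT, IGNORECASE, "michael", "MICHAEL")

    Select Case result
        Case LESS_THAN:    Debug.Print "first string sorts first"
        Case EQUAL:        Debug.Print "strings compare as equal"
        Case GREATER_THAN: Debug.Print "second string sorts first"
        Case Else:         Debug.Print "CompareStringW failed"
    End Select
End Sub
```

The Case Else branch is there because CompareStringW returns 0 on failure, which is none of the three CSTR_ values.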
Best of all, you avoid the evil Double Secret Unicode!
This post brought to you by "٭" (U+066d, a.k.a. ARABIC FIVE POINTED STAR)
2006/10/08 The return of double secret Unicode!
2006/06/08 DEP is not affected by locale settings