Converting a project to Unicode: Part 3 (Can I quote you on that?)

by Michael S. Kaplan, published on 2006/12/30 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/12/30/1382886.aspx


Previous posts in this series (including today's!):

(If you are just tuning in and want to start now you can grab the current source from here.) 

The biggest source of actual changes in most conversions of legacy projects to Unicode is handling hard-coded strings. The simple fact is that what you might have in your code as

"This is a string"

and which in a purely Unicode application would be

L"This is a string"

now will have to be either

1)        TEXT("This is a string")

or

2)        _TEXT("This is a string")

or

3)        _T("This is a string")

I myself prefer the third one but many people like the first or the second. You can look at the MSDN topic Using Generic-Text Mappings to get more information on _T and _TEXT; the one with no underscore prefix is actually defined in the Platform SDK header file winnt.h and this is the reason why it is used by Windows header files that do not want to include tchar.h in their source files.

(If you are bored, the section of winnt.h with the // Neutral ANSI/UNICODE types and macros comment is where these all are.)

Since we will need tchar.h for a few CRT functions you can pretty much take your pick -- the other reason some people prefer the shorter one is that they consider it less distracting (I consider them all to be about the same in that respect).

I am going to use TEXT() for reasons I will point out shortly.
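For anyone who has not looked inside these macros, here is a minimal portable sketch of the idea behind them. MY_TEXT, my_tchar, my_tcslen and kGreeting are hypothetical stand-ins invented for illustration; they are not the actual winnt.h or tchar.h definitions, just the same token-pasting trick under the assumption that a UNICODE define selects the wide build.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for TCHAR and TEXT(); not the real SDK definitions.
#ifdef UNICODE
typedef wchar_t my_tchar;
#define MY_TEXT_IMPL(quote) L##quote   // paste an L prefix onto the literal
#else
typedef char my_tchar;
#define MY_TEXT_IMPL(quote) quote      // leave the literal as a narrow string
#endif
// The extra level lets a macro argument expand before the prefix is pasted.
#define MY_TEXT(quote) MY_TEXT_IMPL(quote)

// Count the characters in a generic-text string, whichever type my_tchar is.
inline std::size_t my_tcslen(const my_tchar* s) {
    assert(s != nullptr);
    std::size_t n = 0;
    while (s[n] != 0) ++n;
    return n;
}

// The same source line compiles as a char or a wchar_t array.
static const my_tchar kGreeting[] = MY_TEXT("This is a string");
```

The point of the sketch is that the literal is written once and the build flag decides its type, which is what lets the project stay compilable both ways.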

There are several different ways to approach this kind of change:

  1. Compile the program with UNICODE/_UNICODE defined and catch all of the compile errors that yesterday's changes will cause due to the mismatch between Unicode types and non-Unicode strings
  2. Simply find every occurrence of " (U+0022, a.k.a. QUOTATION MARK) and ' (U+0027, a.k.a. APOSTROPHE) and each time it is appropriate, surround the quoted string with TEXT() or _T()
  3. Use regular expressions to do the find and replace, either using the VS :q (quoted string) or its equivalent in your tool, e.g. (("[^"]*")|('[^']*'))

If you prefer #1, then you may want to skip today's post and wait for Part 4, tomorrow (which is when we will be doing that). Today is dedicated to taking care of over 100 cases without the compile-time checking....

I tend to prefer #3 myself, so your find/replace box in VS will look something like this:

The most important things to note are the syntax for tagging an expression (in VS, surround it in curly braces) and the syntax for using the tagged expression in the replacement string (in VS, \# where # is 1-9 and identifies which tagged expression to use).
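If your tool is not VS, the same tag-and-reuse idea can be sketched with std::regex, where the tagged expression is a parenthesized group and the replacement refers to it as $1 rather than \1. The function name wrap_quoted is hypothetical, and the pattern is deliberately the simple one (double-quoted strings only), so it knows nothing about #include lines, comments, or escaped quotes.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Wrap every simple double-quoted literal on one line in TEXT().
// Naive on purpose: no handling of escaped quotes, comments or #include lines.
inline std::string wrap_quoted(const std::string& line) {
    // Group 1 is the whole quoted string, including both quotation marks.
    static const std::regex quoted("(\"[^\"]*\")");
    std::string out = std::regex_replace(line, quoted, "TEXT($1)");
    assert(out.size() >= line.size());  // wrapping only ever adds characters
    return out;
}
```

Something like this is best used to preview candidates one line at a time, not as a blind rewrite of a whole file.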

There are many strings you won't want to affect, including obvious ones like

#include "common.h"

and there are even a few "already done" strings like this one from common.h in the source:

#define ISETUPPROPNAME_BASEURL              TEXT("BASEURL")

Note that this code is probably shared with other Windows Installer source code projects (like maybe msistuff.exe's?), which would be why it is written with Unicode in mind even though most of the rest of the project is not. And why it would have a name like common.h.

If this convinces you to use TEXT() so that the rest of the project matches, go ahead; like I said, you can use whatever you like (TEXT() is what I chose here!).

The other bonus is that it will keep us from having to include tchar.h in files just for this definition (if you have been trying it you will see that the source is still compiling right now, before we move it to Unicode).

Of course function names are another case where you do not want to wrap them in TEXT macros since the function names will go to GetProcAddress calls. So you would wrap "advapi32.dll" but you would not wrap "CheckTokenMembership" (a function inside advapi32.dll). Though if you mess this up don't worry, it will be a simple compile error later, very easy to fix....

One other interesting string that needs special handling:

"\""

Which we want to become:

TEXT("\"")

and not

TEXT("\")"

obviously. The simple regular expression is not quite smart enough for the escaped quote case (there are about five of these in the source). If you want to try and create a more complex regular expression you are welcome to!
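For the curious, one possible shape for that more complex expression is to let the body of the string be either an ordinary character or a backslash followed by any character. This is only a sketch using std::regex (ECMAScript syntax, not the VS Find/Replace syntax), the function name wrap_quoted_escaped is hypothetical, and it still handles only double-quoted strings and still cannot tell an #include or a comment from real code.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Wrap double-quoted literals in TEXT(), treating \" (and any other
// backslash escape) as part of the string instead of as its end.
inline std::string wrap_quoted_escaped(const std::string& line) {
    // Opening quote, then any run of (non-quote, non-backslash) characters
    // or (backslash + any character), then the closing quote.
    static const std::regex quoted(R"re("(?:[^"\\]|\\.)*")re");
    // $& stands for the entire match.
    std::string out = std::regex_replace(line, quoted, "TEXT($&)");
    assert(out.size() >= line.size());  // wrapping only ever adds characters
    return out;
}
```

Even with the smarter pattern, stepping through matches with Find Next remains the safer habit.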

In any case, hopefully I have convinced you that you will definitely want to be careful about your use of Find Next vs. Replace -- and definitely not be tempted by Replace All. :-)

Other things you do not need to "fix" are pretty much anything in the makefile, or anything in a comment (unless you want to amuse future code reviewers).

Now after you go through all of these, you will have noticed that 56 of the strings to be edited were in calls to one of the three overloads of the DebugMsg function found in utils.cpp. I would recommend you go ahead and fix them up too, since (a) you have already changed their datatypes anyway, (b) they all call OutputDebugString which will map to OutputDebugStringW after we compile UNICODE, and (c) there is no harm in seeing Unicode text if you run a debugger that supports Unicode. :-)

Amazingly, we are much, much closer now!

We'll do one more big find/replace in today's post. There are several places in the source code where GetProcAddress is being called to get the address of an ANSI function rather than a Unicode one. Let's fix those up right now. You could search for GetProcAddress, but in this project (as in most other projects) it just goes to constants. Just remember (like I said before) -- you always want to make sure that you do not put the TEXT() macro wrapper around function names since GetProcAddress's second parameter never expects a Unicode string. You DO want them around library names and just about everything else.

The easiest way to find all of the occurrences is the following search:

It is pretty rare to ever have a string that ends with a capital A that you wouldn't want to become a capital W, so although you will want to check each one, you are unlikely to have a ton of noise in the results....

Believe it or not we are getting rather close now (tomorrow we're going to take the next step to find major things to look at).

Stay tuned....

 

This post brought to you by ဃ (U+1003, a.k.a. MYANMAR LETTER GHA)


Mihai on 30 Dec 2006 5:28 PM:

In the second post you had this:

<<First and foremost, I will not be turning an "A" binary into a "W" binary; the plan is to turn it into a "T" binary a-la TCHAR.H and so on. When I am done I want something that I could compile either way without requiring further code change, if for no other reason than if it did prove to be too difficult and the idea got postponed, I wouldn't have to throw away my changes. :-)>>

So I would not change the names of the GetProcAddress retrieved APIs from A to W, but go with a condition.

Example:

 #ifdef UNICODE

 #define MSIAPI_MsiInstallProduct "MsiInstallProductW"

 #else

 #define MSIAPI_MsiInstallProduct "MsiInstallProductA"

 #endif // UNICODE
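With many such constants, one possible way to keep the condition in a single place is a small helper macro that appends the suffix through string literal concatenation. API_NAME and has_expected_suffix are hypothetical names invented for this sketch, not anything from the project's source. The sketch assumes the usual convention that a UNICODE build wants the "W" export and a non-UNICODE build wants the "A" export, and the result stays a narrow char string either way, which is what GetProcAddress takes.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: pick the "A" or "W" export name at compile time.
// Adjacent string literals are concatenated, so the result is one narrow literal.
#ifdef UNICODE
#define API_NAME(base) base "W"
#else
#define API_NAME(base) base "A"
#endif

#define MSIAPI_MsiInstallProduct API_NAME("MsiInstallProduct")

// Sanity check: does the name end in the suffix that matches this build?
inline bool has_expected_suffix(const char* name) {
    assert(name != nullptr && *name != '\0');
    const char last = name[std::strlen(name) - 1];
#ifdef UNICODE
    return last == 'W';
#else
    return last == 'A';
#endif
}
```

Note that the constant is deliberately not wrapped in TEXT(), since only the suffix varies with the build, not the character type.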

Mihai on 30 Dec 2006 5:35 PM:

<<three overloads of the DebugMsg function found in utils.cpp. I would recommend you go ahead fix them up too>>

Reason (d): some calls have Unicode strings as parameters:

 DebugMsg(TEXT("[Info] Downloading msi file %s for WinVerifyTrust check\n"), szInstallPath);

So if the szFormat is not Unicode, but a parameter is, in DebugMsg itself you will either have to convert szFormat to Unicode (allocating memory, and such), or convert the Unicode parameter to ANSI, really not an option.

Having szFormat Unicode to begin with is definitely cleaner.

Michael S. Kaplan on 30 Dec 2006 5:50 PM:

Ah, good point on the function names, I will fix up the code right now. :-)

Michael S. Kaplan on 30 Dec 2006 7:39 PM:

(For what it's worth, the particular issue that led to the problem was going to come up in Part 6, and I will still leave the discussion about it in then. The example was less impressive than this one, but a better example never hurts an argument!)

Mike Dimmick on 3 Jan 2007 12:26 PM:

Re: DebugMsg - I haven't downloaded the source and checked, but if it's calling a sprintf()-like function, it's useful to know that:

%s matches string-of-TCHAR

%hs always matches string-of-char

%ls always matches string-of-WCHAR

%hs and %ls will do any necessary conversions for you, albeit using the thread's current default ANSI code page. I believe these are MS CRT extensions.

About 90% of my C++ coding is for Windows CE, which has only ever been UCS-2/UTF-16, but I still code almost all of it using TCHAR and the corresponding pointer types and functions. There are a few places which still use char-based functions, for example the socket functions gethostbyname, inet_addr and inet_ntoa, and of course you regularly have to deal with source data in byte-oriented character sets.

Michael S. Kaplan on 4 Jan 2007 2:36 PM:

Hey Mike!

All of the strings passed to DebugMsg that have string inserts use %s for the strings, so I think we are okay here (the ones that need inserts use StringCchPrintf).


referenced by

2007/12/24 VS just got served!, aka The ??? Shift, aka 'Converting a project to Unicode???' No, it's 'Converting a project??? ToUnicode!!!'

2007/01/05 Converting a project to Unicode: Part 9 (The project's postpartum postmortem)

2007/01/04 Converting a project to Unicode: Part 8 (Fitting MSLU into the mix)

2007/01/03 Converting a Project to Unicode: Part 7 (What does it mean to fit things to a 'T', anyway?)

2007/01/02 Converting a Project to Unicode: Part 6 (Upon the road not traveled)

2007/01/01 Converting a Project to Unicode: Part 5 (Are we there yet? Well, not *just* yet)

2006/12/31 Converting a Project to Unicode: Part 4 (It's /Delightful, it's /Delicious, it's /DUnicode!)
