Converting a project to Unicode: Part 9 (The project's postpartum postmortem)

by Michael S. Kaplan, published on 2007/01/05 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/01/05/1413001.aspx

Previous posts in this series (including today's!):

(If you are just tuning in and want to start now that we are done, you can grab the latest source from here)

If you look at the source, you'll see I chickened out of always adding MSLU to Unicode builds, so there is makefile.mslu and a makefile.uni. :-)

Now that we have gone through and taken an application that is actually useful and converted it to Unicode, I figured for the review it would be good to talk about it a bit.

(I honestly did not look at the code until after deciding to do the series, so this is a true postmortem decision about the effort!)

As projects go, this one was fairly tame: a few issues came up for discussion, but only a few. To compare briefly, the kbdtool.exe --> kbdutool.exe conversion I mentioned back in Part 0 made extensive use of the C Runtime for its file handling, parsing, and file creation. So the single example of strtoul being converted to _tcstoul that I talked about in Part 4 would have to be multiplied out to the 131 such changes that were required there. In the real world of app conversion, then, you could find that the actual effort takes more time even if you do not run into any problems more complex than the ones we dealt with here.
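For anyone who has not seen one of these changes up close, here is a rough sketch of the strtoul-to-_tcstoul pattern. It is not code from either project. To keep it compilable without the Windows headers, the names xchar, XT, x_strtoul, UNICODE_BUILD and parse_number are all made-up stand-ins for what <tchar.h> provides (TCHAR, _T, _tcstoul, and the _UNICODE switch).

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-ins for the <tchar.h> generic-text mappings.
   Define UNICODE_BUILD to select the wide-character variants. */
#ifdef UNICODE_BUILD
typedef wchar_t xchar;
#define XT(s) L##s
#define x_strtoul wcstoul
#else
typedef char xchar;
#define XT(s) s
#define x_strtoul strtoul
#endif

/* Parse an unsigned number in the given base.
   Returns 0 when no digits could be consumed. */
unsigned long parse_number(const xchar *text, int base)
{
    xchar *end = NULL;
    unsigned long value;

    assert(text != NULL);
    value = x_strtoul(text, &end, base);
    return (end == text) ? 0UL : value;
}
```

The call site, parse_number(XT("131"), 10), reads the same in both builds; only the macro expansion changes. Each individual edit is that small, which is why the count of edits ends up driving the effort.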

Another interesting comment, made by Mike Dimmick on Part 3, raised an issue with printf-esque format specifiers, which have outrageous rules in relation to Unicode conversion:

Character	Type	Output format
c	int or wint_t	When used with printf functions, specifies a single-byte character; when used with wprintf functions, specifies a wide character.
C	int or wint_t	When used with printf functions, specifies a wide character; when used with wprintf functions, specifies a single-byte character.
hc, hC	int or wint_t	Specifies a single-byte character; it is always interpreted as type CHAR, even when the calling application uses the #define UNICODE compile flag.
hs, hS	String	Specifies a string; it is always interpreted as type LPSTR, even when the calling application uses the #define UNICODE compile flag.
lc, lC	int or wint_t	Specifies a wide character; it is always interpreted as type WCHAR, even when the calling application does not use the #define UNICODE compile flag.
ls, lS	String	Specifies a string; it is always interpreted as type LPWSTR, even when the calling application does not use the #define UNICODE compile flag.
s	String	When used with printf functions, specifies a single-byte–character string; when used with wprintf functions, specifies a wide-character string. Characters are printed up to the first null character or until the precision value is reached.
S	String	When used with printf functions, specifies a wide-character string; when used with wprintf functions, specifies a single-byte–character string.

Now I can completely understand why every single one of these format specifiers exist, but you can see why there is a potential for strange results as one moves a project to Unicode, since one is not only dealing with the conversion of the application but in some cases one is dealing with parsing and manipulating data from other sources that may or may not also be converted at the same time.

In our case, the extensive use of formatting strings in the DebugMsg function was always used by callers with the %s type, so everything worked out. But if you are converting an application that uses anything other than %c and %s from the above table, you can have a much harder job deciding how to convert the project.
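To make the ambiguity concrete, here is a small sketch in standard C. It sticks to %s and %ls, since %ls always means a wide string in the printf family, while the h-prefixed forms in the table are Microsoft-specific. The function format_debug_line is hypothetical and is not the project's DebugMsg.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <wchar.h>

/* Format a narrow tag and a wide message into a narrow buffer.
   In the printf family, "%s" expects char* and "%ls" expects wchar_t*,
   so the width of each argument is spelled out in the format string
   instead of depending on how the caller was built. */
int format_debug_line(char *out, size_t out_size,
                      const char *tag, const wchar_t *message)
{
    assert(out != NULL && out_size > 0);
    assert(tag != NULL && message != NULL);
    return snprintf(out, out_size, "[%s] %ls", tag, message);
}
```

Called as format_debug_line(buf, sizeof buf, "setup", L"copy failed"), this fills buf with "[setup] copy failed" for plain ASCII text. Characters outside ASCII depend on the current locale, which is its own can of worms.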

Clearly the project was in many ways written in "the right way" to handle the conversion we did -- note especially the mostly consistent use of sizeof() in character buffer lengths, something often missing -- a fact that only came to bite us in a few specific cases that were clearly written later on by other developers.
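The trap with buffer lengths is that sizeof() counts bytes while most string functions count characters. Those are the same number for char and different numbers for wchar_t. The sketch below shows the distinction; COUNT_OF and copy_wide are hypothetical names used for illustration, not code from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Number of elements in a true array (do not pass a pointer). */
#define COUNT_OF(a) (sizeof(a) / sizeof((a)[0]))

/* Copy src into dst, truncating if needed and always terminating.
   dst_count is a count of characters, not bytes.
   Returns the number of characters copied, excluding the terminator. */
size_t copy_wide(wchar_t *dst, size_t dst_count, const wchar_t *src)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL && dst_count > 0);
    while (i + 1 < dst_count && src[i] != L'\0') {
        dst[i] = src[i];
        i++;
    }
    dst[i] = L'\0';
    return i;
}
```

With wchar_t buf[8], passing COUNT_OF(buf) gives the correct limit of 8 characters. Passing sizeof(buf) would claim room for 8 * sizeof(wchar_t) characters and invite an overrun, and that is exactly the kind of bug that only shows up after the switch to Unicode.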

Because of such efforts, it is perhaps better to think of the setup bootstrap EXE project as a fair representative of the type of problems one will hit, if not necessarily the magnitude of those problems.

And what has been "delivered" is an EXE that you may well see in the upcoming release of MSKLC. :-)

Now I'll keep my eyes open, and if I run across another example like this of a project to convert that can be shared this way I'd love to do it again some time. I think it would be especially interesting to do one that turns out to be much harder in terms of the amount of effort, just to help give a good sense of how hard people might find the process, in general.

This post brought to you by ᠹ (U+1839, a.k.a. MONGOLIAN LETTER FA)