Suits you to a _T()

by Michael S. Kaplan, published on 2005/08/10 08:51 -04:00, original URI:

The other day, Jeremy asked me:

I thought that with your wealth of Unicode knowledge, you may be able to answer a few questions for me.

In a C/C++ program, is it necessary to wrap single character conversions in a _T( ) macro?

For instance...

TCHAR tch = _T('S');

The MSVC compiler happily converts the literal 'S' to a double byte '\0''S' for UNICODE builds, so that the following appears to compile fine...

TCHAR tch = 'S';

I'm unclear if the compiler is actually substituting L'S' for 'S', or simply promoting the value of 'S' to a double byte.

Is there any case in which a single-byte character has a Unicode representation that is not a simple double byte promotion?

If you are not dealing with both Unicode and non-Unicode builds of a program, then all of the _T()/TEXT() macro stuff as well as all of the TCHAR stuff is fairly superfluous. As I mentioned a few days ago, the new NLS functions added in Vista are not going to have non-Unicode versions (a trend that started in Server 2003).

To answer the specific question about whether the macro is required (and keeping the last paragraph in mind), I would always suggest using the L prefix on Unicode characters and strings, even though the compiler does not seem to need it for characters. It is definitely still needed any time you specify a string literal, and the consistency seems like a good thing, doesn't it?

For the ASCII range, you will not find a difference between that "double byte promotion" and a Unicode representation. However, for anything single byte that is outside of ASCII but inside of the default system code page, I would go so far as to say that the "promotion" would usually be wrong, and possibly also subject to different interpretations depending on what the default system code page happens to be. If you are going to write UNICODE/_UNICODE applications, then it seems best to keep them using Unicode everywhere....


This post brought to you by "S" (U+0053, a.k.a. LATIN CAPITAL LETTER S)


# Richard on 10 Aug 2005 10:01 AM:

In C++ 's' is a char, in C it is an int. L's' in C++ is a wchar_t (not sure about C).

However, in C wchar_t is a typedef, while in C++ it is a distinct type. In Visual C++, though (unless overridden with a command line/IDE option), it is also a typedef; and char->int is an allowed conversion.

I wonder if
wchar_t x = 'S'
works OK with wchar_t as a distinct type?

# Dean Harding on 10 Aug 2005 7:20 PM:

I don't have my copy of the beta here, but I'm pretty sure wchar_t is a distinct type by default in VS2005. It was a typedef by default in VS2003 (though it's easy enough to change - usually the first thing I do when starting a new C++ project is set all those 'Treat wchar_t as a Built-In Type', 'Force Conformance In For Loop Scope', etc options to True).

And yeah, I agree with just doing away with all that TCHAR, _T() stuff. It's much simpler to just use wchar_t and L'...' directly, these days. The macros were good when you were doing separate builds for Windows 9X and NT, and didn't want/need MSLU :)

# Scott on 11 Aug 2005 12:29 AM:

I figure that the _T for strings and characters is useful for whenever Unicode 12.7 comes out and we're all using 5 byte characters. Less code to change - assuming the macros are updated.

# Alexey Logachyov on 12 Aug 2005 11:24 AM:

There's one more important thing. If you use literals from the upper half of the ASCII table or multibyte characters, be sure to use #pragma setlocale to make the compiler convert strings to Unicode using the correct code page.

Just today a guy from the office opposite mine had a problem. His SQL query did not work correctly. The query contained Russian letters and it did not fetch any results.

It turned out that Visual Studio was set up to use Russian fonts but the system locale was set to English. The compiler converted strings using the incorrect code page and the Russian string became garbage.

# Michael S. Kaplan on 12 Aug 2005 5:50 PM:

Considering how often people copy/paste code snippets, it might be safer to just use code points (well, code units) in those cases....

# Alexey Logachyov on 13 Aug 2005 4:57 AM:

You mean writing it like this?

CHAR StrA[] = "\xF1\xEB\xEE\xE2\xEE";
WCHAR StrW[] = L"\x0441\x043B\x043E\x0432\x043E";

This is a pain for Russian speaking people (like I am). The following snippet looks so much more natural to me.

#pragma setlocale("rus")
TCHAR Str[] = _T("слово");

# Mike Dunn on 15 Aug 2005 11:15 PM:

WCHAR c = 'S';

works because char->wchar_t is a widening conversion akin to BYTE->WORD or short->long. If you did this:

wcout << 'S';

it won't promote anything.
