Sometimes you drop the BOM, and sometimes the BOM drops you!

by Michael S. Kaplan, published on 2007/06/26 01:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/25/3537065.aspx

Back when I posted "Your VC++ files don't support Unicode identifiers? Drop a BOM on them!", I promised I'd say more about how Microsoft's Visual C++ was deciding what parts of Unicode could be used as valid identifiers.

And if you'll recall, I talked about how

...one should not assume that the full support of Unicode Standard Annex #31 (Identifier and Pattern Syntax) is being implemented, but hopefully some not entirely incompatible subset is what would turn out to be available....

Well, consider my hopes dashed at this point....

I'll include the code comment above the validation function, though it is my heartfelt desire to see this code ripped out with extreme prejudice in some future version. :-)

//
// Validate Unicode Identifier
//
// Every symbol of an identifer can belong to one of the following two
// disjoint sets of characters:
//
// 1. Characters from the basic character set. We don't have to enforce it here,
// and we also know that such characters cannot be entered as UCNs;
//
// 2. Characters greater than 0x80 encoded either directly or as UCNs. The exact ranges
// are specified by Annex E in C++ Standard or by a superset of it in clr. It is not
// implemented yet, but most likely we will decide to go with the clr range.
//

Of course, the code currently does not do anything for #2 (as the comment indicates). Instead, for #2 it just relies heavily on the CRT iswspace function, which in turn relies on the NLS GetStringTypeW function looking for those C1_SPACE characters. That means the compilation behavior can be OS-dependent, changing as new Unicode versions are supported.

It also means there are a lot of very silly characters that could be chosen to be identifiers right now (including lots of undefined ones and lots of weird symbols).

I mean, imagine code something like this:

if (≤ < ≲) {
    ≥();
}

and so on....
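
To make that concrete, here is a minimal sketch of what such a permissive check amounts to. It is not the actual compiler code: the function names are made up for illustration, and it assumes only what is described above, namely that a character outside the basic (ASCII) range is accepted unless iswspace calls it whitespace.

```cpp
#include <cstddef>
#include <cwctype>
#include <string>

// Hypothetical sketch only; NOT the actual Visual C++ source.
// Assumption: characters outside the basic (ASCII) range are accepted
// in identifiers unless the C runtime classifies them as whitespace.
bool is_permissive_extended_char(wchar_t ch) {
    if (ch < 0x80) {
        return false;  // basic character set: assumed to be checked elsewhere
    }
    return std::iswspace(static_cast<std::wint_t>(ch)) == 0;
}

// Counts how many characters of a candidate name pass the permissive check.
std::size_t count_permissive_chars(const std::wstring& name) {
    std::size_t count = 0;
    for (wchar_t ch : name) {
        if (is_permissive_extended_char(ch)) {
            ++count;
        }
    }
    return count;
}
```

Under a check like this, ≤ (U+2264) and ≲ (U+2272) both pass, simply because neither one is whitespace. Whether any given character counts as whitespace is up to the runtime and the platform underneath it, which is where the OS-dependence comes from.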

Like I said, future versions should really be a lot more reasonable in this regard. And they will if I have any say in the matter at all.

So please don't get to enjoying this too much....

This post brought to you by ≲ (U+2272, a.k.a. LESS-THAN OR EQUIVALENT TO)

Dean Harding on 26 Jun 2007 2:19 AM:

I guess it also means it suffers from the same problem that C# does, whereby you could have 7 different functions called "àáâãäå" differing only by normalization :-)

Michael S. Kaplan on 26 Jun 2007 5:19 AM:

Well, that is technically a feature (for obfuscation engines), one that is possibly expanded in C++. :-)

Erzengel on 27 Jun 2007 3:24 PM:

#define ≤ <=

#define ≥ >=

Now we just need a way to write those easily. :-)

Jan Kucera on 29 Jun 2007 5:04 AM:

Erzengel: Autocorrect? :)

But I too would welcome being able to use ≤, ≥, ≠, ∞, ... in code syntax. How cool would it be to define custom operators resulting in

if (myElement ∉ mySet) ... I always wished that!

Though, if (myVar = ±2) might be quite challenging to get working :)

Michael, so you don't recommend naming functions ∛ or similar?

Michael S. Kaplan on 29 Jun 2007 8:56 AM:

Nope. I only recommend using characters in names that follow the more sensible rules of the UAX, in anticipation of conformance in the future. :-)
