'universal-character-name encountered in source'

by Michael S. Kaplan, published on 2006/04/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/21/580316.aspx


The compiler warning in question is C4428: universal-character-name encountered in source. The description of it in MSDN has a few problems when it comes to understandability:

Visual C++ Concepts: Building a C/C++ Program
Compiler Warning (level 4) C4428  
Error Message
universal-character-name encountered in source
The compiler issues C4428 when it detects at least one universal character name in a source code file. To fix this warning, use the Unicode equivalent of the universal character name.
This warning is only issued once per compiland.
The following sample generates C4428:

  // C4428.cpp
  // compile with: /W4 /c
  int \u20ac = 0;  // C4428 universal character name
  /// The following line is the Unicode equivalent of \u20ac:
  // int  = 0;

Now we'll start with the code sample that won't compile -- obviously they meant to put "// int € = 0;" but there is no character there -- it is just a space.

And of course there is the unclear error message -- 'universal-character-name encountered in source' -- as if including a Unicode UTF-16 code unit in source has anything whatsoever to do with character names in either Unicode or ISO 10646.

Moving deeper, a currency symbol like the Euro is a symbol (just like the space, as I pointed out here) so if using a random Unicode character with a Unicode General Category of Sc (Symbol, Currency) is a valid variable name then that is just poor compiler design (and not conformant with the current design for identifiers being considered by the C/C++ committees). Plus if you look at the current rules for identifiers in Microsoft C/C++:

identifier:
nondigitidentifier nondigitidentifier digit

nondigit: one of
_ a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

digit: one of
0 1 2 3 4 5 6 7 8 9

So unless I am missing something you can't have U+20ac as a variable name anyway right now in C or C++....
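As a rough illustration of what that grammar admits, here is a minimal sketch that reads the quoted rules with their intended line breaks (identifier: nondigit, identifier nondigit, identifier digit). The function names are hypothetical, and the check models only the documented grammar, not what any particular compiler actually accepts.

```cpp
#include <cassert>
#include <string>

// nondigit: '_' or an ASCII letter, per the documented grammar.
bool is_nondigit(char c) {
    return c == '_' || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

// digit: one of 0 through 9.
bool is_digit(char c) {
    return c >= '0' && c <= '9';
}

// identifier: starts with a nondigit, continues with nondigits or digits.
// (Hypothetical helper; keywords are not considered here.)
bool is_ascii_identifier(const std::string& s) {
    if (s.empty() || !is_nondigit(s[0])) {
        return false;
    }
    for (char c : s) {
        if (!is_nondigit(c) && !is_digit(c)) {
            return false;
        }
    }
    return true;
}
```

Under these rules neither the UTF-8 bytes of the euro sign nor the six ASCII characters of the \u20ac spelling qualify, since the backslash is not a nondigit.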

Ok, moving past the terrible example (broken on the levels of both saying the wrong thing and saying it poorly), there is having this warning at all.

Why is the \u syntax worthy of a level 4 warning? It is actually something perfectly acceptable to do, and unless you are a native speaker of a language it is a hell of a lot more useful than putting random characters into source code.
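A minimal sketch of that more useful case, with the \u syntax inside a string literal rather than an identifier. The names kEuro and euro_label are made up for illustration, and the sketch assumes a C++11 or later compiler so that char16_t literals are available.

```cpp
#include <cassert>
#include <string>

// U+20AC EURO SIGN spelled with the \u escape. The source file stays pure
// ASCII, so it does not depend on the editor's encoding or code page.
const std::u16string kEuro = u"\u20ac";

// Hypothetical helper: prefixes an amount with the euro sign and a space.
std::u16string euro_label(const std::u16string& amount) {
    assert(!amount.empty());
    return kEuro + u" " + amount;
}
```

The literal holds a single UTF-16 code unit, 0x20AC, no matter how the file itself was saved.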

I have a colleague who made a huge push to get our piece of the Windows project compiling with /W4, and I'd feel like that kind of push was a lot more useful, 'symbolically', if the compiler stuck with warnings that had a real purpose.

This one is only slightly more useful than a warning for identifiers with an odd number of characters in them!

Anyway, do not feel badly about suppressing this warning -- it is begging for suppression, truly.
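One way to do that, sketched under the assumption that the file may also be built with other compilers (hence the _MSC_VER guard). The constant and the check function are hypothetical names, and whether C4428 fires for a character literal like this one, as opposed to an identifier, is not something the documentation quoted above settles.

```cpp
#include <cassert>

// Turn off C4428 for this translation unit, on Microsoft's compiler only.
// Other compilers never see the pragma, so they will not complain about it.
#ifdef _MSC_VER
#pragma warning(disable : 4428) // universal-character-name encountered in source
#endif

// U+20AC EURO SIGN, written with a universal character name.
const char16_t kEuroSign = u'\u20ac';

// Hypothetical sanity check that the escape produced the expected code unit.
inline void check_euro_sign() {
    assert(kEuroSign == 0x20AC);
}
```

The same effect should be available project-wide by passing /wd4428 on the compiler command line instead of editing each source file.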

 

This post brought to you by "€" (U+20ac, a.k.a. EURO SIGN)


# Andrew West on 21 Apr 2006 5:35 AM:

So if I try to compile

int € = 0;

in a C++ program with Visual Studio 7.1 I get

error C3209: ' ' : Unicode identifiers are not yet supported

and the documentation for this error implies that you should use UCNs. And sure enough the following line compiles OK.

int \u20ac = 0;

However I do not get the C4428 warning (even with Warning Level 4 set), and C4428 is not even in my local copy of MSDN, so I guess it must be a new (and really stupid) warning.

But one thing that I did notice is that the '€' character is replaced by '?' in the following (expected) warning:

warning C4189: '?' : local variable is initialized but not referenced

I think that it is pretty lame that the build log cannot cope with Unicode (the HTML version explicitly uses Windows-1252), when all the other Visual Studio components support Unicode. Do you know if that will be fixed soon?

# Michael S. Kaplan on 21 Apr 2006 8:58 AM:

The error is new for 8.0, so I suspect the behavior will change slightly with the upgrade....

Of course, since identifiers still cannot be outside of that ASCII range, the C3209 will still be there. And the build log being fixed may have to wait for the point where the identifiers are actually supported, since updating the log without that functionality is pretty limited?

# Maurits [MSFT] on 21 Apr 2006 11:31 AM:

> identifier: nondigitidentifier nondigitidentifier digit

This is, obviously, horribly wrong.

It should be:

identifier:
.. identifier-nondigit
.. identifier identifier-nondigit
.. identifier digit

# Maurits [MSFT] on 21 Apr 2006 11:34 AM:

Looks like missing linebreaks are the culprit.
Put linebreaks where the *s are and it's right:

identifier:*nondigit*identifier nondigit*identifier digit

# Maurits [MSFT] on 21 Apr 2006 11:58 AM:

On the "symbol" note...

int _ = 0;

also compiles, and fits the syntax.

_ isn't a "symbol", but it is "punctuation"
http://www.fileformat.info/info/unicode/category/Pc/list.htm

Luckily, it's a special "connector" kind of punctuation (unlike, say, ".", which is just "Punctuation, Other")

Even more luckily, all other "Punctuation, Connector" characters look suitable for use in identifiers...

So perhaps identifier-digit and identifier-nondigit can be determined purely in terms of Unicode categories!

(With identifier-nondigit including [Pc] purely for the benefit of "_" and co.)

# Maurits [MSFT] on 21 Apr 2006 12:12 PM:

This page gets the line breaks right, thankfully.

http://msdn2.microsoft.com/en-us/library/e7f8y25b.aspx

# Mihai on 21 Apr 2006 12:46 PM:

"So unless I am missing something you can't have U+20ac as a variable name anyway right now in C or C++...."

You might be right about MS C++, but not about C++ standard:
  identifier::= nondigit #( nondigit | digit).
  nondigit::= universal_character_name | ASCII.letter | ASCII.underscore.
  universal_character_name::= backslash "u" hex_quad | backslash "U" hex_quad^2.

This means \u20ac or \U000020ac are fine.

I was able to find something here: http://www.csci.csusb.edu/dick/c++std/syntax.html
(for some reason the standard itself is not available online, and I cannot post Stroustrup's hardcover edition of "The C++ Programming Language" :-)

# Maurits [MSFT] on 21 Apr 2006 1:54 PM:

> for some reason the standard itself is not available online

That's because the ISO likes to charge for their standards.

The latest C++ standard is CHF 352,00

http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=38110&ICS1=35&ICS2=60&ICS3=

# Michael S. Kaplan on 21 Apr 2006 2:06 PM:

I was indeed referring to MS-specific C/C++ (and the whole post is about an MS-specific compile error and its troubled documentation)....

# Maurits [MSFT] on 21 Apr 2006 2:15 PM:

> ISO likes to charge for their standards

Drafts, on the other hand, are usually made publicly available for the purposes of review.

For example, C0x is available here:
http://c0x.coding-guidelines.com/

See in particular
http://c0x.coding-guidelines.com/6.4.2.1.html
http://c0x.coding-guidelines.com/6.4.3.html

including a list of disallowed code points

# Michael Dunn_ on 21 Apr 2006 5:16 PM:

On a sort-of-related topic, try this in VC:

int $a = 1;

I got a laugh out of that a couple years ago when I accidentally discovered that VC allows $ in variable names. Make your C look like Perl! (I found that using VC6, not sure if later versions accept it.)

# Maurits [MSFT] on 21 Apr 2006 5:29 PM:

Ah, so that's why 6.4.3 allows $, @, and ` in \u-ish code points

I wonder how far this can be pushed...

/* perl.h */

#define my int
#define use using
#define elsif else if
#define until(x) while(!(x))
...

# Mike Dimmick on 21 Apr 2006 7:02 PM:

Maurits: if you want the C++ standard you can buy it from ANSI, in PDF format, for $30. It's gone up a bit, it was $18 when I bought my copy. You don't need to be based in the US.

http://webstore.ansi.org/ansidocstore/product.asp?sku=INCITS%2FISO%2FIEC+14882%2D2003

# Maurits [MSFT] on 21 Apr 2006 7:51 PM:

Sweet... that's about 90% off the ISO price.

# josh on 22 Apr 2006 12:41 AM:

Chars like $ and @ are often used in generated code, particularly for C where you can't just namespace things off.
