And then nearly ten years later they added another one...

by Michael S. Kaplan, published on 2011/01/21 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/01/21/10118575.aspx

This function doesn’t seem to always tell the truth – I get compiler errors trying to use the string below as an identifier, but IsValidIdentifier/CreateEscapedIdentifier seem to have no problem with it. Is there a more reliable method to detect when a string has invalid characters for an identifier?

using System;
using Microsoft.CSharp; // CSharpCodeProvider lives here

CSharpCodeProvider cscp = new CSharpCodeProvider();
string test = "イゟゝだヴゝヺドダぐるフもむ";
Console.WriteLine(cscp.IsValidIdentifier(test)); // True
Console.WriteLine(cscp.CreateEscapedIdentifier(test) == test); // Also true

Ok, try it:

I mean, being told you have a valid identifier and then having it not work out as an identifier can be a little unsettling.

The problematic character (ゟ) is Unicode U+309F, “Hiragana digraph yori”. It’s a “vertical contraction” of (よ) and (り). I suspect it’s a late Unicode addition.

Yori was added in Unicode version 3.2.0, March 2002. The previous letter, KATAKANA LETTER I (U+30A4) was added in Unicode version 1.1.0, June 1993.

Odds are good that the (written in native C++) compiler has baked into it an extremely old version of the Unicode “is this a letter?” algorithm. The (written in C#) CSharpCodeProvider is apparently using a more recent version of that algorithm.

It seems to me that using an almost ten-year-old, out-of-date version of the Unicode algorithm in the native compiler could reasonably be characterized as “a bug”. I’ll mention it to QA.

Cheers,
Eric
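Eric's hypothesis is easy to probe from outside the compiler. The sketch below is not the compiler's or CSharpCodeProvider's actual code; it is a rough cross-check in Python that looks up each character's Unicode general category and compares it against the letter categories (Lu, Ll, Lt, Lm, Lo, Nl) that the C# specification accepts at the start of an identifier. The helper name describe is made up for illustration, and the result depends on whichever Unicode version your Python build ships.

```python
import unicodedata

# General categories the C# specification accepts for a "letter-character"
# (the underscore is handled separately and is ignored in this sketch).
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def describe(text):
    """Return (code point, general category, is_letter) for each character."""
    rows = []
    for ch in text:
        category = unicodedata.category(ch)
        rows.append((f"U+{ord(ch):04X}", category, category in LETTER_CATEGORIES))
    return rows


if __name__ == "__main__":
    test = "イゟゝだヴゝヺドダぐるフもむ"
    for code_point, category, is_letter in describe(test):
        print(code_point, category, "letter" if is_letter else "NOT a letter")
    # The answer is only as current as the tables behind the lookup.
    print("Unicode data version:", unicodedata.unidata_version)
```

With reasonably current tables, every character in the string, including U+309F, should come back as a letter, which matches what CSharpCodeProvider reported. A classifier built on older tables would see U+309F as unassigned.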

I remember when the Yori Digraph was added, and the concern some people had about taking what was widely regarded as a complete set (Hiragana), and adding a new character that would not be in code pages like 932 (Shift-JIS). Those complaints were considered but ultimately rejected, which makes sense given that Unicode has always been a superset of these code pages anyway.

Now there are interesting issues that come up once you start adding version specific knowledge into the world of identifiers, which should make this an interesting one. It will thankfully be made easier by the fact that Unicode never removes characters, though there is no shortage of people who ask for this or that character to be removed because its very existence bothers somebody.

Such changes would break the companies unlike Microsoft that do proper version-specific checking/support of their identifiers, or maybe even one day companies like Microsoft if they more aggressively support such things. :-)
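To make "version-specific knowledge" concrete, here is a small Python sketch (an illustration only, not how any Microsoft product does it). Python happens to ship a frozen copy of the Unicode 3.2.0 tables as unicodedata.ucd_3_2_0 alongside its current tables, so the same "is this a letter?" question can be asked of two versions. The helper name is_letter and the choice of sample characters are mine.

```python
import unicodedata

# Letter categories accepted at the start of a C# identifier.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def is_letter(ch, database=unicodedata):
    """True if `ch` has a letter category in the given Unicode database."""
    return database.category(ch) in LETTER_CATEGORIES


# Frozen Unicode 3.2.0 tables that ship with Python.
old = unicodedata.ucd_3_2_0

samples = {
    "HIRAGANA DIGRAPH YORI": "\u309F",         # added in Unicode 3.2
    "LATIN CAPITAL LETTER SHARP S": "\u1E9E",  # added in Unicode 5.1
}

if __name__ == "__main__":
    for name, ch in samples.items():
        print(
            f"U+{ord(ch):04X} {name}: "
            f"{old.unidata_version} -> {is_letter(ch, old)}, "
            f"{unicodedata.unidata_version} -> {is_letter(ch)}"
        )
```

A character that a given version does not know about comes back as unassigned (Cn) rather than as a letter. That is the same kind of disagreement as between the compiler and CSharpCodeProvider. Because Unicode does not remove characters, a string accepted under the older tables should not stop being assigned under newer ones.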

In a future blog I'll dig into the wacky world of identifiers and show that sometimes the only remedy for a Mark is a Ken....

Ooo, I look forward to that!

But "proper version-specific checking"? I don't see any reason to maintain a wall of separation, such that if version 3.2 (say) of your compiler doesn't understand Unicode 6 identifiers, it can *never* understand Unicode 6 even if the surrounding system libraries are updated to do so. If it wants to keep some codepoints valid for backward compatibility, that's fine, although Unicode already does that for the most part.

This is completely tangential, but I can't help noticing how the same handful of individuals often appear as the "question answerers" in blogs that reference a mailing-list or newsgroup discussion. You, Eric, and Raymond seem to be the most prolific, and I was wondering if you have any ideas on why that might be so. My hypotheses are as follows, but I don't have enough evidence to have any confidence in these (though my intuition is that it's the first one):

- Bias on my part: I only remember anecdotes where names are named, and when it's not a blogger the name is omitted so I don't keep track of it, making bloggers appear to be more active in discussions than everyone else. This could be further enhanced by bloggers' names being hyperlinks, causing them to appear in a different color and therefore be more memorable.

- Blogger Affinity: Bloggers like to blog about other bloggers, and then "reciprocal blogging" creates a feedback loop.

- Common Interest: The bloggers that I find interesting tend to have overlapping areas of interest themselves, so they tend to participate in the same discussions, making them more likely source material for each other. (forgive the pseudo-predicate here)

- Superhero Hypothesis: A small handful of employees, who also happen to have blogs, answer a wildly disproportionate amount of the questions.