by Michael S. Kaplan, published on 2011/01/21 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/01/21/10118575.aspx
Ryan had a great question yesterday morning:
This function doesn’t seem to always tell the truth – I get compiler errors trying to use the string below as an identifier, but IsValidIdentifier/CreateEscapedIdentifier seem to have no problem with it. Is there a more reliable method to detect when a string has invalid characters for an identifier?
using System;
using Microsoft.CSharp; // namespace that contains CSharpCodeProvider
CSharpCodeProvider cscp = new CSharpCodeProvider();
string test = "イゟゝだヴゝヺドダぐるフもむ";
Console.WriteLine(cscp.IsValidIdentifier(test)); // True
Console.WriteLine(cscp.CreateEscapedIdentifier(test) == test); // Also true
Ok, try it:
Hmm, that's weird.
I mean, being told you have a valid identifier and then having it not work out as an identifier can be a little unsettling.
Wolf was able to point out some details on the specific character:
The problematic character (ゟ) is Unicode U+309F, “Hiragana digraph yori”. It’s a “vertical contraction” of (よ) and (り). I suspect it’s a late Unicode addition.
And Eric Lippert then gave the full answer soon after:
Yori was added in Unicode version 3.2.0, March 2002. The previous letter, KATAKANA LETTER I (U+30A4), was added in Unicode version 1.1.0, June 1993.
Odds are good that the (written in native C++) compiler has baked into it an extremely old version of the Unicode “is this a letter?” algorithm. The (written in C#) CSharpCodeProvider is apparently using a more recent version of that algorithm.
It seems to me the fact that we may be using an almost ten-year-old out of date version of the Unicode algorithm in the native compiler could reasonably be characterized as “a bug”. I’ll mention it to QA.
Cheers,
Eric
He got that in one!
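To see how a given runtime classifies each character of a string like Ryan's, something along these lines can help. It is only a minimal sketch: the class and method names are made up for illustration, the category lists are the ones the C# language specification gives for identifier characters, and it ignores '@' prefixes, keywords, Unicode escapes and surrogate pairs. The answers come from whatever Unicode data the running .NET version ships, which is exactly why two tools can disagree.

```csharp
using System;
using System.Globalization;

static class IdentifierCheck
{
    // Categories the C# specification allows for a "letter-character":
    // Lu, Ll, Lt, Lm, Lo, Nl. ('_' is also allowed at the start; handled in Main.)
    public static bool IsLetterCategory(UnicodeCategory cat)
    {
        return cat == UnicodeCategory.UppercaseLetter
            || cat == UnicodeCategory.LowercaseLetter
            || cat == UnicodeCategory.TitlecaseLetter
            || cat == UnicodeCategory.ModifierLetter
            || cat == UnicodeCategory.OtherLetter
            || cat == UnicodeCategory.LetterNumber;
    }

    // Categories allowed after the first character: letters plus Mn, Mc, Nd, Pc, Cf.
    public static bool IsPartCategory(UnicodeCategory cat)
    {
        return IsLetterCategory(cat)
            || cat == UnicodeCategory.NonSpacingMark
            || cat == UnicodeCategory.SpacingCombiningMark
            || cat == UnicodeCategory.DecimalDigitNumber
            || cat == UnicodeCategory.ConnectorPunctuation
            || cat == UnicodeCategory.Format;
    }

    static void Main()
    {
        string test = "イゟゝだヴゝヺドダぐるフもむ"; // every character is in the BMP

        for (int i = 0; i < test.Length; i++)
        {
            char c = test[i];
            // The result depends on the Unicode data built into this runtime.
            UnicodeCategory cat = CharUnicodeInfo.GetUnicodeCategory(c);
            bool ok = (i == 0) ? (c == '_' || IsLetterCategory(cat))
                               : IsPartCategory(cat);
            Console.WriteLine("U+{0:X4} {1} {2}", (int)c, cat, ok ? "ok" : "rejected");
        }
    }
}
```

A character that this listing accepts but the compiler refuses is a good candidate for the kind of version mismatch Eric describes.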
I remember when the Yori Digraph was added, and the concern some people had about taking what was widely regarded as a complete set (Hiragana), and adding a new character that would not be in code pages like 932 (Shift-JIS). Those complaints were considered but ultimately rejected, which makes sense given that Unicode has always been a superset of these code pages anyway.
Now there are interesting issues that come up once you start adding version specific knowledge into the world of identifiers, which should make this an interesting one. It will thankfully be made easier by the fact that Unicode never removes characters, though there is no shortage of people who ask for this or that character to be removed because its very existence bothers somebody.
Such removals would break the companies that, unlike Microsoft, do proper version-specific checking/support of their identifiers, or maybe even one day companies like Microsoft, if they more aggressively support such things. :-)
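As a rough illustration of what version-specific checking could look like, here is a small sketch. Everything in it is hypothetical: the class, the method and the tiny table are invented for this example, and the table holds only the two characters whose versions are quoted above. A real implementation would presumably generate its table from the Unicode Character Database (the DerivedAge.txt file records the version in which each code point was assigned).

```csharp
using System;
using System.Collections.Generic;

static class VersionedIdentifiers
{
    // Hypothetical table: the Unicode version in which each character first appeared.
    // Only the two characters discussed in this post are listed.
    static readonly Dictionary<char, Version> AddedIn = new Dictionary<char, Version>
    {
        { '\u30A4', new Version(1, 1, 0) }, // KATAKANA LETTER I
        { '\u309F', new Version(3, 2, 0) }  // HIRAGANA DIGRAPH YORI
    };

    // True when every character is in the table and was already assigned
    // in the Unicode version that the (hypothetical) tool supports.
    public static bool IsKnownTo(Version supported, string text)
    {
        foreach (char c in text)
        {
            Version added;
            if (!AddedIn.TryGetValue(c, out added) || added > supported)
                return false;
        }
        return true;
    }

    static void Main()
    {
        Console.WriteLine(IsKnownTo(new Version(3, 0, 0), "\u30A4"));       // True
        Console.WriteLine(IsKnownTo(new Version(3, 0, 0), "\u30A4\u309F")); // False
        Console.WriteLine(IsKnownTo(new Version(3, 2, 0), "\u30A4\u309F")); // True
    }
}
```

Because characters are never removed, a string accepted under one version stays accepted under every later one, which is what keeps a check like this monotonic.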
In a future blog I'll dig into the wacky world of identifiers and show that sometimes the only remedy for a Mark is a Ken....
John Cowan on 21 Jan 2011 8:10 AM:
Ooo, I look forward to that!
But "proper version-specific checking"? I don't see any reason to maintain a wall of separation, such that if version 3.2 (say) of your compiler doesn't understand Unicode 6 identifiers, it can *never* understand Unicode 6 even if the surrounding system libraries are updated to do so. If it wants to keep some codepoints valid for backward compatibility, that's fine, although Unicode already does that for the most part.
Michael S. Kaplan on 21 Jan 2011 8:47 AM:
My next blog is going to get into the 'for the most part' piece of what you said. :-)
Michael S. Kaplan on 21 Jan 2011 11:13 AM:
When I say next blog, I don't mean tomorrow, I mean next blog on this issue.
Aaron.E on 21 Jan 2011 1:46 PM:
This is completely tangential, but I can't help noticing how the same handful of individuals often appear as the "question answerers" in blogs that reference a mailing-list or newsgroup discussion. You, Eric, and Raymond seem to be the most prolific, and I was wondering if you have any ideas on why that might be so. My hypotheses are as follows, but I don't have enough evidence to have any confidence in these (though my intuition is that it's the first one):
- Bias on my part: I only remember anecdotes where names are named, and when it's not a blogger the name is omitted so I don't keep track of it, making bloggers appear to be more active in discussions than everyone else. This could be further enhanced by bloggers' names being hyperlinks, causing them to appear in a different color and therefore be more memorable.
- Blogger Affinity: Bloggers like to blog about other bloggers, and then "reciprocal blogging" creates a feedback loop.
- Common Interest: The bloggers that I find interesting tend to have overlapping areas of interest themselves, so they tend to participate in the same discussions, making them more likely source material for each other. (forgive the pseudo-predicate here)
- Superhero Hypothesis: A small handful of employees, who also happen to have blogs, answer a wildly disproportionate amount of the questions.
Michael S. Kaplan on 21 Jan 2011 1:58 PM:
Probably the 4th point mostly, though I suppose the 3rd point a little too, and maybe the 1st point (though that's your biases so you'd have to answer that!). I only fully name people when there is some link (otherwise I use first name or nothing)....
Michael S. Kaplan on 21 Jan 2011 2:06 PM:
But there is that small group of people who tend to be right when they say things, and they are often more likely to be quoted than others. It's a side effect of knowing the people who know their stuff!
Aaron.E on 21 Jan 2011 2:30 PM:
That's very interesting. I was hoping it was the fourth (because it reinforces my decisions to draw influence from the sources I do), but I wasn't honestly expecting that to be the case. Thanks for your insight.
Michael S. Kaplan on 21 Jan 2011 4:13 PM:
(for the record, I don't consider myself to be a superhero!)
John Cowan on 30 Jan 2011 1:22 PM:
Well, another explanation would be to take the S. in your full name, and then the three of you jointly would be Eric S. Raymond.