by Michael S. Kaplan, published on 2007/06/12 08:35 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2007/06/12/3251648.aspx
In internationalization contexts, one often hears about the notion of dangerous characters.
This is not (as it may sound) about the criminal element, but rather about specific Unicode characters that can cause problems if their consequences are not taken into account.
Here is (for example) what one of the resources provides (I got this from Gwyneth Marshall over in Office):
Language | CodePage | ‘Problem’ characters | Comments |
German | 1252 | ß (U+00DF) | |
French | 1252 | æ (ALT+0230), Æ (ALT+0198) œ (ALT+0156), Œ (ALT+0140) ç (ALT+0231), Ç (ALT+0199) î (ALT+0238), Î (ALT+0206) | The æ chars are not very common so I included a fourth char î. |
Spanish | 1252 | | |
Italian | 1252 | Any 3 of à(À), è(È), é(É), ì(Ì), ò(Ò), ù(Ù) | Some features useful in English are useless for Italian. Capitalizing the letter “i” to “I” is an example of a feature that must be cut from the Italian version of Word. |
Swedish | 1252 | å (ALT+0229) ä (ALT+0228) ö (ALT+0246) | |
Brazilian | 1252 | | |
Dutch | 1252 | Any 3 of accented characters (vowels) | With the Belgian Dutch AZERTY keyboard layout some characters (e.g. some of those entered with the AltGr key) are sometimes impossible to make, especially for accelerator keys. Sub would like some testing of support of this keyboard layout. |
Danish | 1252 | å (ALT+0229) æ (ALT+0230) ø (ALT+0248) | Sometimes a problem with æ getting incorrectly separated to a and e |
Norwegian | 1252 | å (ALT+0229) æ (ALT+0230) ø (ALT+0248) | |
Finnish | 1252 | å (ALT+0229) ä (ALT+0228) ö (ALT+0246) | |
Portuguese | 1252 | Any 3 of á, é, í, ó, ú, à, ê, ô, ã, ç | None of these extended characters tend to cause major problems, but they should not be used as hot keys. |
Czech / Slovak | 1250 | Š (ALT+0138), š (ALT+0154) Ť (ALT+0141), ť (ALT+0157) Ž (ALT+0142), ž (ALT+0158) | Characters within the range 0128 to 0159 are often problematic because developers can assume that this range is non-alphanumeric. |
Polish | 1250 | Ś (ALT+0140), ś (ALT+0156) Ź (ALT+0143), ź (ALT+0159) | These characters are mentioned based on the same reasoning as for Czech. |
Hungarian | 1250 | ő (ALT+0245), Ő (ALT+0213) ű (ALT+0251), Ű (ALT+0219) | These CE characters are specific to Hungarian. |
Slovenian | 1250 | Č (ALT+0200), č (ALT+0232) Ž (ALT+0142), ž (ALT+0158) Š (ALT+0138), š (ALT+0154) | |
Russian | 1251 | я (ALT+0255) Ч (ALT+0215), ч (ALT+0247) Ё (ALT+0168), ё (ALT+0184) р (ALT+0240) | Because it is the last letter in the codepage. Because CP 1252 has multiplication and division in these places. Because these letters are outside the main range of Russian letters. |
Greek | 1253 | Σ (ALT+0211) – capital letter sigma σ (ALT+0243) – small letter sigma ς (ALT+0242) – small letter final sigma Any of Greek accented characters | Both small sigma characters capitalize to the same capital letter. Final sigma only appears at the end of a word. |
Turkish | 1254 | ı (ALT+0253), I (ALT+0073) i (ALT+0105), İ (ALT+0221) ğ (ALT+0240), Ğ (ALT+0208) ş (ALT+0254), Ş (ALT+0222) | Most of these characters are the only ones that are not in CP 1252 but are in 1254. Possible problems with I include setup files starting with I, registry entries that contain an uppercased I, and automatic upper/lower casing in apps. |
Japanese | 932 | 0x5c Characters - ソ十申暴構能 0x5f Characters (DBCS) - 雲契活神点農 0x7b Characters - ボ施倍府本宮 0x7d Characters - マ笠急党図迎 0x7e Characters - ミ円救降冬梅 0x5b Characters - ゼ夕票充端納 0x5d Characters - ゾ従転脳評競 0xe5 Characters - 怜蒟栁ょ溷瑯 瑞 (U+745e) ソ (U+30BD) | 0x5c characters are the most problematic ones. Lead and trail byte are identical. Full-width Katakana or DBCS |
Chinese (Simplified) | 936 | 0x5c Characters - 僜刓嘰塡奬媆孿 DBCS Alphabet - abcdABCD Random DBCS Characters - 偁偄偙 0x5f Characters - 乢猒峗芲 Mixed SBCS and DBCS Characters - 偁A偄E偆I偊O偍U | |
Chinese (Traditional) | 950 | Boundary Characters (First Plane) - 一才中丙禳讒讖籲 Boundary Characters (Second Plane) - 乂氕氶汋纘鼊龤 Mixed SBCS and DBCS - 牷A礜I略U礎E漼O 0x5c Characters (trailing bytes are the path delimiter "\" character) - 尐赨塿槙箤踊 Double-byte alphabet - abcABC 0x5f (DOS reserved char.) - 巢巢巢 0x7c (DOS reserved char.) - 悴矱悴矱 0xe5 (DOS char. used at beginning of file name to indicate that file has been deleted) - 勗脣勗脣 | |
Korean | 949 | Korean DBCS characters - 가나다라똠푱뜡옺갂韓國 DBCS alphabet - abcABCD Mixed SBCS and DBCS - 아A이E우I에O오U Boundary characters (0x8141, 0xFDFE) - 갂갂갂詰詰詰 0xE5 (lead byte) - 夜野女語 0xA1A1 (Double-byte space) - 詰 갂 | |
Thai | 874 | คำ | Two characters to form one character. Two issues to watch for: |
Vietnamese | 1258 | Ấ | Tall character |
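The remark in the Japanese row that the 0x5c characters are the most problematic ones can be illustrated with a short sketch. It assumes Python's `cp932` codec as a stand-in for Windows code page 932, and `naive_split` and `dbcs_aware_split` are hypothetical helpers invented for this illustration, not anything from the original resource. The point is only that a byte-oriented parser sees the trail byte of ソ as a path delimiter.

```python
# Sketch only: naive_split and dbcs_aware_split are hypothetical helpers.

def naive_split(raw_path):
    """Split an encoded path on every 0x5C byte, ignoring DBCS lead bytes."""
    return raw_path.split(b"\\")


def dbcs_aware_split(raw_path, codepage="cp932"):
    """Decode first, so a 0x5C trail byte stays inside its character."""
    return raw_path.decode(codepage).split("\\")


katakana_so = "\u30bd"  # ソ, listed in the table under "0x5c Characters"
encoded_so = katakana_so.encode("cp932")  # lead byte 0x83, trail byte 0x5C

raw = ("C:\\" + katakana_so + "\\file.txt").encode("cp932")

naive_parts = naive_split(raw)       # the character is torn in two
aware_parts = dbcs_aware_split(raw)  # three components, as intended

print(encoded_so.hex())
print(naive_parts)
print(aware_parts)
```

The naive version produces a stray lead byte and an empty component, which is the kind of breakage the table is warning testers to look for.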
The entry that I found most interesting for the purposes of today's post is the one on Greek:
Greek | 1253 | Σ (ALT+0211) – capital letter sigma σ (ALT+0243) – small letter sigma ς (ALT+0242) – small letter final sigma Any of Greek accented characters | Both small sigma characters capitalize to the same capital letter. Final sigma only appears at the end of a word. |
Now this is something I have talked about before, in the following posts:
I ended up having a few conversations with people about what specific circumstances would make these characters dangerous, especially in light of the information in the above posts.
The answer I got was fascinating, and it is something I have run across many times in code in the past....
For some reason, many developers prefer to handle case-insensitive comparisons with the same "change case, then do a binary comparison" methodology that is used in OrdinalIgnoreCase and NTFS-style comparisons. But they often roll their own code here that lowercases rather than uppercases (or they use the CRT functions that lowercase rather than uppercase).
So what happens if one is trying to validate file paths and one converts to lowercase and then does a binary comparison? One gains an extra character (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA): lowercasing keeps σ and ς distinct, while uppercasing maps both of them to Σ. Any file path validation done this way will not match the actual file system.
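A minimal sketch of that mismatch follows. It uses Python's built-in `str.upper()` and `str.lower()` as stand-ins for the uppercasing done by the file system and the lowercasing done by hand-rolled or CRT-based code. That substitution is an assumption made for illustration, since the actual Windows and CRT casing tables are not the ones Python uses. The function names and file names are also hypothetical.

```python
# Sketch only: Python's casing stands in for the Windows/CRT casing tables.

SMALL_SIGMA = "\u03c3"  # σ GREEK SMALL LETTER SIGMA
FINAL_SIGMA = "\u03c2"  # ς GREEK SMALL LETTER FINAL SIGMA


def equal_by_uppercasing(a, b):
    """Uppercase both strings, then compare code points."""
    return a.upper() == b.upper()


def equal_by_lowercasing(a, b):
    """Lowercase both strings, then compare code points."""
    return a.lower() == b.lower()


# Two hypothetical file names that differ only in which small sigma they use.
name_on_disk = "\u03bf\u03b4\u03bf" + FINAL_SIGMA + ".txt"   # οδος.txt
name_to_check = "\u03bf\u03b4\u03bf" + SMALL_SIGMA + ".txt"  # οδοσ.txt

same_when_uppercased = equal_by_uppercasing(name_on_disk, name_to_check)
same_when_lowercased = equal_by_lowercasing(name_on_disk, name_to_check)

print(same_when_uppercased)  # True: both sigmas uppercase to Σ
print(same_when_lowercased)  # False: σ and ς stay distinct
```

A validator built on the lowercasing comparison would treat these as two different names, while an uppercasing file system would treat them as the same file.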
Now I do agree with the decision to case the Greek script as happens now in Microsoft products, for the reasons discussed in The last word on the FINAL SIGMA. But it is hard to get away from the fact that many developers run into problems here, either because they are doing the wrong thing (in which case they are to blame) or because the CRT is doing the wrong thing (in which case one can blame the forces in the universe that conspire to do something in international standards that is not done by Microsoft).
I think I'll take a much wider view and perhaps blame the original decision in so much of Windows to support case insensitivity by uppercasing. Why on earth didn't they lowercase here? It would have made everything much easier, and then U+03c2 (GREEK SMALL LETTER FINAL SIGMA) wouldn't have to be a dangerous character....
Probably too late to do anything about it (though it is tempting to try to change Windows to lowercase for its case-insensitive binary comparisons and see what breaks!). We'll just have to live with the dangerous nature of this character.
Or maybe encode a GREEK CAPITAL LETTER FINAL SIGMA in Unicode; the fact that no such character exists hasn't stopped us in the past, so why let it stop us now? :-)
This post brought to you by ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)
# Abeywickrama on Thursday, June 14, 2007 1:27 AM:
Michael, I was in Turkey and could not log in to my Google account from the hotel PC. Later I found the reason was that my username had the letter 'i' in it. On the Turkish keyboard the letter i [which is generally between u & o] represents a different 'i' than the one we use. But I found the letter i in a different place.