The difference between 'Dangerous Characters' and 'Dangerous Minds' is the lack of Michelle Pfeiffer

by Michael S. Kaplan, published on 2007/06/12 08:35 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2007/06/12/3251648.aspx

In internationalization contexts, one often hears about the notion of dangerous characters.

This is not (as it may sound) about the criminal element, but rather about specific Unicode characters that can cause problems if their consequences are not taken into account.

Here is (for example) what one of the resources provides (I got this from Gwyneth Marshall over in Office):

Language	CodePage	‘Problem’ characters	Comments
German	1252	ß (U+00DF)
French	1252	æ (ALT+0230), Æ (ALT+0198) œ (ALT+0156), Œ (ALT+0140) ç (ALT+0231), Ç (ALT+0199) î (ALT+0238), Î (ALT+0206)	The æ chars are not very common so I included a fourth char î.
Spanish	1252
Italian	1252	Any 3 of à(À), è(È), é(É), ì(Ì), ò(Ò), ù(Ù)	Some features useful in English are useless for Italian. Capitalize the letter “i” in “I” is an example of a feature that must be cut off the Italian version of Word.
Swedish	1252	å (ALT+0229) ä (ALT+0228) ö (ALT+0246)
Brazilian	1252
Dutch	1252	Any 3 of accented characters (vowels)	With the Belgian Dutch AZERTY keyboard layout some characters (e.g. some of those entered with the AltGr key) are sometimes impossible to make, especially for accelerator keys. Sub would like some testing of support of this keyboard layout.
Danish	1252	å (ALT+0229) æ (ALT+0230) ø (ALT+0248)	Sometimes a problem with æ getting incorrectly separated into a and e
Norwegian	1252	å (ALT+0229) æ (ALT+0230) ø (ALT+0248)
Finnish	1252	å (ALT+0229) ä (ALT+0228) ö (ALT+0246)
Portuguese	1252	Any 3 of á, é, í, ó, ú, à, ê, ô, ã, ç	None of these extended characters tend to cause major problems, but they should not be used as hot keys.
Czech / Slovak	1250	Š (ALT+0138), š (ALT+0154) Ť (ALT+0141), ť (ALT+0157) Ž (ALT+0142), ž (ALT+0158)	Characters within the range 0128 to 0159 are often problematic because developers can assume that this range is non-alphanumeric.
Polish	1250	Ś (ALT+0140), ś (ALT+0156) Ź (ALT+0143), ź (ALT+0159)	These characters are mentioned based on the same reasoning as for Czech.
Hungarian	1250	ő (ALT+0245), Ő (ALT+0213) ű (ALT+0251), Ű (ALT+0219)	These CE characters are specific to Hungarian.
Slovenian	1250	Č (ALT+0200), č (ALT+0232) Ž (ALT+0142), ž (ALT+0158) Š (ALT+0138), š (ALT+0154)
Russian	1251	я (ALT+0255) Ч (ALT+0215), ч (ALT+0247) Ё (ALT+0168), ё (ALT+0184) р (ALT+0240)	Because it is the last letter in the codepage. Because CP 1252 has multiplication and division in these places. Because these letters are outside the main range of Russian letters.
Greek	1253	Σ (ALT+0211) – capital letter sigma σ (ALT+0243) – small letter sigma ς (ALT+0242) – small letter final sigma Any of Greek accented characters	Both small sigma characters capitalize to the same capital letter. Final sigma only appears at the end of a word.
Turkish	1254	ı (ALT+0253), I (ALT+0073) i (ALT+0105), İ (ALT+0221) ğ (ALT+0240), Ğ (ALT+0208) ş (ALT+0254), Ş (ALT+0222)	Most of these characters are the only ones that are not in CP 1252 but are in 1254. Possible problems with I include setup files starting with I, any registry entries that contain uppercased I, and auto upper/lower casing in apps.
Japanese	932	0x5c Characters - ソ十申暴構能 0x5f Characters (DBCS) - 雲契活神点農 0x7b Characters - ボ施倍府本宮 0x7d Characters - マ笠急党図迎 0x7e Characters - ミ円救降冬梅 0x5b Characters - ゼ夕票充端納 0x5d Characters - ゾ従転脳評競 0xe5 Characters - 怜蒟栁ょ溷瑯瑞 (U+745e) ソ (U+30BD)	0x5c characters are the most problematic ones Lead and trail byte are identical. Full-width Katakana or DBCS
Chinese (Simplified)	936	0x5c Characters - 僜刓嘰塡奬媆孿 DBCS Alphabet - ａｂｃｄＡＢＣＤ Random DBCS Characters - 偁偄偙 0x5f Characters - 乢猒峗芲 Mixed SBCS and DBCS Characters - 偁A偄E偆I偊O偍U
Chinese (Traditional)	950	Boundary Characters (First Plane) - 一才中丙禳讒讖籲 Boundary Characters (Second Plane) - 乂氕氶汋纘鼊龤 Mixed SBCS and DBCS - 牷A礜I略U礎E漼O 0x5c Characters (trailing bytes are the path delimiter "\" character) - 尐赨塿槙箤踊 Double-byte alphabet - ａｂｃＡＢＣ 0x5f (DOS reserved char.) - 巢巢巢 0x7c (DOS reserved char.) - 悴矱悴矱 0xe5 (DOS char. used at beginning of file name to indicate that file has been deleted) - 勗脣勗脣
Korean	949	Korean DBCS characters - 가나다라똠푱뜡옺갂韓國 DBCS alphabet - ａｂｃＡＢＣＤ Mixed SBCS and DBCS - 아A이E우I에O오U Boundary characters (0x8141, 0xFDFE) - 갂갂갂詰詰詰 0xE5 (lead byte) - 夜野女語 0xA1A1 (Double-byte space) -　詰　갂
Thai	874	คำ	Two characters to form one character. Two issues to watch for: - Display. The circle should be on top of the character, not off to the right. - Caret movement. One click should jump over the entire cluster, not two clicks.
Vietnamese	1258	Ấ	Tall character
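
Several of the comments in the table come down to byte values that code written with ASCII or CP 1252 in mind will misread. Here is a minimal Python sketch of three of them, using only the standard codecs. The helper `naive_split_path` is hypothetical; it illustrates the kind of bug the table warns about and is not taken from any actual product code.

```python
# Japanese (CP 932): the trail byte of ソ is 0x5C, the same value as '\'.
so_bytes = "\u30bd".encode("cp932")  # ソ (U+30BD)

def naive_split_path(raw):
    """Hypothetical buggy helper: splits on the 0x5C byte with no DBCS awareness."""
    return raw.split(b"\\")

# Czech / Slovak (CP 1250): Š (ALT+0138) sits inside the 0128 to 0159 range
# that some code assumes is non-alphanumeric.
s_caron = bytes([138]).decode("cp1250")

# Russian (CP 1251): Ч and ч occupy the slots where CP 1252 has the
# multiplication and division signs.
che_as_1252 = "\u0427\u0447".encode("cp1251").decode("cp1252")

print(so_bytes.hex())              # bytes of ソ in CP 932
print(naive_split_path(so_bytes))  # the character is torn in half
print(s_caron, che_as_1252)
```

A byte-oriented path splitter tears ソ apart at its trail byte, which is why the 0x5c rows are called out as the most problematic ones.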

The entry that I found most interesting for the purposes of today's post is the one on Greek:

Greek	1253	Σ (ALT+0211) – capital letter sigma σ (ALT+0243) – small letter sigma ς (ALT+0242) – small letter final sigma Any of Greek accented characters	Both small sigma characters capitalize to the same capital letter. Final sigma only appears at the end of a word.

I ended up having a few conversations with people about what specific circumstances would make these characters dangerous, especially in light of the information in the above posts.

The answer I got was fascinating, and it is something I have run across many times in code in the past....

For some reason, many developers prefer to handle case-insensitive comparisons with the same "change case, then do a binary comparison" methodology that is used in OrdinalIgnoreCase and NTFS-style comparisons. And they often roll their own code here that lowercases rather than uppercases (or they use the CRT functions that lowercase rather than uppercase).

So what happens if one is trying to validate file paths by converting to lowercase and then doing a binary comparison? One gains an extra character (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA), because both small sigmas uppercase to the same capital letter but remain distinct when lowercased, and any file path validation one does will not match the actual file system.
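
To make the mismatch concrete, here is a small Python sketch. It assumes Python's `str.upper()` and `str.lower()` as stand-ins for the uppercasing and lowercasing steps (they are not the Windows or CRT casing tables), and both helper names are hypothetical.

```python
def equal_by_uppercasing(a, b):
    """Case-insensitive compare in the uppercase-then-binary-compare style."""
    return a.upper() == b.upper()

def equal_by_lowercasing(a, b):
    """Roll-your-own variant that lowercases instead."""
    return a.lower() == b.lower()

# The same Greek word, spelled once with U+03C2 (final sigma)
# and once with U+03C3 (regular small sigma) at the end.
with_final_sigma = "\u03bf\u03b4\u03bf\u03c2"
with_plain_sigma = "\u03bf\u03b4\u03bf\u03c3"

print(equal_by_uppercasing(with_final_sigma, with_plain_sigma))  # True
print(equal_by_lowercasing(with_final_sigma, with_plain_sigma))  # False
```

Under these assumptions, the uppercasing comparison treats the two names as the same file, while the lowercasing one treats them as different, so a validator built on the latter disagrees with a file system built on the former.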

Now I do agree with the decision to case the Greek script as happens now in Microsoft products, for the reasons discussed in The last word on the FINAL SIGMA. But it is hard to get away from the fact that many developers run into problems here, either because they are doing the wrong thing (in which case they are to blame) or because the CRT is doing the wrong thing (in which case one can blame the forces in the universe that conspire to do something in international standards that is not done by Microsoft).

I think I'll take a much wider view and perhaps blame the original decision in so much of Windows to support case insensitivity by uppercasing. Why on earth didn't they lowercase here? It would have made everything much easier, and then U+03c2 (GREEK SMALL LETTER FINAL SIGMA) wouldn't have to be a dangerous character....

Probably too late to do anything about it (though it is tempting to try to change Windows to lowercase for its case-insensitive binary comparisons and see what breaks!). We'll just have to live with the dangerous nature of this character.

Or maybe encode a GREEK CAPITAL LETTER FINAL SIGMA in Unicode; the fact that no such character exists hasn't stopped us in the past, so why let it stop us now? :-)

This post brought to you by ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)

Michael, I was in Turkey and could not log in to my Google account from the hotel PC. Later I found the reason was that my username had the letter 'i' in it. On the Turkish keyboard, the key for i [which is generally between u & o] produces a different 'i' than the one we use. But I found the letter i in a different place.

