The difference between 'Dangerous Characters' and 'Dangerous Minds' is the lack of Michelle Pfeiffer

by Michael S. Kaplan, published on 2007/06/12 08:35 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2007/06/12/3251648.aspx


In internationalization contexts, one often hears about the notion of dangerous characters.

This is not (as it may sound) about the criminal element, but rather about specific Unicode characters that can cause problems if their consequences are not taken into account.

Here is (for example) what one of the resources provides (I got this from Gwyneth Marshall over in Office):

| Language | CodePage | ‘Problem’ characters | Comments |
|---|---|---|---|
| German | 1252 | ß (U+00DF) | |
| French | 1252 | æ (ALT+0230), Æ (ALT+0198); œ (ALT+0156), Œ (ALT+0140); ç (ALT+0231), Ç (ALT+0199); î (ALT+0238), Î (ALT+0206) | The æ chars are not very common so I included a fourth char î. |
| Spanish | 1252 | | |
| Italian | 1252 | Any 3 of à(À), è(È), é(É), ì(Ì), ò(Ò), ù(Ù) | Some features useful in English are useless for Italian. Capitalizing the letter “i” to “I” is an example of a feature that must be cut from the Italian version of Word. |
| Swedish | 1252 | å (ALT+0229); ä (ALT+0228); ö (ALT+0246) | |
| Brazilian | 1252 | | |
| Dutch | 1252 | Any 3 of accented characters (vowels) | With the Belgian Dutch AZERTY keyboard layout some characters (e.g. some of those entered with the AltGr key) are sometimes impossible to make, especially for accelerator keys. Sub would like some testing of support of this keyboard layout. |
| Danish | 1252 | å (ALT+0229); æ (ALT+0230); ø (ALT+0248) | Sometimes a problem with æ getting incorrectly separated to a and e. |
| Norwegian | 1252 | å (ALT+0229); æ (ALT+0230); ø (ALT+0248) | |
| Finnish | 1252 | å (ALT+0229); ä (ALT+0228); ö (ALT+0246) | |
| Portuguese | 1252 | Any 3 of á, é, í, ó, ú, à, ê, ô, ã, ç | None of these extended characters tend to cause major problems, but they should not be used as hot keys. |
| Czech / Slovak | 1250 | Š (ALT+0138), š (ALT+0154); Ť (ALT+0141), ť (ALT+0157); Ž (ALT+0142), ž (ALT+0158) | Characters within the range 0128 to 0159 are often problematic because developers can assume that this range is non-alphanumeric. |
| Polish | 1250 | Ś (ALT+0140), ś (ALT+0156); Ź (ALT+0143), ź (ALT+0159) | These characters are mentioned based on the same reasoning as for Czech. |
| Hungarian | 1250 | ő (ALT+0245), Ő (ALT+0213); ű (ALT+0251), Ű (ALT+0219) | These CE characters are specific to Hungarian. |
| Slovenian | 1250 | Č (ALT+0200), č (ALT+0232); Ž (ALT+0142), ž (ALT+0158); Š (ALT+0138), š (ALT+0154) | |
| Russian | 1251 | я (ALT+0255); Ч (ALT+0215), ч (ALT+0247); Ё (ALT+0168), ё (ALT+0184); р (ALT+0240) | я: because it is the last letter in the codepage. Ч, ч: because CP 1252 has multiplication and division in these places. Ё, ё: because these letters are outside the main range of Russian letters. |
| Greek | 1253 | Σ (ALT+0211) – capital letter sigma; σ (ALT+0243) – small letter sigma; ς (ALT+0242) – small letter final sigma; any of Greek accented characters | Both small sigma characters capitalize to the same capital letter. Final sigma only appears at the end of a word. |
| Turkish | 1254 | ı (ALT+0253), I (ALT+0073); i (ALT+0105), İ (ALT+0221); ğ (ALT+0240), Ğ (ALT+0208); ş (ALT+0254), Ş (ALT+0222) | Most of these characters are the only ones that are not in CP 1252 but are in 1254. Possible problems with I include setup files starting with I, any registry entries that contain uppercased I, and auto upper/lower casing in apps. |
| Japanese | 932 | 0x5c Characters - ソ十申暴構能; 0x5f Characters (DBCS) - 雲契活神点農; 0x7b Characters - ボ施倍府本宮; 0x7d Characters - マ笠急党図迎; 0x7e Characters - ミ円救降冬梅; 0x5b Characters - ゼ夕票充端納; 0x5d Characters - ゾ従転脳評競; 0xe5 Characters - 怜蒟栁ょ溷瑯; (U+745e); (U+30BD) | 0x5c characters are the most problematic ones. Lead and trail byte are identical. Full-width Katakana or DBCS. |
| Chinese (Simplified) | 936 | 0x5c Characters - 僜刓嘰塡奬媆孿; DBCS Alphabet - abcdABCD; Random DBCS Characters - 偁偄偙; 0x5f Characters - 乢猒峗芲; Mixed SBCS and DBCS Characters - 偁A偄E偆I偊O偍U | |
| Chinese (Traditional) | 950 | Boundary Characters (First Plane) - 一才中丙禳讒讖籲; Boundary Characters (Second Plane) - 乂氕氶汋纘鼊龤; Mixed SBCS and DBCS - 牷A礜I略U礎E漼O; 0x5c Characters (trailing bytes are the path delimiter "\" character) - 尐赨塿槙箤踊; Double-byte alphabet - abcABC; 0x5f (DOS reserved char.) - 巢巢巢; 0x7c (DOS reserved char.) - 悴矱悴矱; 0xe5 (DOS char. used at beginning of file name to indicate that file has been deleted) - 勗脣勗脣 | |
| Korean | 949 | Korean DBCS characters - 가나다라똠푱뜡옺갂韓國; DBCS alphabet - abcABCD; Mixed SBCS and DBCS - 아A이E우I에O오U; Boundary characters (0x8141, 0xFDFE) - 갂갂갂詰詰詰; 0xE5 (lead byte) - 夜野女語; 0xA1A1 (Double-byte space) - 詰 갂 | |
| Thai | 874 | คำ | Two characters to form one character. Two issues to watch for: (1) Display: the circle should be on top of the character, not off to the right. (2) Caret movement: one click should jump over the entire cluster, not two clicks. |
| Vietnamese | 1258 | Ấ | Tall character |
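
Two of the byte-level hazards listed in the table can be checked directly. What follows is a minimal Python sketch, not something taken from the resource itself. It assumes that Python's built-in `cp932` and `cp1250` codecs agree with the Windows code pages for these characters, and the helper names are made up for illustration.

```python
BACKSLASH = 0x5C  # the byte value of the path delimiter "\"


def double_byte_chars_ending_in(text, codec, byte_value):
    """Return the characters whose two-byte encoding ends in byte_value."""
    hits = []
    for ch in text:
        encoded = ch.encode(codec)
        if len(encoded) == 2 and encoded[1] == byte_value:
            hits.append(ch)
    return hits


def letters_in_byte_range(codec, start, stop):
    """Return (byte, char) pairs in [start, stop) that decode to letters."""
    found = []
    for value in range(start, stop):
        try:
            ch = bytes([value]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this code page
        if ch.isalpha():
            found.append((value, ch))
    return found


# Japanese "0x5c characters": the trail byte looks like a backslash.
japanese_sample = "ソ十申暴構能"
risky = double_byte_chars_ending_in(japanese_sample, "cp932", BACKSLASH)

# A byte-oriented split on "\" cuts the character in half.
naive_split = "ソ".encode("cp932").split(b"\\")

# Central European letters living in the 0128 to 0159 range of CP 1250.
central_european = letters_in_byte_range("cp1250", 128, 160)
```

Code that scans bytes for a path delimiter, or that treats 0128 to 0159 as non-alphanumeric, would mishandle exactly these characters, which is presumably why they make good test data.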
 

The entry that I found most interesting for the purposes of today's post is the one on Greek:

| Language | CodePage | ‘Problem’ characters | Comments |
|---|---|---|---|
| Greek | 1253 | Σ (ALT+0211) – capital letter sigma; σ (ALT+0243) – small letter sigma; ς (ALT+0242) – small letter final sigma; any of Greek accented characters | Both small sigma characters capitalize to the same capital letter. Final sigma only appears at the end of a word. |

 

Now this is something I have talked about before, in the following posts:

I ended up having a few conversations with people about what specific circumstances would make these characters dangerous, especially in light of the information in the above posts.

The answer I got was fascinating, and it is something I have run across many times in code in the past....

For some reason, many developers prefer to handle case-insensitive comparisons with the same "change case, then do a binary comparison" methodology that is used in OrdinalIgnoreCase and NTFS-style comparisons. And they often roll their own code here that lowercases rather than uppercases (or they use the CRT functions that lowercase rather than uppercase).

So what happens if one is trying to validate file paths by converting to lower case and then doing a binary comparison? One gains an extra character (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA) that uppercasing would have folded away, and any file path validation one does will not match the actual file system.
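
To make the mismatch concrete, here is a small Python sketch of the two "change case, then compare binary" strategies. It is only an illustration: Python's `str.upper()` and `str.lower()` use the Unicode default case mappings, which stand in here for the Windows and CRT casing tables and may not agree with them in every detail, and the function names and file names are hypothetical.

```python
def equal_by_uppercasing(a, b):
    """Uppercase both strings, then compare code point by code point."""
    return a.upper() == b.upper()


def equal_by_lowercasing(a, b):
    """Lowercase both strings, then compare code point by code point."""
    return a.lower() == b.lower()


# Two hypothetical file names that differ only in which small sigma they use.
name_with_final_sigma = "οδος.txt"   # ends in ς (U+03C2)
name_with_plain_sigma = "οδοσ.txt"   # ends in σ (U+03C3)

# Both small sigmas uppercase to Σ, so the uppercasing comparison matches.
same_when_uppercased = equal_by_uppercasing(
    name_with_final_sigma, name_with_plain_sigma
)

# Lowercasing leaves ς and σ as they are, so the comparison does not match.
same_when_lowercased = equal_by_lowercasing(
    name_with_final_sigma, name_with_plain_sigma
)
```

A validator built on the lowercasing version would treat these as two different names, while a store that compares by uppercasing would treat them as one.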

Now I do agree with the decision to case the Greek script as happens now in Microsoft products for the reasons discussed in The last word on the FINAL SIGMA. But it is hard to get away from the fact that many developers run into problems here, either because they are doing the wrong thing (in which case they are to blame) or because the CRT is doing the wrong thing (in which case one can blame the forces in the universe that conspire to do something in international standards that is not done by Microsoft).

I think I'll take a much wider view and perhaps blame the original decision in so much of Windows to support case insensitivity by uppercasing. Why on earth didn't they lowercase here? It would have made everything much easier, and then U+03c2 (GREEK SMALL LETTER FINAL SIGMA) wouldn't have to be a dangerous character....

Probably too late to do anything about it (though it is tempting to try to change Windows to lowercase for its case-insensitive binary comparisons and see what breaks!). We'll just have to live with the dangerous nature of this character.

Or maybe encode a GREEK CAPITAL LETTER FINAL SIGMA in Unicode. The fact that no such character exists hasn't stopped us in the past, so why let it stop us now? :-)

 

This post brought to you by ς (U+03c2, a.k.a. GREEK SMALL LETTER FINAL SIGMA)


# Abeywickrama on Thursday, June 14, 2007 1:27 AM:

Michael, I was in Turkey and could not log in to my Google account from the hotel PC. Later I found the reason was that my username had the letter 'i' in it. On the Turkish keyboard, the letter i [which is generally between u & o] represents a different 'i' than what we use. But I found the letter i in a different place.


