The [Upper]Case of the Turkish İ (or: Casing, the 2nd)

by Michael S. Kaplan, published on 2004/12/03 08:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/03/274288.aspx

Wouldn't we do the case mapping to put the dotted and dotless variants together (so that both #1/#4 and #3/#2 would be case pairs)? Be honest, doesn't that make more sense?

We even have a good reason, if you think about it. I mean, it's not like the "I" in "him" sounds like the one in "nice" and neither of them sounds like the one in "niece" and none of them sounds like the one with no sound in "friend". So with all of those different sounds, English would be a lot simpler if we had an extra pair of letters to work with. I have talked to a lot of native speakers of other languages about languages (occupational hazard), and many suggest that one of the hard things about learning English is the multiple sounds for the same letter. We could actually move towards simplifying things by adding the complication of a few variations on letters....

Ah well, that probably won't happen. But hopefully you can see the basis that languages might have for wanting an "Å" or an "Ö" or a "Č" or an "İ" in their midst. And then, as I pointed out at the beginning of this post, if all of the variants of "I" did exist, it would be crazy to case them in any other way....

Of course, as you may have imagined, this plan does not exactly co-exist well with case insensitive registries, or filesystems (like FAT and NTFS). Suddenly that idea that seems more sensible looks like an awful security risk (I do not even have to imagine; I have built versions of Windows on my own development machine that would not boot because they were unable to find the "HKLM\SOFTWARE\MICROSOFT\Windows" registry key and have heard tales of the ones that were unable to find WIN.ini). And I have witnessed code reviews that had scores of developers scan through thousands of files in the .NET Framework to (among other things) properly not use "Turkic" casing when trying to look at the filesystem or the registry. It's amazing how difficult and expensive it can be to make a product behave intuitively....

See how I slipped the proper design into that last paragraph? If you said "yes" then I feel very clever, otherwise I don't. :-)

The right design is to use CultureInfo.CurrentCulture in your .NET code any time you want to get the (possibly different) casing behavior seen in Turkish and Azeri, like in strings that your end users would see. At the same time you would use CultureInfo.InvariantCulture for those cases where you want the invariant, unchanging behavior. And in unmanaged code you want LCMapString with the LCMAP_UPPERCASE/LCMAP_LOWERCASE transformations to use or not use the LCMAP_LINGUISTIC_CASING flag, depending on the same conditions.
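To make the split concrete, here is a minimal sketch in Python rather than .NET. Python's str.upper() and str.lower() do not take a culture, so the Turkic mapping below is hand-rolled and only stands in for what a Turkish CurrentCulture (or LCMAP_LINGUISTIC_CASING) would do. Every name in it (upper_invariant, upper_turkic, lower_turkic, same_key) is hypothetical and made up for this illustration.

```python
# Hand-rolled stand-ins for the two casing behaviours; all names here are
# hypothetical and exist only for this illustration.
TURKIC_UPPER = {"i": "\u0130", "\u0131": "I"}  # i -> İ, ı -> I
TURKIC_LOWER = {"I": "\u0131", "\u0130": "i"}  # I -> ı, İ -> i


def upper_invariant(text):
    """Culture-independent uppercasing: 'i' always becomes 'I'."""
    return text.upper()


def upper_turkic(text):
    """Turkish/Azeri uppercasing: swap the special pairs, then uppercase the rest."""
    return "".join(TURKIC_UPPER.get(ch, ch) for ch in text).upper()


def lower_turkic(text):
    """Turkish/Azeri lowercasing: swap the special pairs, then lowercase the rest."""
    return "".join(TURKIC_LOWER.get(ch, ch) for ch in text).lower()


def same_key(a, b):
    """Case-insensitive match for identifiers such as registry keys or file names."""
    return upper_invariant(a) == upper_invariant(b)


# Text shown to a Turkish user gets the linguistic casing.
display = upper_turkic("windows")

# Identifiers are compared with invariant casing, so the lookup still succeeds.
found = same_key("windows", "WINDOWS")

# Using the Turkic mapping for the lookup would miss the key.
missed = upper_turkic("windows") != "WINDOWS"
```

The point of keeping two separate functions is that the choice is made per call site: user-visible strings go through the linguistic path, while anything that names a file, a registry key or another identifier goes through the invariant one.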

> many suggest that one of the hard things
> about learning English is the multiple
> sounds for the same letter

The same happens a lot in other languages too, usually not as much as in English, but a lot more than they think it does.

And it would not be solved by adding phonetic markers (like Vietnamese) or changing the rules for phonetic characters (like Japanese kana), because social trends will still result in changing some pronunciations and the new rules will become just as obsolete as the old rules were.

Yes, I suspected this was true. I just know that in talking to native speakers of other languages (there are a lot of those at Microsoft!), no one ever seemed to think their language did it more....

Turkish folks sign on IRC a lot - or, at least, my IRC network (Nightstar) - and when they do they use the "windows-turkish" character set for 8-bit character sets (which the RFC demands and most clients use - though many modern clients use UTF-8). When they say I-with-dot or i-without-dot, they show up as Ý and ý, respectively. The cases are, of course, mixed in the way mentioned above, so when they miss (or just are lazy) it's rather obvious.

Vorn
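
The effect described in this comment can be reproduced with a short Python sketch. It assumes the sending client encodes in Windows-1254 (the "windows-turkish" code page) and the receiving client decodes the same bytes as Windows-1252; the comment does not say which code page the receiving side used, so that part is a guess.

```python
# Assumption: sender uses Windows-1254, receiver reads the bytes as Windows-1252.
dotted_capital_i = "\u0130"  # İ
dotless_small_i = "\u0131"   # ı

# Encode the two Turkish letters the way a Windows-1254 client would send them.
raw = (dotted_capital_i + dotless_small_i).encode("cp1254")

# Decode the same bytes the way a Western European client would display them.
shown = raw.decode("cp1252")

print(raw)    # the two bytes on the wire
print(shown)  # what the other side sees
```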


2013/04/04 You need to dot every İ, not dot any I, dot every i, not dot any ı, and cross every t in Turkish

2010/09/26 If case conversion were harder, people would do it less

2008/11/14 When features collide (aka Your LCID sucks, but sometimes the bug sucks more)

2008/06/25 Seeing the tears, my heart went out to her as I asked her "Why the Long S?"

2008/05/12 İn tıtlıng thıs ınclusıon ın re: the ınterests of Turkısh İSVs, am İ just tryıng to buıld İ's and ı's ınto the tıtle of thıs daıly contrıbutıon to SİAO (SıaO), amıgo?

2007/04/25 The nature of OrdinalIgnoreCase vs. intuitive expectations

2005/08/02 New in Vista Beta 1: more use of the word 'linguistic'

2005/06/05 The dasBlog 'Turkish I' thing figured out

2005/04/04 When casing does not need to roundtrip in .NET

2005/03/04 "Michael, why does ToTitleCase suck so much?"

2005/01/16 My apparent obsession with "case" puns

2005/01/16 How [case-]insensitive (apologies to Frank Sinatra)

2004/12/11 What does "linguistic casing" mean?