The [Upper]Case of the Turkish İ (or: Casing, the 2nd)

by Michael S. Kaplan, published on 2004/12/03 08:13 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2004/12/03/274288.aspx


I think the Turkish folks have it right.

After all, say that we had all of the following characters in English:

  1. I   U+0049   LATIN CAPITAL LETTER I
  2. i   U+0069   LATIN SMALL LETTER I
  3. İ   U+0130   LATIN CAPITAL LETTER I WITH DOT ABOVE
  4. ı   U+0131   LATIN SMALL LETTER DOTLESS I

Wouldn't we do the case mapping to put the dotted and dotless variants together (so that both #1/#4 and #3/#2 would be case pairs)? Be honest, doesn't that make more sense?
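To make the pairing concrete, here is a minimal sketch in Python of what casing looks like when #1/#4 and #3/#2 are the case pairs. Python's built-in str.upper() and str.lower() are language-neutral, so the Turkic-style behavior is layered on top with a small translation table. The helper names turkic_upper and turkic_lower are hypothetical, made up for this illustration; they are not part of any library.

```python
# Turkic-style case pairs from the list above: I <-> ı and İ <-> i.
# The helper names below are hypothetical, invented for this sketch.
_TURKIC_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})
_TURKIC_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})


def turkic_upper(text: str) -> str:
    """Uppercase with dotted i -> İ (U+0130) and dotless ı -> I."""
    # Remap the two special letters first, then let the default
    # mapping handle every other character.
    return text.translate(_TURKIC_UPPER).upper()


def turkic_lower(text: str) -> str:
    """Lowercase with dotless I -> ı (U+0131) and dotted İ -> i."""
    return text.translate(_TURKIC_LOWER).lower()


if __name__ == "__main__":
    for word in ["i", "\u0131", "istanbul"]:
        print(word, "| default:", word.upper(), "| Turkic:", turkic_upper(word))
```

With the default mapping both "i" and "ı" uppercase to the same "I", so the distinction is lost; with the Turkic-style mapping each letter keeps its dot (or lack of one) through a round trip.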

We even have a good reason, if you think about it. I mean, it's not like the "I" in "him" sounds like the one in "nice", and neither of them sounds like the one in "niece", and none of them sounds like the silent one in "friend". So with all of those different sounds, English would be a lot simpler if we had an extra pair of letters to work with. I have talked to a lot of native speakers of other languages about languages (occupational hazard), and many suggest that one of the hard things about learning English is the multiple sounds for the same letter. We could actually move towards simplifying things by adding the complication of a few variations on letters....

Ah well, that probably won't happen. But hopefully you can see the basis that languages might have for wanting an "Å" or an "Ö" or a "Č" or an "İ" in their midst. And then, like I pointed out at the beginning of this post, if all of the variants of "I" did exist, it would be crazy to case them in any other way....

Of course, as you may have imagined, this plan does not exactly co-exist well with case-insensitive registries or filesystems (like FAT and NTFS). Suddenly the idea that seemed more sensible looks like an awful security risk. (I do not even have to imagine; I have built versions of Windows on my own development machine that would not boot because they were unable to find the "HKLM\SOFTWARE\MICROSOFT\Windows" registry key, and I have heard tales of ones that were unable to find WIN.ini.) And I have witnessed code reviews that had scores of developers scan through thousands of files in the .NET Framework to (among other things) make sure "Turkic" casing was not used when looking at the filesystem or the registry. It's amazing how difficult and expensive it can be to make a product behave intuitively....

See how I slipped the proper design into that last paragraph? If you said "yes" then I feel very clever, otherwise I don't. :-)

The right design is to use CultureInfo.CurrentCulture in your .NET code any time you want to get the (possibly different) casing behavior seen in Turkish and Azeri, as in strings that your end users would see. At the same time, you would use CultureInfo.InvariantCulture for those cases where you want the invariant, unchanging behavior, as with filesystem paths and registry keys. And in unmanaged code you want LCMapString with the LCMAP_UPPERCASE/LCMAP_LOWERCASE transformations to use or not use the LCMAP_LINGUISTIC_CASING flag, depending on the same conditions.
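The same split can be sketched outside .NET. The Python below is only an illustration under stated assumptions: Python's str.upper() stands in for the invariant behavior, a small translation table stands in for Turkish/Azeri linguistic casing, and STORE is a made-up dictionary standing in for a case-insensitive registry. The function names are hypothetical and none of this is the actual Windows or .NET implementation.

```python
# Hypothetical sketch: pick the casing rule by what the string is for.
_TURKIC_UPPER = str.maketrans({"i": "\u0130", "\u0131": "I"})


def upper_for_display(text: str, turkic: bool) -> str:
    """Casing for text an end user reads; follows the user's language."""
    if turkic:
        return text.translate(_TURKIC_UPPER).upper()
    return text.upper()


def upper_for_lookup(text: str) -> str:
    """Casing for identifiers (paths, keys); never varies by language."""
    return text.upper()


# Made-up stand-in for a case-insensitive store keyed by uppercased names.
STORE = {r"HKLM\SOFTWARE\MICROSOFT\WINDOWS": "found"}
key = r"HKLM\Software\Microsoft\Windows"

if __name__ == "__main__":
    # Invariant casing finds the key no matter who is logged in.
    print(STORE.get(upper_for_lookup(key)))
    # Linguistic casing turns every "i" into "İ", so the lookup misses.
    print(STORE.get(upper_for_display(key, turkic=True)))
```

The point of keeping two separate functions is that the caller has to say which kind of string it is handling, rather than inheriting whatever the current user's settings happen to be.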

It's easy to remember and easy to do, if you learn it in the first place. :-)


# Norman Diamond on 24 Dec 2004 7:41 PM:

> many suggest that one of the hard things
> about learning English is the multiple
> sounds for the same letter

The same happens a lot in other languages too, usually not as much as in English, but a lot more than they think it does.

And it would not be solved by adding phonetic markers (like Vietnamese) or changing the rules for phonetic characters (like Japanese kana), because social trends will still result in changing some pronunciations and the new rules will become just as obsolete as the old rules were.

# Michael Kaplan on 24 Dec 2004 8:06 PM:

Yes, I suspected this was true. I just know from talking to native speakers of other languages (there are a lot of those at Microsoft!) that no one ever seemed to think their language did it more....

# Vorn on 24 Dec 2004 10:59 PM:

Turkish folks sign on IRC a lot - or, at least, my IRC network (Nightstar) - and when they do they use the "windows-turkish" character set for 8-bit character sets (which the RFC demands and most clients use - though many modern clients use UTF-8). When they say I-with-dot or i-without-dot, they show up as Ý and ý, respectively. The cases are, of course, mixed in the way mentioned above, so when they miss (or just are lazy) it's rather obvious.

Vorn

# cumaozturk on 27 Dec 2008 10:18 AM:

Help Turkısh

# cumaozturk on 27 Dec 2008 10:20 AM:

Turkısh help


referenced by

2013/04/04 You need to dot every İ, not dot any I, dot every i, not dot any ı, and cross every t in Turkish

2010/09/26 If case conversion were harder, people would do it less

2008/11/14 When features collide (aka Your LCID sucks, but sometimes the bug sucks more)

2008/06/25 Seeing the tears, my heart went out to her as I asked her "Why the Long S?"

2008/05/12 İn tıtlıng thıs ınclusıon ın re: the ınterests of Turkısh İSVs, am İ just tryıng to buıld İ's and ı's ınto the tıtle of thıs daıly contrıbutıon to SİAO (SıaO), amıgo?

2007/04/25 The nature of OrdinalIgnoreCase vs. intuitive expectations

2005/08/02 New in Vista Beta 1: more use of the word 'linguistic'

2005/06/05 The dasBlog 'Turkish I' thing figured out

2005/04/04 When casing does not need to roundtrip in .NET

2005/03/04 "Michael, why does ToTitleCase suck so much?"

2005/01/16 My apparent obsession with "case" puns

2005/01/16 How [case-]insensitive (apologies to Frank Sinatra)

2004/12/11 What does "linguistic casing" mean?
