The i of the Turk, and the Turkey test

by Michael S. Kaplan, published on 2010/01/14 08:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/01/14/9948466.aspx


Gwyneth's question was an interesting one:

Out of curiosity, do you know the history of why Unicode didn’t create separate characters for the Turkish i? The i is the only character that changes casing based on the language (Turkish/Azeri). I did a little searching online, but didn’t find any obvious references to the rationale behind this.

Thanks!

As was Peter's answer:

That would have been a Unicode 1.0 decision, before I had even heard of Unicode (which I first heard about around 1992/3), so I’m not sure. I suspect it was pre-determined by legacy standards. Encoding a Turkish i probably wouldn’t have been enough; it probably would have been necessary to encode a separate Turkish I as well. Arguably, both would have been duplicating characters and certainly they would have resulted in confusion with 0049/0069 – likely with some data getting encoded one way and other data using the other. Chances are we would have ended up facing the casing issues as well as data in mixed representations.

This sums up the principal reasons quite nicely!

It is unfortunate that the experience most people have with Turkish is how it highlights code that does not handle globalization issues (blogs like this one summarize the approach quite well and the "Turkey Test" is no worse than anything else one could call it).
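For anyone who has not seen the casing problem firsthand, here is a minimal Python sketch of it. It assumes Python 3, whose built-in str.upper() and str.lower() apply only the language-neutral Unicode mappings. The helpers turkic_upper and turkic_lower are hypothetical names made up for this illustration, not part of any library, and they handle only precomposed characters (a decomposed I followed by U+0307 COMBINING DOT ABOVE is not covered).

```python
# Illustration only: language-neutral casing versus Turkish/Azeri casing.
# turkic_upper and turkic_lower are hypothetical helpers for this sketch.

DOTTED_CAPITAL_I = "\u0130"  # İ  LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # ı  LATIN SMALL LETTER DOTLESS I


def turkic_upper(text: str) -> str:
    """Uppercase using the Turkic pairing i -> İ (ı -> I is already the default)."""
    # Map the dotted small i first, then let the default mapping do the rest.
    return text.replace("i", DOTTED_CAPITAL_I).upper()


def turkic_lower(text: str) -> str:
    """Lowercase using the Turkic pairings İ -> i and I -> ı."""
    # Handle the two special capitals before the default mapping sees them.
    return (
        text.replace(DOTTED_CAPITAL_I, "i")
        .replace("I", DOTLESS_SMALL_I)
        .lower()
    )


# The default mapping pairs i with I, which is wrong for Turkish text.
print("title".upper())        # TITLE
print(turkic_upper("title"))  # TİTLE
print(turkic_lower("TITLE"))  # tıtle
```

Code that compares identifiers or keywords by upper- or lowercasing them under the user's Turkish locale hits the same mismatch from the other direction, which is why culture-independent casing is the usual advice for non-linguistic strings.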

Though I think I still owe a blog post discussing vowel harmony and other linguistic features affecting Turkish. It is coming, eventually. The goal of giving Turkish a better legacy than the Turkey Test may be tilting at windmills, but maybe I can at least point out there is more out there....


John Cowan on 14 Jan 2010 3:39 PM:

It should really be called the Turkic Test, since Turkish isn't the only relevant Turkic language here.

One of the first things I proposed when I joined unicode@unicode.org was that two new characters be encoded to completely separate Turkic I, İ, i, and ı from the general Latin I and i.  One of the old Unicode farts of the list promptly explained to me that there was far too much mixed-language text in ISO 8859-9, which made no such distinctions, and that expecting people to go through and fix it all up was hopeless -- and I was enlightened.

When I became an old Unicode fart myself, I posted a similar explanation for the next few newbies on the list with the same clever idea, and it wound up in the Unicode FAQ (or at least I thought so, but I can't find it now).

My tongue-in-cheek explanation of why "ctype.h" doesn't work well for Unicode is still there, though.

Michael S. Kaplan on 19 Jan 2010 3:24 PM:

It should really be called the Turkic Test, since Turkish isn't the only relevant Turkic language here.

John, in principle I agree; in practice the number of people aiming at Turkish is small enough; the number aiming at non-Turkish Turkic languages is so small that I am unsurprised the distinction was not made in the site I linked to....

