İn tıtlıng thıs ınclusıon ın re: the ınterests of Turkısh İSVs, am İ just tryıng to buıld İ's and ı's ınto the tıtle of thıs daıly contrıbutıon to SİAO (SıaO), amıgo?

by Michael S. Kaplan, published on 2008/05/12 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/05/12/8486796.aspx


This is a commonly reported issue in Windows and in many components that run upon it. A recent report can be seen on the Connect site, here:

Description:
There is a conversion problem in the c/c++ runtime library. Turkish characters ı/I and i/İ are converted incorrectly to upper/lower case.

Code that reproduces the issue:

#include "stdafx.h"
#include <string>
#include <xlocale>
#include <iostream>
#include <algorithm>
#include <functional>
int _tmain(int argc, _TCHAR* argv[]) {
    std::locale a = std::locale::locale();
    std::locale::global (std::locale("Turkish"));
    std::locale b = std::locale::locale();
    std::cout << a.name().c_str() << " -> " << b.name().c_str() << std::endl;
    std::wstring lowerCase = _T("ığüşiöç");
    std::wstring upperCase = _T("IĞÜŞİÖÇ");
    std::wstring upperResult, lowerResult;

    upperResult.resize(lowerCase.length());
    lowerResult.resize(lowerCase.length());
    std::transform(lowerCase.begin(), lowerCase.end(), upperResult.begin(), towupper);
    std::transform(upperCase.begin(), upperCase.end(), lowerResult.begin(), towlower);

    std::wcout << lowerCase << std::endl;
    std::wcout << lowerResult << std::endl;
    std::wcout << upperCase << std::endl;
    std::wcout << upperResult << std::endl;
    if (upperCase != upperResult || lowerCase != lowerResult) {
        std::cout << "Conversion failed" << std::endl;
    }
    return 0;
}

Observed Results:

    C -> Turkish_Turkey.1254
    ığüşiöç
    iğüşİöç
    IĞÜŞİÖÇ
    ıĞÜŞIÖÇ
    Conversion failed

Expected Results:

    C -> Turkish_Turkey.1254
    ığüşiöç
    ığüşiöç
    IĞÜŞİÖÇ
    IĞÜŞİÖÇ

The issue boils down to the very simple fact that the C runtime's casing functions are being used underneath this code, and LCMapString is being called underneath that.

They are passing the Turkish locale to LCMapString, but they are not passing the LCMAP_LINGUISTIC_CASING flag, which means that the Turkic case tables are not being used.
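To make the difference concrete, here is a minimal portable sketch of the two behaviors. It is not the CRT's code and it does not call LCMapString. The names CaseTable, upper_one, lower_one and map_string are hypothetical, and only ASCII plus the four "i" letters are modeled.

```cpp
#include <string>

// Hypothetical model, not the CRT source: which case table gets consulted.
enum class CaseTable { Default, Turkic };

constexpr char32_t kDotlessSmallI  = 0x0131; // ı
constexpr char32_t kDottedCapitalI = 0x0130; // İ

// Uppercase one code point under the chosen table.
char32_t upper_one(char32_t c, CaseTable table) {
    if (table == CaseTable::Turkic) {
        if (c == U'i') return kDottedCapitalI;   // i -> İ
        if (c == kDotlessSmallI) return U'I';    // ı -> I
    }
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - U'a' + U'A');
    return c; // default table: ı and İ come back unchanged
}

// Lowercase one code point under the chosen table.
char32_t lower_one(char32_t c, CaseTable table) {
    if (table == CaseTable::Turkic) {
        if (c == U'I') return kDotlessSmallI;    // I -> ı
        if (c == kDottedCapitalI) return U'i';   // İ -> i
    }
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c - U'A' + U'a');
    return c;
}

// Map a whole string, one code point at a time.
std::u32string map_string(const std::u32string& s, CaseTable table, bool to_upper) {
    std::u32string out;
    out.reserve(s.size());
    for (char32_t c : s) {
        out.push_back(to_upper ? upper_one(c, table) : lower_one(c, table));
    }
    return out;
}
```

With CaseTable::Default, uppercasing "ıi" gives "ıI", which matches the Observed Results in the bug report. With CaseTable::Turkic it gives "Iİ", which matches the Expected Results.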

On the surface, there is an easy fix -- just make LCMAP_LINGUISTIC_CASING get passed here, right?

Though it is of course not that simple or it would have been fixed years ago, and I wouldn't be blogging about it here....

I'll point toward two blog posts from the end of 2004:

Especially that second one, which points out the two things that the LCMAP_LINGUISTIC_CASING flag does:

  1. You get the right behavior for Turkic locales like Turkish and Azeri;
  2. You get a bunch of one-way mappings on all locales, e.g. U+03f1 (Greek Rho Symbol) will uppercase to U+03a1 (Capital Greek Rho), which will lowercase to U+03c1 (Small Greek Rho).

Now #1 is the "fix" for this bug, sure. It even kind of goes along with the C/C++ standards in this regard, e.g. 7.25.3.1.1.3 and 7.25.3.1.2.3 of C99:

7.25.3.1.1.3 (the towlower function): If the argument is a wide character for which iswupper is true and there are one or more corresponding wide characters, as specified by the current locale, for which iswlower is true, the towlower function returns one of the corresponding wide characters (always the same one for any given locale); otherwise, the argument is returned unchanged.

7.25.3.1.2.3 (the towupper function): If the argument is a wide character for which iswlower is true and there are one or more corresponding wide characters, as specified by the current locale, for which iswupper is true, the towupper function returns one of the corresponding wide characters (always the same one for any given locale); otherwise, the argument is returned unchanged.

It is #2 in that list of what the flag does, the one that adds all of the following other mappings, that makes all of this messier.

Those other mappings are:

Uppercase (all locales other than Azeri and Turkish):

U+0131 --> U+0049 (LATIN SMALL LETTER DOTLESS I --> LATIN CAPITAL LETTER I)
U+01c5 --> U+01c4 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON --> LATIN CAPITAL LETTER DZ WITH CARON)
U+01c8 --> U+01c7 (LATIN CAPITAL LETTER L WITH SMALL LETTER J --> LATIN CAPITAL LETTER LJ)
U+01cb --> U+01ca (LATIN CAPITAL LETTER N WITH SMALL LETTER J --> LATIN CAPITAL LETTER NJ)
U+01f2 --> U+01f1 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z --> LATIN CAPITAL LETTER DZ)
U+0390 --> U+03aa (GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS --> GREEK CAPITAL LETTER IOTA WITH DIALYTIKA)
U+03b0 --> U+03ab (GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS --> GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA)
U+03d0 --> U+0392 (GREEK BETA SYMBOL --> GREEK CAPITAL LETTER BETA)
U+03d1 --> U+0398 (GREEK THETA SYMBOL --> GREEK CAPITAL LETTER THETA)
U+03d5 --> U+03a6 (GREEK PHI SYMBOL --> GREEK CAPITAL LETTER PHI)
U+03d6 --> U+03a0 (GREEK PI SYMBOL --> GREEK CAPITAL LETTER PI)
U+03f0 --> U+039a (GREEK KAPPA SYMBOL --> GREEK CAPITAL LETTER KAPPA)
U+03f1 --> U+03a1 (GREEK RHO SYMBOL --> GREEK CAPITAL LETTER RHO)

Lowercase (all locales other than Azeri and Turkish):

U+0130 --> U+0069 (LATIN CAPITAL LETTER I WITH DOT ABOVE --> LATIN SMALL LETTER I)
U+01c5 --> U+01c6 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON --> LATIN SMALL LETTER DZ WITH CARON)
U+01c8 --> U+01c9 (LATIN CAPITAL LETTER L WITH SMALL LETTER J --> LATIN SMALL LETTER LJ)
U+01cb --> U+01cc (LATIN CAPITAL LETTER N WITH SMALL LETTER J --> LATIN SMALL LETTER NJ)
U+01f2 --> U+01f3 (LATIN CAPITAL LETTER D WITH SMALL LETTER Z --> LATIN SMALL LETTER DZ)
U+03d2 --> U+03c5 (GREEK UPSILON WITH HOOK SYMBOL --> GREEK SMALL LETTER UPSILON)
U+03d3 --> U+03cd (GREEK UPSILON WITH ACUTE AND HOOK SYMBOL --> GREEK SMALL LETTER UPSILON WITH TONOS)
U+03d4 --> U+03cb (GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL --> GREEK SMALL LETTER UPSILON WITH DIALYTIKA)
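As a rough check of the "one-way" claim, the sketch below encodes the Uppercase list above as data. The third column, the ordinary Unicode lowercase of each target, is my addition and is not part of the list. The names OneWay, kUpperList, is_lossy and count_lossy are hypothetical.

```cpp
#include <array>
#include <cstddef>

// One row: a source code point, what it uppercases to under linguistic
// casing, and what that uppercase form lowercases back to.
struct OneWay {
    char32_t source;
    char32_t upper;
    char32_t lower_of_upper; // assumption: standard Unicode simple lowercase
};

constexpr std::array<OneWay, 13> kUpperList = {{
    {0x0131, 0x0049, 0x0069}, // dotless i -> I -> i
    {0x01C5, 0x01C4, 0x01C6}, // titlecase Dz with caron
    {0x01C8, 0x01C7, 0x01C9}, // titlecase Lj
    {0x01CB, 0x01CA, 0x01CC}, // titlecase Nj
    {0x01F2, 0x01F1, 0x01F3}, // titlecase Dz
    {0x0390, 0x03AA, 0x03CA}, // iota with dialytika and tonos
    {0x03B0, 0x03AB, 0x03CB}, // upsilon with dialytika and tonos
    {0x03D0, 0x0392, 0x03B2}, // beta symbol
    {0x03D1, 0x0398, 0x03B8}, // theta symbol
    {0x03D5, 0x03A6, 0x03C6}, // phi symbol
    {0x03D6, 0x03A0, 0x03C0}, // pi symbol
    {0x03F0, 0x039A, 0x03BA}, // kappa symbol
    {0x03F1, 0x03A1, 0x03C1}, // rho symbol
}};

// True when upper-then-lower does not return the original code point.
bool is_lossy(const OneWay& m) {
    return m.lower_of_upper != m.source;
}

// Number of entries in the list that fail to round-trip.
std::size_t count_lossy() {
    std::size_t n = 0;
    for (const OneWay& m : kUpperList) {
        if (is_lossy(m)) ++n;
    }
    return n;
}
```

Every entry in the list fails the round trip, so count_lossy() equals the size of the list.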

Now these somewhat random behaviors exist in all of the case mappings that happen in .NET except for the ones based on the InvariantCulture. That ends up as the source of a lot of unexpected behavior that pops up from time to time, and not only due to the fact that as lists go they are incomplete....

You can look at them and probably guess some of the problems these strange one-way conversions can cause!

But adding this to the CRT's behavior would essentially be adding these non-reversible transformations to almost every call. Which is not really desirable behavior, in some people's minds....

The real question would be whether this bug would be considered more reasonable to fix if Win32 supported a more granular kind of functionality than LCMAP_LINGUISTIC_CASING provides -- a way that would keep the linguistically useful separate from the random "rehabilitate symbols" crap and all of the rest....
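Purely as a thought experiment, a more granular design might look something like the sketch below. Win32 offers no such split, and every name here (CasingOptions, kTurkicI, kFoldSymbols, upper_granular) is hypothetical. Only a couple of the symbol mappings are modeled.

```cpp
#include <cstdint>

// Hypothetical flags; Win32 has no such split.
enum CasingOptions : std::uint32_t {
    kPlain       = 0,
    kTurkicI     = 1u << 0, // dotted/dotless i rules, honored only for Turkic locales
    kFoldSymbols = 1u << 1, // the one-way "rehabilitate symbols" mappings
};

// Uppercase one code point, consulting only the behaviors the caller asked for.
char32_t upper_granular(char32_t c, std::uint32_t options, bool turkic_locale) {
    if ((options & kTurkicI) != 0 && turkic_locale) {
        if (c == U'i') return 0x0130;  // i -> İ
        if (c == 0x0131) return U'I';  // ı -> I
    }
    if ((options & kFoldSymbols) != 0) {
        if (c == 0x03F1) return 0x03A1; // rho symbol -> capital rho
        if (c == 0x03D0) return 0x0392; // beta symbol -> capital beta
    }
    if (c >= U'a' && c <= U'z') return static_cast<char32_t>(c - U'a' + U'A');
    return c;
}
```

Under this split, a caller such as the CRT could ask for kTurkicI alone and get the linguistically useful behavior for Turkish and Azeri without picking up the non-reversible symbol mappings on every other locale.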

 

This blog brought to you by all of the above cited Unicode characters...


