How To [NOT] detect that a locale is bidi

by Michael S. Kaplan, published on 2006/03/03 08:15 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2006/03/03/542963.aspx


I thought I'd give a healthy sample of examples of how NOT to get the answer to the question "Is this a locale whose language uses bidirectional text?". All of these examples have actually been used at various points in Microsoft products, though they are [usually] being replaced as fast as they are found....

The names of the functions and identifying comments have been changed to protect the guilty (and the embarrassed).

Hopefully you have none of these in your own code -- or at the very least you hopefully will have none within a day or two of reading this post! :-)

BOOL IsBidi(LANGID langid) {
    switch(langid) {
        case 1025: // Arabic
        case 1037: // Hebrew
        case 1065: // Farsi
            return true;
        default:
            return false;
    }
}

 

BOOL IsBidi(LCID lcid) {
    return (PRIMARYLANGID(lcid) == LANG_ARABIC ||
            PRIMARYLANGID(lcid) == LANG_HEBREW);
}

 

BOOL IsBiDiLcid(LCID lcid)
{
    if (lcid == 0x0859 ||  //Sindhi (Arabic script)
        lcid == 0x0460)    //Kashmiri (Arabic script)
        return TRUE;   

    switch (PRIMARYLANGID(lcid))
    {
        case LANG_ARABIC:
        case LANG_HEBREW:
        case LANG_URDU:
        case LANG_FARSI:
        case LANG_PASHTO:
        case LANG_UIGHUR:
        case LANG_SYRIAC:
        case LANG_DIVEHI:
            return TRUE;
    }

    return FALSE;
}

 

BOOL GetBidiStatus(LANGID langid) {

    WCHAR szDayName[80] = {0};
    WORD  ctype = 0;

    GetLocaleInfo( langid,
                   LOCALE_SDAYNAME1,
                   szDayName,
                   sizeof(szDayName) / sizeof(WCHAR) );

    GetStringType( CT_CTYPE2,
                   &szDayName[0],
                   1,
                   &ctype );

    return( ctype & C2_RIGHTTOLEFT );
}

 

BOOL IsBidiLocale(LCID locale)
{
    DWORD layout;

    return(GetProcessDefaultLayout(&layout) &&
           (layout & LAYOUT_RTL));

}

 

BOOL IsBidirectional(LCID lcid) {
    LOCALESIGNATURE signature;

    GetLocaleInfoW(lcid,
                  LOCALE_FONTSIGNATURE,
                  (LPWSTR)&signature,
                  sizeof(signature));

    if(signature.lsCsbSupported[0] & 0x60) {
        return(TRUE);
    } else {
        return(FALSE);
    }

}

 

BOOL FDetectBidi(LCID lcid) {
    FONTSIGNATURE fontsignature;

    GetLocaleInfoW(lcid,
                  LOCALE_FONTSIGNATURE,
                  (LPWSTR)&fontsignature,
                  sizeof(fontsignature));

    return(fontsignature.fsUsb[0] & 0x2800);
}

 

The number of problems with the above is HUGE. Buffer overruns, wrong parameters, not checking for return values, ignoring the LCID, making the wrong checks. It is enough to make you want to tear your hair out. Or, if you are a bit more clear-headed, enough to make you want to tear out the hair of some other developers whose names appeared on the SLM logs checking in the code....

Though I thought the use of a locale field like SDAYNAME1 (several other examples along this line with different values were also around) was somewhat clever (especially in downlevel cases where the preferred solution is not available).

Though I worry about the cases where the field chosen may not start with a strong right-to-left character....

Anyway, the best solution is to use the mystical bit #123 of the Unicode subset bitfields of the LOCALESIGNATURE which is defined as "Layout progress: horizontal from right to left" and which is around in Windows 2000 and later. Here is an example of that best approach:

    WCHAR wchLCIDFontSig[16];
    
    if (GetLocaleInfoW(lcid,
                       LOCALE_FONTSIGNATURE,
                       &wchLCIDFontSig[0],
                       (sizeof(wchLCIDFontSig)/sizeof(WCHAR))) &&
        (wchLCIDFontSig[7] & (WCHAR)0x0800))

or if you prefer to work with the actual locale signature rather than the bytes of a WCHAR array that are actually a locale signature underneath, it would be something more like the following (assuming my bit math is not faulty!):

    LOCALESIGNATURE localesig;
    
    if (GetLocaleInfoW(lcid,
                       LOCALE_FONTSIGNATURE,
                       (LPWSTR)&localesig,
                       (sizeof(localesig)/sizeof(WCHAR))) &&
        (localesig.lsUsb[3] & 0x08000000))
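
For anyone who wants to double-check that bit math, here is a minimal sketch in portable C. The helper names are hypothetical (they are not part of any Windows header), and the WCHAR view assumes the little-endian layout Windows uses. It only does the index and mask arithmetic; it does not call GetLocaleInfoW.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 123 of the 128-bit Unicode subset bitfield:
   "Layout progress: horizontal from right to left". */
#define USB_BIT_RTL 123u

/* Hypothetical helpers for illustration only. */

/* Index of the bit when the field is viewed as four 32-bit DWORDs (lsUsb[]). */
static unsigned usb_dword_index(unsigned bit) {
    assert(bit < 128u);
    return bit / 32u;
}

/* Mask of the bit inside that DWORD. */
static uint32_t usb_dword_mask(unsigned bit) {
    assert(bit < 128u);
    return (uint32_t)1 << (bit % 32u);
}

/* Index of the bit when the same bytes are viewed as 16-bit WCHARs
   (assumes little-endian storage). */
static unsigned usb_wchar_index(unsigned bit) {
    assert(bit < 128u);
    return bit / 16u;
}

/* Mask of the bit inside that WCHAR. */
static uint16_t usb_wchar_mask(unsigned bit) {
    assert(bit < 128u);
    return (uint16_t)(1u << (bit % 16u));
}
```

For bit 123 the arithmetic lands on DWORD index 3 with mask 0x08000000 and on WCHAR index 7 with mask 0x0800, which is what the two snippets above test.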

What I find most impressive about the wide array of the different attempts people made is

Do you have a favorite? :-)

 

This post brought to you by "ת" (U+05ea, a.k.a. HEBREW LETTER TAV)


# Maurits on Friday, March 03, 2006 11:25 AM:

Huh?  Surely NO language uses bidirectional text.  Locales are either left-to-right, which is one direction; or right-to-left, which is also one direction.

# Michael S. Kaplan on Friday, March 03, 2006 11:28 AM:

Actually, if you look at Hebrew, numbers are actually LTR. And there is always *some* other text shown or entered in a locale (file name extensions, if nothing else!), and in that case you are dealing with a locale that expects bidirectional text....

# Nick Lamb on Friday, March 03, 2006 12:30 PM:

There are two levels of confusion here, because of course there's no such thing as a "Bi-directional" locale, only bi-directional text. Firstly Microsoft has this system-wide confusion because they artificially override UAX #9 paragraph level. Michael can probably tell us how many of his examples come from code doing that.

So there's a lot of code which is doing something that's fundamentally wrong, no matter how it goes about it, and that code just shouldn't exist. From the sounds of things that's not going to get fixed this decade (e.g. in Vista), and so users who care about it will (as with other things Michael's mentioned like the correct Unicode encoding of currency symbols or working font fallback) have to migrate away.

Secondly, there's a smaller scale confusion, when trying to figure out what the user's expectations are, for example at the start of a new document, should the cursor appear on the right or the left? This is a locale issue, and it does need some code, but naming the predicate "isBiDi?" or similar in this context betrays a lack of understanding. The real question is "Direction?" and the answers are either LTR (cursor starts on the left) or RTL (starts on the right).

You can see all of this done right elsewhere, just not in Windows.

# Michael S. Kaplan on Friday, March 03, 2006 12:59 PM:

Hi Nick,

I understand you have strong feelings about the particular way that Microsoft implements UAX #9 (by now I am sure anyone who reads the comments on this blog understands *that*), the fact that UAX #9 has been moving closer to Microsoft's implementation in properties of course notwithstanding.

But I am pretty sure any developer who deals with bidi understands the fact that knowing whether to expect large amounts of RTL and Bidi text by default, and further that talking about a "bidi locale" is just shorthand for for that. and it is not confusing to anyone really, except for people who choose to be confused.

It is something seen beyond Windows, as you mention, but mainly because this is understood by a LOT of people. And it is only confusing to us if we allow it to be. The only platforms that do not have this understanding are the ones that do not support Bidi anyway.... :-)

As for the intent of the code being wrong, there is honestly no way to guess that from the samples since they do not explain where the locale is coming from -- I gave examples of the complexities there in http://blogs.msdn.com/michkap/archive/2006/02/08/527375.aspx and other posts, and that is really understandable, too -- choosing the locale upon which to make such decisions is a complex issue.

For the record, in most cases the intent and use of the function was actually correct, it was only the method by which the function worked that was wrong. So if you want an excuse to migrate away from Microsoft products you'll have to find another one. :-)

# Maurits on Friday, March 03, 2006 2:10 PM:

> numbers are actually LTR

Really?  Or are they little-endian?  How do Hebrew users write numbers, or read numbers allowed?

What about grouped numbers, like phone numbers?  Do they read the groups RTL and the numbers in each group LTR?

Oy, I'm confused.

Exercise: Translate the title of this song into a Hebrew locale...

http://en.wikipedia.org/wiki/867-5309/Jenny

# Serge Wautier on Friday, March 03, 2006 2:24 PM:

IIRC, the font signature method was not available on NT4 (don't remember for 9x).
Therefore, the good Dr Intl recommends adding a platform check and falling back to comparing PRIMARYLANGID() with LANG_HEBREW, LANG_ARABIC and LANG_FARSI on pre-W2K platforms.

http://www.microsoft.com/globaldev/DrIntl/columns/003/default.mspx#EDAA
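
A rough sketch of the check-then-fall-back shape described in this comment, written as portable C so it builds without Windows headers. Everything prefixed `sk_` or `SK_` is hypothetical; the constants are redefined here with the values of the corresponding winnt.h language IDs, and the caller is assumed to pass the four lsUsb DWORDs when the platform returned a locale signature, or NULL when it did not.

```c
#include <stddef.h>
#include <stdint.h>

/* Values of the winnt.h primary language IDs, redefined for this sketch. */
#define SK_LANG_ARABIC 0x01u
#define SK_LANG_HEBREW 0x0du
#define SK_LANG_FARSI  0x29u

/* Same arithmetic as the PRIMARYLANGID() macro: the low 10 bits. */
static unsigned sk_primary_langid(uint32_t lcid) {
    return lcid & 0x3ffu;
}

/* usb: the four lsUsb DWORDs if a locale signature is available, else NULL. */
static int sk_is_rtl_locale(uint32_t lcid, const uint32_t *usb) {
    if (usb != NULL) {
        /* Preferred path: bit 123, "horizontal from right to left". */
        return (usb[3] & 0x08000000u) != 0;
    }
    /* Downlevel fallback: the short hard-coded list from the comment. */
    switch (sk_primary_langid(lcid)) {
        case SK_LANG_ARABIC:
        case SK_LANG_HEBREW:
        case SK_LANG_FARSI:
            return 1;
        default:
            return 0;
    }
}
```

The fallback list is only as complete as the three languages named in the comment, so it shares the weakness of the hard-coded examples in the post and is best confined to platforms where nothing better exists.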

# Maurits on Friday, March 03, 2006 2:34 PM:

"read numbers allowed" => "read numbers aloud"

# Nick Lamb on Friday, March 03, 2006 5:17 PM:

“But I am pretty sure any developer who deals with bidi understands the fact that knowing whether to expect large amounts of RTL and Bidi text by default, and further that talking about a "bidi locale" is just shorthand for for that. and it is not confusing to anyone really, except for people who choose to be confused.”

This paragraph seems incomplete or has some typographical problem. I guess the intent is to say that Microsoft developers are testing for a "bidi locale" purely as an optimisation. If that's right, can you give a brief example of such an optimisation from real code, Michael?

If that's not what you meant, can you rephrase?

# Michael S. Kaplan on Friday, March 03, 2006 5:56 PM:

I am saying that the question these functions try to answer is a valid one that is only confusing to those who wish to be confused. The locale model is real and is often used to tag resources and content, and the return of this function is in no way a Microsoft-specific construct or a confusing construct.

And the question is of use to people outside of Microsoft.

Your opinions about Microsoft a-la-UAX #9 are noted, as always, but are not relevant to the current discussion. :-)

# Mihai on Friday, March 03, 2006 9:25 PM:

The real problem is "Windows 2000 and later"

As Serge noticed, even Dr Intl recommends one of the (now) wrong methods above. What was one supposed to do, except follow the best practices of the time?

I don't say "this should have been there in Win 3.0", but if the code is old, it is not fair to pick on it too much.

In fact, once in a while I see some of my old code. And, man, is it crappy. But this is good. If today I wrote code like I did 3 years ago, it would mean I had learned nothing.

# Michael S. Kaplan on Saturday, March 04, 2006 2:26 AM:

Well, to be honest, I actually think the GetStringType solution is a better (and more scalable) one than hard-coding LCIDs, but that is just one man's opinion....

# Shoshannah on Saturday, March 04, 2006 5:01 AM:

Maurits- Numbers in Hebrew are read LTR, even phone numbers (which are grouped) are read LTR.
I'm looking at a (Hebrew) ad for Domino's Pizza which happens to be next to me- it has:
HEBREW HEBREW HEBREW 1-700-70-70-70 HEBREW HEBREW
The Hebrew is RTL, the number is read LTR.

# Ben Yeomans on Wednesday, May 03, 2006 8:50 AM:

What about Divehi (locale ID 0x0465)?

This uses the Thaana script which is written Right-to-Left, but your recommended method does not recognise this as an RTL locale.

There is also a bug in the second version of the recommended method. It should be:
       (localesig.lsUsb[3] & 0x8000000))

Personally I like the GetStringType solution, especially as I am working mainly with Windows CE which doesn't appear to support the recommended method. I have adapted it slightly to scan the whole string, not just the first character, and changed the test from ( ctype & C2_RIGHTTOLEFT ) to ( ctype == C2_RIGHTTOLEFT ).
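
The switch from `&` to `==` matters because the CT_CTYPE2 results are enumerated classes rather than bit flags, so a bitwise test against C2_RIGHTTOLEFT (2) also fires for classes such as C2_EUROPENUMBER (3). Here is a small portable sketch of the scan-the-whole-string idea. The `SK_` constants copy the winnls.h values so it builds anywhere, the function names are hypothetical, and the input is assumed to be the per-character array that GetStringType would have filled in.

```c
#include <stddef.h>
#include <stdint.h>

/* Values of the CT_CTYPE2 classes in winnls.h, redefined for this sketch. */
enum {
    SK_C2_LEFTTORIGHT  = 0x0001,
    SK_C2_RIGHTTOLEFT  = 0x0002,
    SK_C2_EUROPENUMBER = 0x0003
};

/* Scan every character class; report 1 only on an exact strong-RTL match. */
static int sk_has_strong_rtl(const uint16_t *ctypes, size_t count) {
    for (size_t i = 0; i < count; ++i) {
        if (ctypes[i] == SK_C2_RIGHTTOLEFT) {
            return 1;
        }
    }
    return 0;
}

/* The first-character bitwise test from the post, kept for comparison. */
static int sk_first_char_bitwise(const uint16_t *ctypes, size_t count) {
    return count > 0 && (ctypes[0] & SK_C2_RIGHTTOLEFT) != 0;
}
```

With a string made up only of European digits, the bitwise version reports right-to-left while the exact-match scan does not. With a left-to-right character followed by a right-to-left one, only the scan finds it.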

