Arabic? English? Both? Neither?

by Michael S. Kaplan, published on 2010/04/06 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/04/06/9989320.aspx


David's question was:

Team,

Is there a way to detect if there are a mixture of English and Arabic, only English, or only Arabic characters in a String?
If there is some .NET API for detecting this, or some known algorithm, we’d greatly appreciate knowing about it.
We’re in desperate need of something to detect these scenarios!

We’ve seen the IsRightToLeft Member of the System.Globalization.TextInfo Type, but this doesn’t detect what we’re truly after.

Thank You,

He is right: IsRightToLeft is not the right thing to call here.

That property describes a culture's writing system, not a specific string.

Colleague Tom Moore recommended one of my blogs:

Does this help?  http://blogs.msdn.com/michkap/archive/2007/01/06/1421178.aspx

For example, in RTL UI, if the string has some characters in it with System.Globalization.BidiCategory == LeftToRight, but does not have any characters in it with System.Globalization.BidiCategory == RightToLeft, then you might consider setting the span dir to ltr.

Now the managed world has no other answers here, but in the native world there is some hope!

In the NLS API, the GetStringScripts function (for Vista and later), or the DownlevelGetStringScripts function (with the downlevel NLS library on XP or Server 2003) will tell you what scripts are in the string.

So you can't get the language per se, but checking the script list the function returns for Latn and/or Arab makes the requested assessment easy....

And if you need it in managed code, then it is a mere p/invoke away (though this one isn't on pinvoke.net -- yet?):

    [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true)]
    internal static extern int GetStringScripts(uint dwFlags,
                                                string lpString,
                                                int cchString,
                                                StringBuilder lpScripts,
                                                int cchScripts);

and you can go from there....
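
One way to go from there, offered as a rough sketch rather than a tested implementation: call the function twice (first with a zero-sized buffer to learn the required length), then look for Latn and Arab in the semicolon-delimited list that comes back. The names ScriptProbe, GetScripts, Classify and ScriptMix are made up for illustration, and the code assumes Vista or later, where kernel32.dll exports GetStringScripts.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;
using System.Text;

// Hypothetical result type; "Latin" rather than "English" because the API reports scripts, not languages.
public enum ScriptMix { Neither, LatinOnly, ArabicOnly, Both }

public static class ScriptProbe
{
    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, ExactSpelling = true, SetLastError = true)]
    private static extern int GetStringScripts(uint dwFlags, string lpString, int cchString,
                                               StringBuilder lpScripts, int cchScripts);

    // Returns the raw script list, for example "Arab;Latn;" (Windows Vista and later only).
    public static string GetScripts(string text)
    {
        if (string.IsNullOrEmpty(text)) return string.Empty;

        // A zero-sized buffer asks for the required size, terminating null included.
        int needed = GetStringScripts(0, text, text.Length, null, 0);
        if (needed == 0) throw new Win32Exception(Marshal.GetLastWin32Error());

        StringBuilder scripts = new StringBuilder(needed);
        if (GetStringScripts(0, text, text.Length, scripts, needed) == 0)
            throw new Win32Exception(Marshal.GetLastWin32Error());
        return scripts.ToString();
    }

    // Pure helper: interprets a script list such as "Arab;Latn;".
    public static ScriptMix Classify(string scripts)
    {
        string[] names = scripts.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        bool latin = Array.IndexOf(names, "Latn") >= 0;
        bool arabic = Array.IndexOf(names, "Arab") >= 0;
        if (latin && arabic) return ScriptMix.Both;
        if (latin) return ScriptMix.LatinOnly;
        if (arabic) return ScriptMix.ArabicOnly;
        return ScriptMix.Neither;
    }
}
```

With this in place, ScriptProbe.Classify(ScriptProbe.GetScripts(someString)) would answer the original question. Any other scripts in the list are simply ignored by this sketch.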


Pavanaja U B on 6 Apr 2010 8:16 AM:

Once I had a similar requirement. User is supposed to enter his name in a textbox in Hindi. I enabled the Hindi IME via .NET, the moment the cursor is placed inside the textbox. But what if he changes the IME to English and starts entering in English? I just checked every character's Unicode value and just rejected all those whose Unicode value did not fall in the Hindi codepage.

-Pavanaja
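
A per-character check of that kind could look roughly like the sketch below, shown here for the English/Arabic question rather than Hindi. The ranges are an assumption about what should count: A-Z and a-z for English, and the Unicode blocks Arabic, Arabic Supplement, and Arabic Presentation Forms-A and -B for Arabic. The names RangeCheck and Classify are illustrative only.

```csharp
public static class RangeCheck
{
    private static bool IsEnglishLetter(char c)
    {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    // Assumed definition of "Arabic": any code point in these four blocks,
    // which includes Arabic punctuation and digits, not only letters.
    private static bool IsArabic(char c)
    {
        return (c >= '\u0600' && c <= '\u06FF')   // Arabic
            || (c >= '\u0750' && c <= '\u077F')   // Arabic Supplement
            || (c >= '\uFB50' && c <= '\uFDFF')   // Arabic Presentation Forms-A
            || (c >= '\uFE70' && c <= '\uFEFC');  // Arabic Presentation Forms-B (stops before U+FEFF, the byte order mark)
    }

    // Returns "English", "Arabic", "Both" or "Neither".
    public static string Classify(string text)
    {
        bool english = false, arabic = false;
        foreach (char c in text)
        {
            if (IsEnglishLetter(c)) english = true;
            else if (IsArabic(c)) arabic = true;
        }
        if (english && arabic) return "Both";
        if (english) return "English";
        if (arabic) return "Arabic";
        return "Neither";
    }
}
```

For example, RangeCheck.Classify("English العربية") returns "Both", while a string of only digits and spaces returns "Neither".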

John Cowan on 6 Apr 2010 11:58 AM:

If the true issue is (as I suppose) "Is the string L2R, R2L, or bidirectional?" rather than literally "Are there English characters, Arabic characters, or both?", then the managed-world approach is better IMHO.  There are in fact quite a few R2L-predominant scripts in Unicode, and it's better not to look for specific ones like "Arab", since the same issues will apply to "Hebr", "Syrc", "Thaa", "Nkoo", "Cprt", "Phnx", "Khar", and (still in the pipeline) "Merc".

Michael S. Kaplan on 6 Apr 2010 12:50 PM:

One could always use GetStringTypeW for Bidi character status....
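
A hedged sketch of that idea: ask GetStringTypeW for CT_CTYPE2 information and check whether any character comes back as strongly left-to-right or strongly right-to-left. The constant values are the ones from winnls.h; the names BidiProbe, GetDirection, Summarize and Direction are invented for this example, and how each character is classified is up to the Windows tables, not this code.

```csharp
using System;
using System.ComponentModel;
using System.Runtime.InteropServices;

// Hypothetical result type for this sketch.
public enum Direction { Neutral, LeftToRight, RightToLeft, Mixed }

public static class BidiProbe
{
    private const uint CT_CTYPE2 = 0x00000002;       // request bidirectional layout info
    private const ushort C2_LEFTTORIGHT = 0x0001;    // strong left-to-right
    private const ushort C2_RIGHTTOLEFT = 0x0002;    // strong right-to-left

    [DllImport("kernel32.dll", CharSet = CharSet.Unicode, ExactSpelling = true, SetLastError = true)]
    [return: MarshalAs(UnmanagedType.Bool)]
    private static extern bool GetStringTypeW(uint dwInfoType, string lpSrcStr, int cchSrc,
                                              [Out] ushort[] lpCharType);

    // Windows only: one CT_CTYPE2 value is returned per UTF-16 code unit.
    public static Direction GetDirection(string text)
    {
        if (string.IsNullOrEmpty(text)) return Direction.Neutral;

        ushort[] types = new ushort[text.Length];
        if (!GetStringTypeW(CT_CTYPE2, text, text.Length, types))
            throw new Win32Exception(Marshal.GetLastWin32Error());
        return Summarize(types);
    }

    // Pure helper: CT_CTYPE2 values are enumerated, not bit flags, so compare with ==.
    // Numbers, separators and whitespace are treated as neutral here.
    public static Direction Summarize(ushort[] types)
    {
        bool ltr = false, rtl = false;
        foreach (ushort t in types)
        {
            if (t == C2_LEFTTORIGHT) ltr = true;
            else if (t == C2_RIGHTTOLEFT) rtl = true;
        }
        if (ltr && rtl) return Direction.Mixed;
        if (ltr) return Direction.LeftToRight;
        if (rtl) return Direction.RightToLeft;
        return Direction.Neutral;
    }
}
```

Because it keys on directionality rather than on a named script, this covers Hebrew, Syriac and the other right-to-left scripts as well as Arabic.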

Michael S. Kaplan on 6 Apr 2010 5:17 PM:

Either way, in this specific case it was an "only English and/or Arabic" situation. :)

Jan Goyvaerts on 7 Apr 2010 12:03 AM:

If the regular expression [a-zA-Z] finds a match in the string, then it contains at least one English letter.  If the regex \p{IsArabic} finds a match then the string contains at least one Arabic character.  Or, if the regex [\p{IsArabic}-[\P{L}]] finds a match then the string contains at least one Arabic letter.  The latter two regexes assume you're using the .NET regex flavor.

See http://www.regular-expressions.info/unicode.html for details on matching Unicode characters with regular expressions.
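
As a minimal sketch of how those patterns might be wired up in .NET (the class and method names are hypothetical, and only the basic Arabic block U+0600-U+06FF is covered, since that is what \p{IsArabic} matches):

```csharp
using System.Text.RegularExpressions;

public static class RegexCheck
{
    // At least one ASCII letter.
    private static readonly Regex EnglishLetter = new Regex("[a-zA-Z]");

    // Arabic block minus everything that is not a letter, using .NET character class subtraction.
    private static readonly Regex ArabicLetter = new Regex(@"[\p{IsArabic}-[\P{L}]]");

    // Returns "English", "Arabic", "Both" or "Neither".
    public static string Classify(string text)
    {
        bool english = EnglishLetter.IsMatch(text);
        bool arabic = ArabicLetter.IsMatch(text);
        if (english && arabic) return "Both";
        if (english) return "English";
        if (arabic) return "Arabic";
        return "Neither";
    }
}
```

Swapping the second pattern for \p{IsArabic} would also count Arabic punctuation and digits, not just letters.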

Michael S. Kaplan on 7 Apr 2010 12:34 AM:

That will likely end up taking longer. :-)

Michael S. Kaplan on 7 Apr 2010 7:47 AM:

Or, as Jamie Zawinski put it:

Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems.

Jan Goyvaerts on 8 Apr 2010 6:15 AM:

Why would the regex take longer?  For somebody familiar with regular expressions, the regex is much faster to write than interpreting the results of GetStringScripts.  Unless millions of strings need to be checked in a tight loop, the performance of the code is irrelevant (both solutions will be fast enough).

The people who have two problems are those applying a solution they don't understand to a problem where it isn't appropriate.  It can be said about anything.  Jamie was ranting against Perl and repurposed an older quote about sed.  See http://regex.info/blog/2006-09-15/247 and http://regex.info/blog/2006-09-15/247#comment-3085

For the stated purpose of finding out whether a string contains English characters, Arabic characters, or both, the two regular expressions will do the job just fine.  I did not say that regular expressions are the solution to all language analysis problems.

Jan Kučera on 18 Apr 2010 7:19 AM:

Not sure if I am way off since nobody mentioned it, but don't we have the cool Extended Linguistic Services now in Windows 7? The Windows API Code Pack even provides .NET wrappers for them.

Modifying the included sample a bit:

MappingService scriptDetection = new MappingService(MappingAvailableServices.ScriptDetection);

using (MappingPropertyBag bag = scriptDetection.RecognizeText("English العربية", null))
{
   MappingDataRange[] ranges = bag.GetResultRanges();
   Console.WriteLine("Recognized {0} script ranges", ranges.Length);

   NullTerminatedStringFormatter formatter = new NullTerminatedStringFormatter();

   foreach (MappingDataRange range in ranges)
   {
       Console.WriteLine("Range from {0} to {1}, script {2}", range.StartIndex, range.EndIndex, range.FormatData(formatter));
   }
}

outputs

Recognized 2 script ranges

Range from 0 to 7, script Latn

Range from 8 to 14, script Arab

Isn't this what is needed?

Moreover, I think that ELS can tell if the text is in English, not only Latin... not sure what David's scenario is, though.

Michael S. Kaplan on 18 Apr 2010 8:33 AM:

The original question had an XP requirement, so it didn't come up. Though it is not a bad idea at all. :-)

David Hardy on 3 May 2010 10:32 AM:

@Michael S. Kaplan

As a P/Invoke novice, I'm trying to understand why you chose "StringBuilder lpScripts" instead of "String lpScripts".

I know it's off topic, but still...

Michael S. Kaplan on 3 May 2010 11:41 AM:

I actually grabbed it from the MSDN site, but I usually prefer StringBuilder to String for out params....

