Normalization vs. .NET text elements

by Michael S. Kaplan, published on 2005/04/30 03:39 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/30/413676.aspx


Ok, here is the updated code for that internationally savvy palindrome checker. It supports that interesting situation with ligatures like (U+fb02, a.k.a. LATIN SMALL LIGATURE FL) vs. an lf on the far side, originally suggested by our old friend Maurits (with comments):

//////////////////////////////////////////
//
// IsPalindrome
//
// in : a string
// out: true if the string passed in is a palindrome.
//
// NOTES: This function handles both canonical/compatibility
//        equivalences and grapheme clusters (a.k.a. text
//        elements) as defined by the Unicode Standard.
//
//////////////////////////////////////////
bool IsPalindrome(string st) {

    // A null string or a ZLS is not a palindrome
    if ((null == st) || (0 == st.Length)) return false;

    // Convert to NFKC and set up the text element detection object
    StringInfo si = new StringInfo(st.Normalize(NormalizationForm.FormKC));
    int count = si.LengthInTextElements;

    for (int i = 0; i < (count / 2); i++) {
        // get the text elements for comparison
        string st1 = si.SubstringByTextElements(i, 1);
        string st2 = si.SubstringByTextElements(count - i - 1, 1);

        // see if the text elements on each side are linguistuically equivalent
        if (CultureInfo.CurrentCulture.CompareInfo.Compare(st1, st2) != 0) {
            // they are not, so it is not a palindrome. 
            return(false);
        }
    }

    // both ends appear to be equivalent; it is a palindrome.
    return (true);
}

Now, Maurits went on in his Channel 9 posting to discuss sort elements, or cases when two or more characters are to be given a single sort weight (kind of the opposite of an expansion like these ligatures). However, in my opinion these are not really suitable for a palindrome detection algorithm, as I don't think they are usually treated as letters except in the case where they are also treated as unique text elements (the case covered by the StringInfo code).

Any native speakers of languages with such constructs as the Spanish ch and the Hungarian dzs who think they should or should not be treated as a unit in trying to detect palindomosity should feel free to leave a comment to that effect. Also, if any of my collagues in the GIFT group agree or disagree here (and they are reading this!) they are invited to do the same (or stop me in the hall and accost me with this information!).

If I mistaken on this point, then a very interesting problem develops since there is really not an easy method for detecting such cases given the current collation function set (although I can imagine a few avenues of attack and we obviously have the underlying data if we had to support an IsPalindome function in Win32 NLS!). This might even make an interesting interview question one day for a very talented candidate.... :-)

Now, aside from all that, it is important to note that normalization makes some uses of text elements in this context completely unneccessary -- after all, either technology will treat U+0061 U+030a (a + combining ring) as identical to U+00e5 (a ring), one by conversion and the other by giving identical sort weights. Therefore, there is some overlap between the two technologies. However, there are some differences:

So it is fair to say that both technologies as provided in the Whidbey release are potentially useful in (of all things) the detection of palindromes, today and tomorow. I am sure the people who spec'ed, developed, and tested these feature are very proud for the technologial advances!

 

This post brought to you by "แซ์" (U+0e41 U+0e0b U+0e4c, a.k.a. THAI CHARACTERS SARA AE + SO SO + THANTHAKHAT)
The last two codepoints of which make up a text element, but all three of which make up a unique sort element!


# Krisztian Pinter on 30 Apr 2005 11:46 AM:

Just a quick note, somewhat off topic.

In Hungarian, there is no dsz letter. Here is the complete list of composed letters:

cs ccs dz ddz dzs gy ggy ly lly ny nny sz ssz ty tty zs zzs

Those that starts with double characters (ssz), are actually doubles (szsz -> ssz). The only true three character letter is dzs. I don't know if it has a double like 'ddzs' because we have no words with this letter. But it can exist theoretically.

# Michael S. Kaplan on 30 Apr 2005 11:51 AM:

Yep, thats what I thought -- these sort elements are not thought of as letters in and of themselves. ;-)

I'll talk more about Hungarian another day; the requirements for "double compressions" added some interesting challenges for both Windows and the .NET Framework....

# Michael S. Kaplan on 30 Apr 2005 11:52 AM:

(re-reading your post)

Or are you saying that the other cases ARE thought of as letters?

# krisztian pinter on 2 May 2005 11:33 AM:

hungarian ABC is like this

A Á B C CS D DZ DZS E É ...

so the answer is, CS has the same rights as a letter, as C or F. it is one letter.

it sorts like a letter. CS comes later then C-Z

it hyphenates like a letter: never separated

so if you want me to be absolutely happy, you treat words of pattern "sz...sz" palindromes.

# Michael S. Kaplan on 2 May 2005 11:47 AM:

Ah, that is an interesting development. It certainly puts a cat among the collation pigeons!

I'll give it a little thought, maybe post back with a comment or a post sometime soon. It is an interesting case where the data is present but is not really provided in a usable form for this type of inquiry....

# Maurits [MSFT] on 2 May 2005 12:15 PM:

How does your code handle symbols? Suppose I have an asymmetrically-symboled palindrome like

Madam, I'm Adam

Will this compare:
M to m - the same
a to a - the same
d to d - the same
a to A - the same
m to " " - different! Not a palindrome.

Or does " " not count as a text element?

# Maurits [MSFT] on 2 May 2005 12:28 PM:

I see...

What I ended up doing is allowing checkboxes for the various CompareOptions, with all of them checked by default (Ignore Case, Ignore Symbols, Ignore Kana, ...)

Then I would skip text elements that were "the same as" an empty string in my iterator. By "the same as" I mean that Compare() returned 0.

# Michael S. Kaplan on 2 May 2005 12:32 PM:

If you want case insensitivity, you'd have to pass the flag for it in that Compare call (CompareOptions.IgnoreCase).

A space is indeed a text element, or even part of one if you combine it with a non-spacing character....

As it stands right now, spaces would flummox the code, you'd have to add something to keep counters on both ends, skip elements that did not meet particular criteria (such as symbols or whitespace) and such.

Easy enough with the new CharUnicodeInfo class on the text element and the code points within it....

# Maurits [MSFT] on 2 May 2005 12:38 PM:

I should say that I skipped all textelements "e" where

Compare(e, "", co) == 0

for the particular CompareOptions "co" chosen by the caller.

# Tanveer Badar on 20 Dec 2007 9:25 AM:

Why didn't you do this

if ( string.IsNullOrEmpty( st ) ) return false;

instead of

if ((null == st) || (0 == st.Length)) return false;

Or are there any internationalization issues with IsNullOrEmpty?

# Michael S. Kaplan on 20 Dec 2007 11:46 AM:

No, it is just a choice to not push stack by mking it a method call. We all live in a world where we choose what code we will write (or if/when we will write code!).


Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2005/06/15 Once more into the palindrome

go to newer or older post, or back to index or month or day