Once more into the palindrome
by Michael S. Kaplan, published on 2005/06/15 05:21 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/06/15/429279.aspx
Back in April, in a series of posts, I talked about many cool features in Whidbey, from comparison (been there all along) to text elements (been there all along, better in Whidbey) to normalization (new in Whidbey), all in the context of one of the most important features that any computer program can possibly support -- palindrome detection:
Looking for that internationally savvy palindrome checker....
Where did the new StringInfo stuff come from?
Normalization vs. .NET text elements
It was all very cool stuff, and a lot of fun to do.
And then I blew it.
In that last post, I said:
Now, Maurits went on in his Channel 9 posting to discuss sort elements, or cases when two or more characters are to be given a single sort weight (kind of the opposite of an expansion like these ligatures). However, in my opinion these are not really suitable for a palindrome detection algorithm, as I don't think they are usually treated as letters except in the case where they are also treated as unique text elements (the case covered by the StringInfo code).
Any native speakers of languages with such constructs as the Spanish ch and the Hungarian dzs who think they should or should not be treated as a unit in trying to detect palindomosity should feel free to leave a comment to that effect. Also, if any of my collagues in the GIFT group agree or disagree here (and they are reading this!) they are invited to do the same (or stop me in the hall and accost me with this information!).
Well, the good news is that no one accosted me.
The bad news is that some people posted that I was wrong, and colleagues informed me that I was wrong. Those 'sort elements' (for lack of a better term) would be taken as a unit, just as the text elements are.
Damn. I hate being wrong.
Especially when I pointed out that there was no good way pick up these 'sort element' compressions, where two or more UTF-16 code points are given the weight of a single element. We have the data, sure. But we do not currently excpose it in a convenient way to answer this particular question.
And to top it all off, there is no good way to reverse a sort key, though the sort key obviously contains the information about when these compressions (known in the Unicode Collation Algorithm as 'contractions').
I stewed on this for a bit today, after finally deciding to tackle this post. There had to be some way short of a new Win32 or .NET function to make this work.
So I closed my eyes and told myself to pretend I was interviewing for a job at Microsoft. :-)
Then something suddenly occurred to me!
We do not care about reversing or undoing the sort key; we only care about comparing the pieces of each sort key! And we can do that, we have the bytes, so why not just compare them?
Let's take a hideously contrived string to test for palindromodity: abddzsCÅbÅCdzsdzsba
The code points are:
0061 0062 0064 0064 007a 0073 0043 00c5 0062 0041 030a 0043 0064 007a 0073 0064 007a 0073 0062 0061
Note especially the Hungarian dzs double compression ('ddzs' == 'dzsdzs') and the A Ring in both precomposed and composite forms (code points marked in bold).
We grab the sort key:
string st = "abddzsCÅbÅCdzsdzsba";
CompareInfo ci = CompareInfo.GetCompareInfo("hu-HU");
byte rgbyt = ci.GetSortKey(stIn, CompareOptions.None).KeyData
and then take a look at what those bytes look like:
0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e 0e 09 0e 02 01 02 02 02 02 02 1a 02 1a 01 02 02 02 02 12 12 02 12 12 01 01 00
Let's split it into the various pieces:
UW: 0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e 0e 09 0e 02 01
DW: 02 02 02 02 02 1a 02 1a 01
CW: 02 02 02 02 12 12 02 12 12 01
SW: 01 00
Now the rules are very simple for the detection:
- Strip out the 00 and 01 byte values, those are just sentinels;
- The UW values will always be two bytes per sort element, and will always therefore be an even number;
- The DW/CW/SW weights can be prefixed by 02 byte values. You can ignore them, and assume that identical weights exist in suffix form if the weight ends early (02 is the minimal weight and is just used to pad the piece of the sort key when are looking further into the string).
So, looking at our sort key:
UW: 0e 02 0e 09 0e 1e 0e 1e 0e 0a 0e 02 0e 09 0e 02 0e 0a 0e 1e 0e 1e 0e 09 0e 02
DW: 02 02 02 02 02 1a 02 1a 02 02 02 02 02
CW: 02 02 02 02 12 12 02 12 12 02 02 02 02
Yes folks, we have a palindrome! And the method to do so is using code that has been around since NT 3.1.
Now if you really want to, you can normalize the string to take care of some of the issues brought up in the earlier posts.
The code to sort through the byte array and verify the results is left as an exercise for the reader. If someone had figured this out in an interview, they would have impressed the hell out of me; I am temporarily way too impressed with myself for thinking of it.... :-)
I will do some more "sort key cracking" another day, if there is interest. I have only ever known on one team inside or outside of Microsoft that has ever had a scenario that required cracking sort keys, but perhaps I have just not talked to everyone yet. I have run across many people who have found it to be an interesting area, even if they could not find an immediate use for the knowledge.
This post brought to you by "﴿" (U+fd3f, a.k.a. ORNATE RIGHT PARENTHESIS)
# Atsushi Enomoto on 15 Jun 2005 9:01 AM:
Hello. I doubt that 00 and 01 byte values are always just sentinels and UW are always two values. For example SortKey data for "U+1113" is 58 F7 FF 07 FF 00 FF 00 01 01 01 01 00 (a TextElementEnumerator created from "U+1113" returns just one string that contains single U+1113).
# Michael S. Kaplan on 15 Jun 2005 10:09 AM:
"Hello. I doubt that 00 and 01 byte values are always just sentinels"
They are. I know -- In the table that contains the weights, 2 is the smallest weight value ever given -- 1 must be reserved for the sentinels.
"and UW are always two values."
It is -- there are times that many characters mean one UW and many times that one character means many UW values. But the number of weights there will always be even and they always are two bytes each....
"For example SortKey data for "U+1113" is 58 F7 FF 07 FF 00 FF 00 01 01 01 01 00 (a TextElementEnumerator created from 'U+1113' returns just one string that contains single U+1113)."
Yes -- and Jamo are often getting more than one UW -- part of the state table logic for support of Old Hangul.
I did forget that the algorithm for the Old Hangul will occasionally insert single 0 bytes in there. But it will never insert a 1. It does not look like it is possible with valid Hangul syllables, only with incomplete Jamo sequences that do not make up a valid syllable....
This technically ought to be considered a bug anyway (some implementations assume the byte array is NULL terminated since it is doc-ed as such. I think I will consider this a bug and investigate it as such, in any case.
Another bug found through blogging? :-)
# serras on 15 Jun 2005 2:55 PM:
Just as a comment: since a few years ago, ch is no longer treated as a special letter (that followed c), but as C+H. The same goes to ll: it used to go after the l, now it is treated as L+L in dictionaries.
This tradicional sort has completely been replace, at least in Spain. However, I think in other places where Spanish is spoken (Latin America) it already is used.
But RAE (Royal Spanish Academy: the institution that speaks about our language) in conjuntion with the other Spanish Languages Acdemies in the world (there's even one in USA), are changung it.
# Michael S. Kaplan on 15 Jun 2005 3:45 PM:
Hi serras --
Except if you choose the Traditional Spanish sort (the modern Spanish one works as you say it does; the traditional one works the old way).
# Ruben on 15 Jun 2005 4:58 PM:
Unsurprisingly, the Dutch word "kijk" (look) is not seen as a palindrome. But that's probably because the ij is neither a text element, nor does it have any special sorting rules.
Having said that, word games always treat the ij as a single letter. For example, a cross word puzzle will insist that ijsvrij consists of 5 letters (or boxes). I'm pretty sure that also happens with palindromes (kijk, pijp, sijs, neppijpen, nijlkoortsmeetsysteemstrooklijn). Okay, so I looked some up.
Note however, that in your demo app, substituting ij (U+0069 U006a) with ĳ (U+0133) does seem to make it work, even with split ligatures set. Weird. But then again, there's no way to access that digraph/ligature from a keyboard, so nobody uses it anyway.
# Maurits [MSFT] on 15 Jun 2005 6:43 PM:
I believe that the letter "y" historically arises as a ligature for ij:
for more... for example, though dictionaries treat ij as two letters, encyclopediae and phone books treat it as one sort element
# Maurits [MSFT] on 15 Jun 2005 6:46 PM:
In Afrikaans kijk is spelled "kyk" (rhymes with "lake")
# Maurits [MSFT] on 15 Jun 2005 7:09 PM:
I'd have to guess that Windows doesn't consider the ĳ character (U+0133) to be a ligature of i and j... could this be considered a bug? Or is ĳ (U+0133) a letter in its own right, distinct from a ligature of i and j?
# Michael S. Kaplan on 15 Jun 2005 7:48 PM:
U+0133 has a compatibility decomposition to U+0069 U+006a, so the code I put in the third post should actually split it up okay. The Windows table is not as up to date (see http://blogs.msdn.com/michkap/archive/2005/01/31/363701.aspx
for more info on *that* issue; it is one that will be addressed in Longhorn, fwiw).
# Michael S. Kaplan on 15 Jun 2005 10:53 PM:
Hey Maurits --
"for example, though dictionaries treat ij as two letters, encyclopediae and phone books treat it as one sort element"
Interesting. AFAIK we have never gotten a request for a Dutch (Phonebook) sort or anything like that. I am sure we would have no problems with it; we do it now for German (Germany), after all....
# Maurits [MSFT] on 17 Jun 2005 4:41 PM:
This quote is tantalizing:
"... Several other letters of the Cyrillic alphabet were originally ligatures but are now considered to be single letters."
(The Cyrillic alphabet derives from the Greek alphabet)
I wonder if any letter combinations in present-day English are destined to become single letters? The popularity of digital data over handwritten data might make that impossible in this day and age.
# Ruben on 17 Jun 2005 5:37 PM:
Well, I guess it's the 'right' thing for the program to not magically recognize i+j as a single unit. How would it know that Dutch words like bijectie and bijouterieën don't contain a ij? This is demonstrated by their hyphenation points: bi-jec-tie and bi-jou-te-rie-en (note the drop of the diaeresis, or trema in Dutch, when hyphenating).
These exceptions are extremely rare though, and without exception loan words, so by making these exceptions the rule is probably still not really 'right' for native speakers (or typists).
Which shows that even sticking to European languages, means you'll meet many weird linguistic spelling rules that matter to natives, but baffle outsiders.
# Michael S. Kaplan on 17 Jun 2005 8:47 PM:
Hi Ruben --
Ah, that is not an intractable problem, but luckily one not needed by this program since it is not a palindrome. :-)
Seriously though, such cases can only happen via dictionaries either with the one case or the other, depending on the frequency and the scenario....
# Michael S. Kaplan on 17 Jun 2005 8:49 PM:
Maurits, anything is possible -- especially when you consider that fonts would be able to create the ligature as needed, even if they stay encoded as separate characters....
go to newer or older post, or back to index or month or day