If at first you don't succeed, there's probably still a bug

by Michael S. Kaplan, published on 2006/03/30 00:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2006/03/30/564598.aspx

Back near the beginning of the month, when I posted about how Everybody's doing the wraparound...., I talked about how the particular implementation of multiple combining diacritic weights led to an interesting situation on Windows where the secondary, a.k.a. DW or 'diacritic weight', weight would wrap around.

Regular reader Maurits suggested a real problem that seemed to be possible:

Hmmm... I just thought of a potentially serious problem with the third method (wrap.) What if the value wraps to 0xfe or 0xfd? Then when you add 2, you get 01 or (worse?) 00.

In particular:

"e" with 15 tildes will have a DW of 15 * 17 which comes to 255; add 2 and you wrap to 1.

"e" with 5 circumflexes and 12 tildes will have a DW of (5 * 10) + (12 * 17) = 254; add 2 and you wrap to 0.

On the other hand, what are the odds of encountering a string with such a diacritically-loaded character?

This was before he had actually tested things out -- once he did, his report was slightly different:

I was trying to generate an example string that wraparounded (ugh) to 00 or 01, but I couldn't (on Windows 2000)

Maybe I'm trying to solve a problem that doesn't exist?

Well, let's back up a second....

He is indeed right, the sort keys never seem to get created with bogus 0x00 or 0x01 weights inserted (therefore the bug Atsushi Enomoto reported still stands alone as the bug that inserts one of these byte sequences when it is not expected!).

(That was another bug found through blogging!)

But that does not mean there is no bug in this case.

Now I suspected there might be a bug here after Maurits reported the original problem -- and after looking at it tonight I managed to find one....

By carefully applying Tester's Axiom #5 and the knowledge that there is a potentially incorrect issue with the implementation, what can we find out here?

Well, if you look at How to track down collation bugs from late last year, you will notice a couple of things:

Most of the points in that post do not apply, and
Point #12 leads to some interesting results

Take the following sample code:

using System;
using System.Globalization;

namespace SortTest {
    class Class1 {
        [STAThread]
        static void Main(string[] args) {
            CompareInfo ci = CompareInfo.GetCompareInfo(0x007f);
            string stOld= "e";

            for(int i = 0; i < 7; i++) {
                string st = "e\u0323\u0323" + ((char)(0x031d + i)).ToString();
                Console.Write(string.Compare(stOld, st));
                Console.Write("\t\t");
                Console.WriteLine(SortKey.Compare(ci.GetSortKey(stOld), ci.GetSortKey(st)));
                stOld = st;
            }
        }
    }
}

The output of this little function (which should be two rows of numbers, where each row contains two values equal to each other) is worth 1000 words....

-1              -1
-1              -1
-1              1
-1              0
-1              0
-1              -1
-1              -1

What are the sort key values here? They are:

\u0065\u0323\u0323\u031d        0e 21 01 fe 01 01 01 00
\u0065\u0323\u0323\u031e        0e 21 01 ff 01 01 01 00
\u0065\u0323\u0323\u031f        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0320        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0321        0e 21 01 01 01 01 00
\u0065\u0323\u0323\u0322        0e 21 01 03 01 01 01 00
\u0065\u0323\u0323\u0323        0e 21 01 04 01 01 01 00

So, it looks like functions like CompareString (and its managed equivalent) will consistently wrap around and give expected results, while LCMapString for sort keys and its managed equivalent will, in an effort to avoid the problem with embedded invalid bytes will occasionally drop them.

Three bytes are dropped:

0x00 -- reserved for the end of the string;
0x01 -- reserved for a sentinel between subsections of the sort key;
0x02 -- reserved for the minimal weight with no diacritics, which is dropped when it is not needed as a placeholder for later secondary weights that greater than 0x02.

As soon as I read that comment, I knew there was a bug here. :-)

(Yet another bug found through blogging!)

This post brought to you by " ̣" (U+0323, a.k.a. COMBINING DOT BELOW)

# Maurits on Thursday, March 30, 2006 12:01 PM:

I've found a minimal string that exhibits this bug:

"e\x0340\x0345"

0x0340 is COMBINING GRAVE TONE MARK (Vietnamese)
http://www.fileformat.info/info/unicode/char/340/index.htm

0x0345 is COMBINING GREEK YPOGEGRAMMENI (Greek)
http://www.fileformat.info/info/unicode/char/345/index.htm

Sort keys:

e\x0340 sort key: e 21 1 [70] 1 1 1 0

e\x0345 sort key: e 21 1 [92] 1 1 1 0

e\x0340\x0345 sort key: e 21 1 [] 1 1 1 0

(70 - 2) + (92 - 2) + 2 = x100 - wraps to 0.

I notice U+0349 through U+036f don't have any DW at all on W2K.

# Maurits on Thursday, March 30, 2006 12:21 PM:

There are only two diacritic pairs (so far, as of W2K) that wrap to 0, 1, or 2:

e\x340\x345: wraps to 0 (as above)
e\x341\x345: wraps to 1

0x0341 is COMBINING ACUTE TONE MARK (Vietnamese)
http://www.fileformat.info/info/unicode/char/341/index.htm

To wrap to 2 takes at least three diacritics in the same text element.

# Maurits on Thursday, March 30, 2006 12:34 PM:

> functions like CompareString (and its managed equivalent) will consistently wrap around and give expected results

I have to disagree... it's not wrapping at all. If it had wrapped, one of the numbers in the first column would have been 1.

# Maurits on Thursday, March 30, 2006 12:46 PM:

Another way the bug sneaks in is to look at pure-diacritic strings... these don't have the free 0x2 from the base character (e)

These three strings of length two are all incorrectly equal to "":

\x0342\x0346
\x0342\x0347
\x0343\x0346

# Michael S. Kaplan on Thursday, March 30, 2006 1:09 PM:

We may need to develop a max. number of comments per reader? :-)

I should likely have said "wraps cleanly" rather than just "wraps", as this is what the example showed....

As for "diacritic only strings", they are meaningless and therefore prove nothing. :-)

# Maurits on Thursday, March 30, 2006 1:22 PM:

How is CompareString doing the comparison, if it's not using sort keys?

I could see ameliorating the 00/01/02 issue by doing something like

while (dw >= 0x100 || dw <= 0x2) { dw = dw % 0x100 + 03; }

... but the columns would still be different. You'd have:

-1 -1
-1 -1
-1 1 (LCMapString wraps... to 03, now, not to 00)
-1 -1 (to 04, now, not to 01)
-1 -1 (to 05, now, not to 02)
-1 -1

So LCMapString would only be wrong in one place instead of 3.

But how does CompareString get it right??

# Michael S. Kaplan on Thursday, March 30, 2006 1:34 PM:

CompareString is not exactly using sort keys -- what it ias doing is actually very complex, I'll talk about it in a future blog post....

# Maurits on Thursday, March 30, 2006 1:59 PM:

"something like" should be

// need "while" rather than "if"... consider 0x1ff
// it will only iterate twice at most
while (dw >= 0x100) { dw = dw % 0x100 + 0x03 * (dw / 0x100); }

This is as-monotone-as-possible (wraps are as far as possible apart, monotone in between wraps)

# Maurits on Thursday, March 30, 2006 3:02 PM:

BINGO!!

I just wasn't trying hard enough. Here's a string that yeilds a sort key with an embedded 0, due to diacritic weight wrapping.

String: e\x0340\x0345e\x0340
Sort key: (e 21 e 21) 1 [0 70] 1 1 1 0

And here's one with an embedded 1:
String: e\x0341\x0345e\x0340
Sort key: (e 21 e 21) 1 [1 70] 1 1 1 0

There's probably a way to make an embedded 2, but that's not such a big deal.

Note that I had to use my phenomenal psychic ability to decide where to put the []s in the second sort key. ;)

# Maurits on Thursday, March 30, 2006 3:11 PM:

It hit me that LCMapString only dropped TRAILING 02 weights in any category. If it only dropped trailing 02's, it probably only dropped trailing 01's and 00's too. Sho'nuff...

# Michael S. Kaplan on Thursday, March 30, 2006 3:24 PM:

Now you're thinking like a[n international] tester! :-)

# Maurits on Thursday, March 30, 2006 4:00 PM:

Here's a couple sample strings with just tildes and circumflexes. (Note my math is wrong in the second quoted statement in your OP... I was misreading hex as decimal)

string: e(15 tildes)e(tilde)
sort key: (e 21 e 21) 1 [1 13] 1 1 1 0

string: e(1 circumflex)(14 tildes)e(tilde)
sort key: (e 21 e 21) 1 [0 13] 1 1 1 0

(Tongue-in-cheek mode = ON)
Q: What do you get when you combine a base character with a butt-load of diacritics?
A: Sometimes, you get a malformed sort key!

# Maurits on Thursday, March 30, 2006 4:17 PM:

Last one...

string: e(16 circumflexes)e(tilde)
sort key: (e 21 e 21) 1 [2 13] 1 1 1 0

Note this is the same as

string: ee(tilde)
sort key: (e 21 e 21) 1 [2 13] 1 1 1 0

# Michael S. Kaplan on Thursday, March 30, 2006 4:26 PM:

> Last one...

Seems doubtful. :-)

The key here is looking for useful strings, of course.

In both the font example and the sorting examples, I am showing holes based on taking an implementations past their reasonable breaking point.... the nature of bugs here are essentially "do something really unreasonable, and then end up with bad results."

# Maurits on Thursday, March 30, 2006 6:30 PM:

... all right, maybe one more. Or two.

> looking for useful strings

Well, here's the list of the 747 two- and three-diacritic sets that overflow to 0 or 1 when combined with a base element:

http://www.geocities.com/mvaneerde/dw-overflows.txt

This only considers the 72 combining characters from U+0300 to U+0347 because the other combining characters I tried (U+0348 to U+036F) had zero DW.

Considering other characters with diacritic weights would expand the list.

The task is now, it seems, to find one of the 747 triples where all three combining characters are from the same region/discipline*, and all fit on the same character simultaneously. I'm betting that if such a triple exists, it will be from the International Phonetic Alphabet discipline.

* also allowing neutrals, like U+0332 COMBINING LOW LINE (poor man's underline?)

# Maurits on Thursday, March 30, 2006 6:59 PM:

It seems at least one person has posted on how to use combining characters to decorate otherwise plaintext web posts:

How to create s̶t̶r̶i̶k̶e̶t̶h̶r̶o̶u̶g̶h̶ and u̲n̲d̲e̲r̲l̲i̲n̲e̲d̲ text
http://www.epinions.com/content_3826622596

(As I post this, "strikethrough" in the title above is -struck-through-, and "underlined" is _underlined_)

That raises the danger level somewhat.

For example, this almost seems sensible (from list of 747 above:)
229)
e
0x301 COMBINING ACUTE ACCENT
0x332 COMBINING LOW LINE
0x345 COMBINING GREEK YPOGEGRAMMENI
wraps to 1

Greek letters can have accents; and small following iotas;* and if this were in an epinions post by someone who read the underline tutorial, it could be underlined.

If there were another DW character later in the string (or the word,) the sort key for the post (or the word) would have an embedded 1.

If epinions' full-text search engine used sort keys, the bug would be triggered.

* U+1FF4 GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENI http://www.fileformat.info/info/unicode/char/1ff4/index.htm

So here it is: ῴ̲

Make a string with that text element, and another accented character anywhere after it, and you get a sort key with five 1's.

String: \x03c9\x0301\x0345\x0332 \x03c9\x0301
Sort key: (f 19 7 2 f 19) 1 [1 2 e] 1 1 1 0

# Michael S. Kaplan on Thursday, March 30, 2006 7:50 PM:

I thought you were done?

Told you.... :-)

However, strikethough and underlining is for higher level markup, not plain text. And the need to not only misuse Unicode by doing such things but to need to index it with sort keys is a fair indication of piling on misuse of multiple technologies....

It all gets back to that "meaningful string" issue. Truly!

# Maurits on Thursday, March 30, 2006 8:08 PM:

'K, found another one.

615)
e
0x305 COMBINING OVERLINE
0x32c COMBINING CARON BELOW
0x332 COMBINING LOW LINE
wraps to 0

Hear me out.

Suppose someone wants to use a username that contains a COMBINING CARON BELOW. (Why? who knows.)

And let's suppose they stumble onto this "HOWTO: get a box around your name:"
http://www.mpcforum.com/showthread.php?t=105521

And let's further suppose that the caron-below is on any character in their username but the last one.

EVERY username that satisfies those three conditions trips the bug, because all of the characters after the one with the caron have DW.

# Maurits on Thursday, March 30, 2006 8:11 PM:

My name in a box: [M̬̅a̬̅u̬̅r̬̅i̬̅t̬̅s̬̅]

# Maurits on Thursday, March 30, 2006 8:14 PM:

That would have worked out better if I'd used LOW LINE instead of CARON BELOW (oops)

Take two: [M̲̅a̲̅u̲̅r̲̅i̲̅t̲̅s̲̅]

# Michael S. Kaplan on Thursday, March 30, 2006 8:16 PM:

Okay, you have convincingly established what I have already said -- if people do stupid things, then bad things will happen.

Easy fix -- don't do stupid things!

You can find as many examples as you want, but I am not sure it actually helping anything, or making anything more readable or understandable....

Try to focus on ACTUAL TEXT needed for language support, without these IMHO lame websites that give bad advice about trying to put in plain text that which belongs in rich text.

This is what the weights were optimized for....

# Phylyp on Friday, March 31, 2006 5:41 AM:

This is looking too much like a dialog, so for the sake of other readers, I just wanted to enjoy seeing a different name for the commenter!

# Tsuyoshi Ito on Sunday, April 02, 2006 12:59 PM:

Hi,
This looks more than a wraparound issue to me. You are doing a wrong thing when you add "values" of combining characters to calculate a sort key.
For example, two strings U+0063 U+0307 U+030D and U+0063 U+0308 have the same sort key (0E 21 01 13 01 01 01 00). According to the mechanism to calculate sort keys you explained in the earlier entry "Everybody's doing the wraparound....", apparently this has nothing to do with wraparound.
Moreover, string.Compare returns 0 when you compare these two strings. In other words, string.Compare("e\u0307\u030D", "e\u0308", true, CultureInfo.InvariantCulture) gives 0.
I obtained these results by a small C# program written with Visual Studio 2005 on Windows XP SP2 Japanese. I am not familier with the .NET CLI environment, but this is not desirable, is it?

By the way, I have been reading your blog for a few months but this is the first time I post a comment. Let me thank you for writing many interesting topics. I especially like Every Character Has a Story entries.

- Tsuyoshi

# Michael S. Kaplan on Sunday, April 02, 2006 3:59 PM:

Hello Tsuyoshi-san,

There are of course other issues that come into play here -- in the case you pointed out it is the fact that each combining mark has a specific DW and the numbers under 0307 + 030d happen to equal 0308.

Now that is not the wraparound issue, but it is the same basic issue with weights.

I should probably cover the wider topic a bit more, since it is potentially a lot more relevant to people than the admittedly rare wraparound issue....

Glad you are enjoying the blog! :-)

# Maurits on Monday, April 03, 2006 12:15 AM:

Wow... to solve that issue you'd almost need to give each diacritic its own byte in the sort key. Might need sentinels anyway to mark multiply-diacritic'd characters so that they sort correctly:
e(e'^) < (e')(e^) < (e'^)e

Currently:
... 01 [2 2+'+^] 01 ... < ... 01 [2+' 2+^] 01 ... < ... 01 [2+'+^ 2] 01 ...

But without being able to add diacritics... hmm... sentinels to the rescue...

... 01 [2 ff 2 ' ^] 01 ... < ... 01 [ 2+' 2+^] 01 ... < ... 01 [ ff 2 ' ^ 2 ] 01 ...

where the "ff 2" means:
ff: this character has multiple diacritics
2: this character has, in particular, two diacritics
' and ^ are the diacritic weights

The sort key is much longer this way, alas.

Question: reorder diacritics that are on the same base character in some standard fashion? If so, that would allow e'^ == e^'.

# Michael S. Kaplan on Monday, April 03, 2006 12:20 AM:

Well, luckily that sort of problem stays theoretical since I do not know of any attested single language usage of both orders of two diacritics on a letter. :-)

# Tsuyoshi Ito on Monday, April 03, 2006 1:42 AM:

It seems the problem is I did not know what the invariant culture is. I should have searched "invariant culture" on your blog and read any entries before I post. Sorry for a noise.

My current understanding is that if there is a culture where both "e\u0307\u030D" and "e\u0308" are used, then they have differnt sort keys and they compare as different strings in that culture.

(And I meant string.Compare("e\u0307\u030D", "e\u0308", false, CultureInfo.InvariantCulture) when I wrote true instead of false.)

# Michael S. Kaplan on Monday, April 03, 2006 2:10 AM:

Hi Tsuyoshi-san,

I don't consider it noise -- at the very least, what you are pointing out is a more realistic case than most of the "wrap-around" issues, if for no other reason than it is easier to hit.

Your understanding is correct -- if a language is being supported that makes such a distinction, then we do the work to make sure that collation will support the requirement. But it still makes for an interesting case, I think....

# Tsuyoshi Ito on Monday, April 03, 2006 4:02 PM:

Hi Michael, (please call me Tsuyoshi if you like)

Thank you for the clarification. As you know, I did not know what string.Compare do or what the invariant culture is for when I wrote the first comment. Now I know:
- string.Compare(string, string, boolean, CultureInfo) performs a linguistic comparison, which means the two strings must be meaningful in the specified culture; otherwise, the result will be unpredictable or at least difficult to interpret.
- Not every string is considered meaningful in the invariant culture.

Therefore...
http://blogs.msdn.com/michkap/archive/2004/12/29/344136.aspx
> That problem remains to this day, though every single time I speak at a conference or answer a question in a newsgroup or get someone to look at posts like this one, then there is at least one less developer who has this problem. Maybe this time it is you? :-)
Yay, it's me!

...sigh. I will never confuse the linguistic comparison in the invariant locale/culture with the ordinal comparison!

# Michael S. Kaplan on Monday, April 03, 2006 5:17 PM:

Hi Tsuyoshi,

Great to hear -- welcome to the club!

(we ought to get shirts or something)

# Maurits on Monday, April 03, 2006 7:36 PM:

how about this for a T-shirt: http://tinyurl.com/rjxyt

# Michael S. Kaplan on Monday, April 03, 2006 7:52 PM:

I actually own it. But I would not call it a T-shirt for *this* club! :-)

# Maurits on Monday, April 03, 2006 8:12 PM:

Ah, good point.

Maybe this then: http://tinyurl.com/rzag5 :)

# Maurits on Monday, April 03, 2006 8:13 PM:

grrr... stupid hotlink restrictions.
http://www.geocities.com/mvaneerde/einstein.jpg

# Michael S. Kaplan on Monday, April 03, 2006 8:39 PM:

Ok, that is flattering. :-)

I am not sure I could really wear it, but if anyone walked up to me wearing that on a T-shirt I would certainly (assuming my head didn't explode!) try to buy that person a drink....

# Maurits on Wednesday, April 05, 2006 2:35 PM:

> attested single language usage of both orders of two diacritics on a letter

In physics, a'^ and a^' are two different things[1], but that probably doesn't count as a language.

Edward John Garrett, in his paper about the IPA[2] gives the tantalizing sentence "the order of diacritics is *often* irrelevant" (emphasis mine)

... but fails to give a counterexample as to when the order of diacritics *is* relevant. That doesn't mean there isn't a counterexample, but it would have been nice if he'd been explicit.

*ugh*

It would be much better if diacritic order *did* matter and there was a clear case we could point to. This feels like a trap... (1) invest in a structure of commutative diacritics; (2) five years later some language comes up with a diacritic construct that needs to be uncommutative; (3) tear down the structure and build a new one at terrible cost.

[1] a'^ is the direction an object is moving (') in three-dimensional space, normalized (^) to a unit vector. a^' is the direction (') an object APPEARS to be moving when viewed from the origin (^). If an object is at (1, 0, 0) and is moving away from the origin, a'^ is (1, 0, 0) and a^' is (0, 0, 0).
[2] http://altiplano.emich.edu/~egarrett/papers/ejg_CICLing2005.pdf

# Michael S. Kaplan on Wednesday, April 05, 2006 2:42 PM:

Hi Maurits,

No need to tear anything down -- any time a language has a requirement, we can customize the sort under that language's LCID to mak sure that there would be no break....

So no need to tear down anything. :-)

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2007/10/09 A&P of Sort Keys, part 13 (About the function that is too lazy to get it right every time)

go to newer or older post, or back to index or month or day