Everybody's doing the wraparound....

by Michael S. Kaplan, published on 2006/03/07 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/07/545097.aspx

(computerized apologies to Ray Charles for the title of the post!)

Will anyone forget when I asked the question What do you get when you combine a base character with a buttload of diacritics?

I was of course talking about fonts there. This time I am going to take a slightly different approach, and talk about collation.

I will give the string, the code points, and the sort key. We'll start simply, with one letter:

e
U+0065
0e 21 01 01 01 01 00

Now we will go with something a little more complicated (the difference from above marked in RED):

ẽ
U+1ebd
0e 21 01 19 01 01 01 00

or its alter ego in normalization form D:

ẽ
U+0065 U+0308
0e 21 01 19 01 01 01 00

Hmmm... let's look at another diacritic:

ê
U+00ea
0e 21 01 12 01 01 01 00

ê
U+0065 U+0302
0e 21 01 12 01 01 01 00

Ok, and now for the kicker:

ễ
U+1ec5
0e 21 01 29 01 01 01 00

ễ
U+0065 U+0302 U+0303
0e 21 01 29 01 01 01 00

But wait -- where did the 29 come from? I mean the first one had no DW (diacritc weight), and the next two had 19 and 12, respectively.

I had talked in previous posts about sort keys about how the minimal weight is 2, but that this weight would only be seen when it was needed as a placeholder, e.g. in the following string:

eễ
U+0065 U+1ec5
0e 21 0e 21 01 02 29 01 01 01 00

So, if you take that (sometimes invisible) 2 that as there on the 'e' always and combine it with the 17 on the tilde and the 10 on the circumflex, you get 29.

Easy.

Now what happens when you get that buttload of diacritics? Let's add them one at a time:

U+0065
0e 21 01 01 01 01 00

U+0065 U+0300
0e 21 01 0f 01 01 01 00

U+0065 U+0300 U+0301
0e 21 01 1b 01 01 01 00

U+0065 U+0300 U+0301 U+0302
0e 21 01 2b 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303
0e 21 01 42 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304
0e 21 01 57 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305
0e 21 01 95 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306
0e 21 01 a8 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307
0e 21 01 b6 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308
0e 21 01 c7 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309
0e 21 01 06 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a
0e 21 01 1e 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a U+030b
0e 21 01 39 01 01 01 00

U+0065 U+0300 U+0301 U+0302 U+0303 U+0304 U+0305 U+0306 U+0307 U+0308 U+0309 U+030a U+030b U+030c
0e 21 01 4b 01 01 01 00

Uh oh! Eventually we wrap....

We only have one byte of space to store that diacritic weight (any more than a byte would run into the next character's byte), and when we run out there were really only three choices:

We could stop at 0xff and simply not added any more after that;
We could stop at 0xff and given an error;
We could wrap around.

The problem with #2 is that it pretty sharply limits what one could do in a potentially unpredictable way, and the problem with #1 is that all such strings would be equal. Now with option #3 there is a good chance that there will be a difference between strings being compared, though it will sometimes unfortunately make a string that is clearly greater than another string feeling like it is less than it -- a cure that may be worse than the disease....

Well, I won't argue whether one of the other choices might have been better; we are kind of stuck with it now (there are technically a few cases that wrap that are less theoretical than the case above, lest you try to dismiss the example as being a bit too unrealistic!).

But at least that answers the question about what happens when you try to collate a buttload of diacritics....

This post brought to you by "e" (U+0065, a.k.a. LATIN SMALL LETTER E)

Ben Bryant on 7 Mar 2006 7:32 AM:

Would it be correct to say it doesn't really matter what you (MS) do, just that you do something, since this is an edge case not to occur in practical data.

Michael S. Kaplan on 7 Mar 2006 10:08 AM:

Hi Ben,

Well, I would not want to say it does not matter, since there are some actual realistic cases where we wrap around. And if one was unlucky enough to be caught in those situations, then one would be very unhappy if I was just dismissive about the situation.... :-(

But I *would* say that there is good an bad in all three possibilities.

I can usually convince myself that any of the three choices is the "best" choice, usually on the basis of a paticular bug report in one or both of the other alternatives!

Maurits [MSFT] on 7 Mar 2006 11:13 AM:

... why, exactly, do you only have one byte?

Michael S. Kaplan on 7 Mar 2006 4:29 PM:

It is due to the actual binary format of sort keys, which we cannot change....

Maurits [MSFT] on 7 Mar 2006 4:59 PM:

... um... why not? Sorry if I'm being a pest.

Michael S. Kaplan on 7 Mar 2006 11:35 PM:

Hi Maurits,

A pest? No, not at all. :-)

The binary format of sort keys is important to people who need to "unpack" the keys, which is possible and there are people who depend on the ability. If we were able to support more than one byte of DW per sort element we would need sentinels or some other mechanism to support such a change.

Maurits [MSFT] on 29 Mar 2006 11:48 AM:

I just thought of a way to use sentinels in a way that would be backwards-compatible. I'll try to post it today.

Maurits [MSFT] on 29 Mar 2006 1:57 PM:

Hmmm... I just thought of a potentially serious problem with the third method (wrap.) What if the value wraps to 0xfe or 0xfd? Then when you add 2, you get 01 or (worse?) 00.

In particular:

"e" with 15 tildes will have a DW of 15 * 17 which comes to 255; add 2 and you wrap to 1.

"e" with 5 circumflexes and 12 tildes will have a DW of (5 * 10) + (12 * 17) = 254; add 2 and you wrap to 0.

On the other hand, what are the odds of encountering a string with such a diacritically-loaded character?

Michael S. Kaplan on 29 Mar 2006 2:05 PM:

Yes, I agree that is pretty bad, too. Let me give this one some thought....

Maurits [MSFT] on 29 Mar 2006 2:27 PM:

OK, here's my suggested encoding mechanism in the form of a C# console app:

http://www.geocities.com/mvaneerde/diacritic-weight-encoder.txt

Michael S. Kaplan on 29 Mar 2006 4:02 PM:

Hi Maurits!

Interesting -- though not backwards compatible -- since no one can crack this sort key without building in explicit knowledge of the format change.... :-)

Maurits [MSFT] on 29 Mar 2006 4:15 PM:

Perhaps backwards-compatible was the wrong term. I meant that it generates the same sort keys that the current method does for most data (non-pathological data.)

Michael S. Kaplan on 29 Mar 2006 4:25 PM:

Right, understood -- but this would still be a known format change, and one that would impact comparisons of sort keys (i.e. it would not always allow comparisons to work the same between comparing sort keys and strings....

Maurits [MSFT] on 29 Mar 2006 4:41 PM:

Indeed it would be :)
I was trying to generate an example string that wraparounded (ugh) to 00 or 01, but I couldn't (on Windows 2000)
Maybe I'm trying to solve a problem that doesn't exist?

Maurits [MSFT] on 29 Mar 2006 6:01 PM:

> one that would impact comparisons of sort keys...

Not sure I understand this. As I see it, it would /fix/ comparisons of sort keys.

Given a set of strings of the same UW but with various DW, the current algorithm sometimes sorts correctly and sometimes incorrectly. In fact, sometimes it considers strings to be equal that are not equal (any time the difference in the DW is 256.)

The proposed algorithm sorts the strings in DW order correctly in all cases, AFAICS.

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2010/09/07 Refusing to ignore some particular character's width isn't [always] an act of discrimination…

2007/09/12 A&P of Sort Keys, part 2 (aka The string that won? Didn't have a mark on him!)

2007/09/10 A&P of Sort Keys, part 0 (aka The empty string sorts the same in every language)

2006/09/19 Put in on my Tab, please

2006/05/31 Keeping out the undesirables?

2006/03/30 If at first you don't succeed, there's probably still a bug

go to newer or older post, or back to index or month or day