That is not actually a bug in sort keys....

by Michael S. Kaplan, published on 2005/01/01 03:48 -05:00, original URI:

One of the 14 people who read this blog was reading my recent post "How do sort keys work?" and he thought that he may have found a bug. He didn't, but it was an interesting point that he brought up. So I thought I'd mention it here.

(Note to all -- never be afraid to post a comment; neither bugs nor mistakes will ever embarrass me!)

His question, and the answer, follow below.

Your post and the Platform SDK list a specific format for sort keys:

[all Unicode sort weights] 0x01 [all Diacritic weights] 0x01 [all Case weights] 0x01 [all Special weights] 0x00

But using a single 0x01 byte to separate the different weights seems very dangerous. Couldn't somebody accidentally pick a string that happens to have a single byte in one of the weights that would mimic this section ending marker? This could cause two strings to be considered equal even if they are not, couldn't it?

It is a very good question, but the answer is (thankfully) that it is not possible to hit that situation. The rules for the byte values in the sort key data we provide are such that we never allow a weight byte whose value is 0x01. This was the only way we could make sure to always have an unambiguous section terminating marker.

This is obviously a tough loss. When you only have 256 possible values in each of those bytes, it is expensive to lose one of them (well, actually you lose two, since we cannot allow the 0 either!). But it's obviously required, or else we would hit the very potential bugs that occurred to you....
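To make the point concrete, here is a minimal sketch in Python of how a consumer could split a sort key into its sections, assuming only the four-section layout quoted in the question. The function name `split_sort_key` and the example bytes are made up for illustration; they are not real sort key data and not part of any Windows API.

```python
SEPARATOR = 0x01   # section-ending marker, per the layout quoted in the question
TERMINATOR = 0x00  # end-of-key marker


def split_sort_key(key: bytes) -> list[bytes]:
    """Split a sort key into its weight sections.

    Assumption: weight bytes are never 0x00 or 0x01, so the first 0x00
    ends the key and every 0x01 before it ends a section.
    """
    end = key.find(bytes([TERMINATOR]))
    if end == -1:
        raise ValueError("sort key has no 0x00 terminator")
    return key[:end].split(bytes([SEPARATOR]))


# Hypothetical bytes, chosen only to show the shape of the layout:
# [Unicode weights] 01 [Diacritic weights] 01 [Case weights] 01 [Special weights] 00
example = bytes([0x0E, 0x02, 0x0E, 0x09, 0x01, 0x02, 0x01, 0x12, 0x01, 0x00])
sections = split_sort_key(example)
print(len(sections))  # one entry per section; the last one is empty here
```

The split is only trustworthy because no weight byte can be 0x01. If one could, the same run of bytes would be readable as two different sets of sections, which is exactly the bug the question was worried about.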


This post sponsored by "" (U+0001, a.k.a. <control>, a.k.a. START OF HEADING)
(Note that this control shouldn't really look like much of anything except maybe accidentally a graphical block character on some versions of Windows -- he just did not think he'd have another chance to be relevant!)
