Incomplete Scenarios: They don't know everything that's up with number sorting

by Michael S. Kaplan, published on 2007/12/22 10:16 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/12/22/6834163.aspx


In just a few weeks from now, it will have been three years since I first wrote about What is up with number sorting?.

Since then, that SHLWAPI function StrCmpLogicalW has come up a few other times, like in:

In the meantime, the basic issue of sorting digits contained in strings as if they are TEXT vs. as if they are NUMBERS (what I have always thought of as "sort digits as numbers") has had some recent airplay in various blogs:

And Jeff's post actually linked to a bunch of others as well.

All of the different algorithms I looked at had some basic issues in common, irregardless (!) of language or platform. The main point being that none of them did anything with LEADING ZEROES.

(They also all had in common that none of them pointed to me except I think one regular reader from here mentioned posts here in a comment, but since I only raised issues and talked about a function, it kind of makes sense that posts focused on algorithms would not see the need to reference someone not providing an algorithm!)

Many of the people involved DID talk about LEADING ZEROES. But they did so only to contrast them as a technique to get the same results (which is to say that people worried about sorting 1000, 200, 30, 4 and have them come out as 4, 30, 200, 1000 could either write a function or make the numbers 1000, 0200, 0030, and 0004).
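To make that contrast concrete, here is a minimal Python sketch using the same four values. The variable names are just for illustration; it shows that plain text sorting gets the order wrong, and that either a numeric key or zero-padding to a common width fixes it.

```python
values = ["1000", "200", "30", "4"]

# Sorting as TEXT compares character by character, so "1000" < "200" < "30" < "4".
as_text = sorted(values)

# Option 1: write a function (here simply int) that sorts the digits as NUMBERS.
as_numbers = sorted(values, key=int)

# Option 2: pad with leading zeroes to a common width, then sort as plain text.
padded = sorted(v.zfill(4) for v in values)

print(as_text)     # ['1000', '200', '30', '4']
print(as_numbers)  # ['4', '30', '200', '1000']
print(padded)      # ['0004', '0030', '0200', '1000']
```

Padding only works when every value is padded to the same width, which is exactly the assumption that breaks down once the leading zeroes are already in the data with varying lengths.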

Nobody talked about what their algorithms would do when comparing things like 00003, 0003, 003, 03, and 3.

All of them have some kind of fundamental THREE-ness about them, and unsurprisingly most of them don't tend to fare very well since they are mostly focused on whether the sections of the two strings that are numbers are of the same length and whether they are equal numerically -- and length is taken as a pretty fundamental indication of which one is bigger -- so I suppose size matters a lot to them. :-)
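A rough Python sketch of that length-first style of comparison shows the failure. The function `length_based_compare` is a hypothetical stand-in written for this illustration, not the code from any of the posts in question, and it only looks at ASCII digits.

```python
import re
from functools import cmp_to_key

# Split a string into alternating runs of ASCII digits and non-digits.
_CHUNKS = re.compile(r"[0-9]+|[^0-9]+")

def _is_digits(chunk):
    return "0" <= chunk[0] <= "9"

def length_based_compare(a, b):
    """Hypothetical comparer: for two digit runs, the longer one is 'bigger'."""
    ca, cb = _CHUNKS.findall(a), _CHUNKS.findall(b)
    for x, y in zip(ca, cb):
        if _is_digits(x) and _is_digits(y) and len(x) != len(y):
            return -1 if len(x) < len(y) else 1
        if x != y:
            # Same-length digit runs (or text runs) fall back to text comparison.
            return -1 if x < y else 1
    # All shared chunks are equal: the string with fewer chunks sorts first.
    return (len(ca) > len(cb)) - (len(ca) < len(cb))

items = ["0003", "004", "3", "03", "00003", "003"]
result = sorted(items, key=cmp_to_key(length_based_compare))
print(result)  # ['3', '03', '003', '004', '0003', '00003']
```

Note that 004 lands in the middle of the various spellings of three, because length alone decides the order before the numeric value is ever considered.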

StrCmpLogicalW does intentionally and deterministically handle the case of leading zeroes (for what it's worth I don't like the way it does break such ties, but I am a huge fan of deterministic behavior in such cases), though to be honest StrCmpLogicalW didn't come up very much in the various posts either.
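As an illustration of what deterministic handling could look like, here is a small Python sketch. The tie-break rule it uses (more leading zeroes sort first) is an assumption chosen purely for the example; it is not a description of what StrCmpLogicalW actually does, and `natural_key` is a hypothetical name. Only ASCII digits are considered.

```python
import re

def natural_key(s):
    """Sort key: numeric value decides first; leading zeroes only break ties."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so odd indexes are always digit runs and even indexes are always text.
    parts = re.split(r"([0-9]+)", s)
    primary = [int(p) if i % 2 else p for i, p in enumerate(parts)]
    # Assumed tie-break: a longer digit run (more leading zeroes) sorts first.
    tiebreak = [-len(p) for p in parts[1::2]]
    return (primary, tiebreak)

items = ["3", "004", "0003", "03", "00003", "003"]
result = sorted(items, key=natural_key)
print(result)  # ['00003', '0003', '003', '03', '3', '004']
```

Keeping the tie-break in a separate second component means it is consulted only when the strings are otherwise equal, so every spelling of three sorts before 004 and the order among them is always the same.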

Though perhaps if people had been willing to look a bit closer, then the LEADING ZEROES issue would have been noted and more widely handled.

And then there are the issues I have brought up previously that complicate the idea of this kind of functionality in NLS such as dealing with file extensions (handled with the Vista changes), figuring out what to do with other digits in Unicode (a ton of issues there!), and dealing with sort keys (especially difficult to tackle efficiently whether for numbers of unlimited size or of an arbitrarily limited size).
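To give a feel for the sort key problem, here is a minimal Python sketch of one possible encoding, under loudly stated assumptions: ASCII digits only, an arbitrary limit of 99 digits per number (`MAX_DIGITS`), and a made-up key format in which each number is replaced by a two-digit length prefix followed by its digits with leading zeroes stripped. None of this reflects how NLS sort keys are actually built.

```python
import re

MAX_DIGITS = 99  # arbitrary limit assumed for this sketch

def sort_key(s):
    """Rewrite digit runs so that plain text comparison orders them numerically."""
    def encode(match):
        digits = match.group(0).lstrip("0")
        if len(digits) > MAX_DIGITS:
            raise ValueError("number too long for this key format")
        # Fixed-width length prefix: a shorter number always compares lower.
        return f"{len(digits):02d}{digits}"
    return re.sub(r"[0-9]+", encode, s)

names = ["file10", "file9", "file100"]
print(sorted(names, key=sort_key))  # ['file9', 'file10', 'file100']
print(sort_key("0003") == sort_key("3"))  # True
```

The trade-offs show up immediately: the fixed-width prefix is what imposes the size limit, and stripping the leading zeroes means 0003 and 3 produce identical keys, so any tie-break would need extra data appended to the key.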

Given that those issues are pretty much blockers to my old team being able to consider the functionality, in addition to the basic need to handle LEADING ZEROES properly, I would have loved to have seen a bit more thought on the issues that people have not tackled yet on the "my algorithm is better than yours" frontier if so many different people were going to be thinking about the problem....

 

(Hat tip to Mike Gunderloy)

 

All of the characters in Unicode have taken off for Grand Cayman for the Christmas holiday weekend
(they are staying at the Marriott Grand Cayman Beach Hotel in case you are there and are curious at all the characters hanging out by the pool!)


# Thomas G. Mayfield on 6 Jan 2008 9:46 PM:

I would expect 00003, 003, 03, and 3 to sort as being equal. How would you expect them to be handled?  Only thing I'd suggest other than being equal would be that leading 0s push it to the top, which would make the sort deterministic.

# Michael S. Kaplan on 6 Jan 2008 10:51 PM:

The various "solutions" these other people provided don't handle it -- they are purely length based for numbers, so 0003 comes after 003 since the former is longer (more importantly, 0003 comes after 004 since it is longer).

The Shell solution is different -- better, but still wrong in my opinion. At least it is deterministic and the results make more sense.

# Michael S. Kaplan on 10 Jan 2008 5:21 PM:



referenced by

2008/03/04 Consistency in the Windows Shell is not overrated; it's just underobserved!
