What is up with number sorting?
by Michael S. Kaplan, published on 2005/01/05 10:41 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/05/346933.aspx
Earlier this morning, Peter Ibbotson asked me:
Perhaps you explain something to me (I appreciate you may be the wrong person to ask, but you seem to be a sorting expert). When XP was in beta I put a bug report on this odd sorting behaviour with explorer, given a directory containing the following files:
C:\sorttest>dir /on /b
X12.TXT
X13.TXT
X1A.TXT
X1B.TXT
XAB.TXT
Shows in this order (when sorted by name) in explorer:
X1A.TXT
X1B.TXT
X12.TXT
X13.TXT
XAB.TXT
I've never understood why this happens (and it drives me nuts sometimes) Oh I'm in the UK in case that makes any difference.
I can speak to this question. :-)
The first method is the method that has existed for ages on computers and with which a geek such as myself is likely to be most comfortable -- treating numbers as bits of text, each individual number of which is compared to an individual letter. You can get it by calling CompareString or LCMapString, and everything is great.
Unfortunately, it is also incredibly nonintitive to all of the people who are not geeks.
Try explaining to your mom that 30.txt is supposed to come after 100.txt. Its not a pretty sight:
Mom: It just makes no sense to me, dear. 30 is less than 100!
Son: Yes, when it's a number. But what about when you treat it as text?
Mom: How can a number be text?
Son: Well, anything can be text. There is a difference between 4 and "4".
Mom: <shakes head> It just makes no sense to me, dear.
And the outrageous thing is that she is actually right.
Now the Shell team at Microsoft is pretty amazing (and not just the famous ones like Raymond Chen).
There are those who whine when things other people do are not able to support their customers. That is not the Shell team, though. They are the ones who go write the code to support those customers. Thats pretty impressive to me.
Anyway, they have been getting feedback on this scenario for some time. And they finally did something about it. That API is StrCmpLogicalW, and it basically recognizes that 95% of the world treats the digits as numbers rather than text. And therefore it notices that 30.txt comes before 100.txt without having to zero-pad the numbers.
I have looked at the source for that function and every once in a while I even would use the problem as an interview question after seeing the clever solution they used (no, I do not use it anymore!). but it is a good solution to the problem of extending a function like CompareString to treat digits as numbers rather than text.
Of course the Windows Explorer, which is the Shell team's most important public face, uses this (unless you turn it off with group policy, which some people do).
In theory the command window could do the same in the Dir function, because as Peter pointed out the inconsistency is a tad noticable.
Other people ask about hex digits, which would be easy if the letters ABCDEF did not also have part time jobs as actual letters that must always be treated as text. Luckily this can also be answered by the core scenario that drove the API -- the type of person who is confused by the "digits as text" behavior is probably not using hexidecimal digits.
Thinking to my own areas of expertise it seems like a shame that ASCII 0-9 are handled but the other 20+ streams of digits, including the fullwidth to 0 to 9. Of course in some cases it is not really known what people would prefer, and as a function that it is in its heart of hearts designed to be intuitive its probably important to understand the expectations prior to blazing forth. Getting that feedback first would be important....
Plus I have to think about sort keys. They are pretty central to databases, and there is a design principle in our sort key implementation that LCMapString and CompareString should give back equivalent results in terms of comparisons. But creating sort keys that can handle strings of digits of unbounded size is not a trivial problem to consider (that too would maybe make a fun interview question but also a frustrating one so i don't think I'll try it).
Anyway, that's why one can get back different results across all locales when one looks at lists from the Dir function versus from Explorer. In fact, you may be able to use as a test of "geek-ness" how long it takes you to admit the truth in the overall scenario above (I have been "out" as a geek for years so I revel in the fact, but not every geek realizes it about themselves!).
This post sponsored by "௲" (U+0bf2, a.k.a. TAMIL NUMBER ONE THOUSAND)
"Some of my best friends are digits!"
# Tad on 5 Jan 2005 8:54 AM:
So what was there clever solution to sorting the numbers in
StrCmpLogicalW? Just curious...
# Michael Kaplan on 5 Jan 2005 9:01 AM:
Heh heh heh.... good question. Of course its not *my* clever solution so I don't want to make it sound like I am bragging (I'm not).
Let's just say that whoever wrote the code is a hire. :-)
# Dean Harding on 5 Jan 2005 2:42 PM:
One possible way would be to go through each file name and pad any numbers with zero up to a fixed length and then sort with the regular
CompareString function.
The only two problems with that would be 1. It's not very efficient to modify all the file names, especially for very long directory listings, and 2. It's difficult to know how much padding you need.
For 2. the obvious solution is to check the maximum length of a string of digits in all the file names and use that (so for example, if you have 3.txt and 1000.txt then you need to pad the first file with three zeros), but that just makes 1. more obvious (since you need to loop through all the file names once to find the maximum length of a string of digits, and again to actually pad digits).
For 1. I'm not sure how you could make it quicker, and still be able to use the normal CompareString to do the main work, because you'd /want/ to use the normal CompareString function otherwise you'd end up having to rewrite a very complex function for a fairly small gain.
# Michael Kaplan on 5 Jan 2005 3:50 PM:
Re: clever -- Raymond Chen pointed out to me a slightly less cool version of this code that used to be around which was much less clever. To be clear I am talking about the code that shipped with Server 2003 and then later in XP service packs.... :-)
To Dean -- think of another issue -- what about the file A1A1A1A1A1A1A1A1A1A1A1A1A1A1.txt? Do you extend every one of those numbers?
# Dean Harding on 5 Jan 2005 4:06 PM:
Ah, good point.
OK, another solution would be to sort it normally first. This'll get all the text-based ordering out of the way, and at least all the files with the same prefix and numbered suffixes will be together. So lets say you start off like this:
a.txt, b1.txt, def20.txt, b20.txt, def6.txt, b9.txt, c.txt
After you do your normal
CompareString sort, you'd have this:
a.txt b1.txt, b20.txt, b9.txt, c.txt, def20.txt, def6.txt
Then you can group them so that files with the same prefix followed by a digit are in a group, this'd give two groups:
b1.txt, b20.txt, b9.txt and
def20.txt, def6.txt
Then it's just a matter of extracting the number from the correct position in each file name, and sorting by that number in-place. So your two groups become:
b1.txt, b9.txt, b20.txt and
def6.txt, def20.txt
And the whole thing is:
a.txt, b1.txt, b9.txt, b20.txt, c.txt, def6.txt, def20.txt
This way, you don't modify strings, you don't have to worry about padding with zero, and you'll still get the correct answer (I think :)
# Norman Diamond on 5 Jan 2005 4:51 PM:
> Mom: How can a number be text?
Send her back for a refresher in grade 4 arithmetic, or somewhere around there. The difference between numbers and numerals was taught there, and numerals are text. Different numeral bases came up in grade 6, which should make it more obvious that numerals are text.
Nonetheless, sorting them as numbers instead of numerals is indeed agreeable 95% of the time. I only wonder, as you said you do, why Windows Explorer only does it around 50% of the time instead of 95% of the time.
Since there are some APIs now that recognize the numericity of various kinds of characters, one would think that Explorer would properly sort the department names of the company that I work in, but no I have to find the folder in order by the pronunciation of the number[*] instead of the numeric value. Dai-ichi-nani-nani, dai-san-nani-nani, dai-yon-nani-nani[*], dai-ni-nani-nani, 1 3 4 2 and the 4 isn't even in the expected place by pronunciation[*].
[* The matter of multiple pronunciations of a Kanji has been discussed in other threads. The character meaning four is getting sorted by pronunciation shi instead of yon, which is understandable but still not the first place one would look for it. I've never heard anyone say "dai-shi-nani-nani", it's "dai-yon-nani-nani".]
# Michael Kaplan on 5 Jan 2005 6:24 PM:
> Send her back for a refresher in grade 4 arithmetic, or somewhere around there.
Well, I will leave aside the fact that I would not talk to my mother that way, and point out that it would really not behoove Microsoft well to treat customers that way. So I will avoid that approach. :-)
> Since there are some APIs now that recognize the numericity of various kinds of characters
Well, the function in Windows today only handles 0123456789 -- the "ASCII digit" collection. It might be cool to do more, but you start running in to problems when you try to define what "more" should be since it is not always intuitive.
# Marcel on 5 Jan 2005 9:47 PM:
OK, I like hacks, so what about this:
Go through the strings and contract all consecutive digits into a single character, for example in the Unicode Private Use Area. After that sort the strings in a normal way (provided the sort routine sorts PUAs according to their value). Prepending the private character with a fixed '0' could also ensure that it is sorted in the place of ASCII digits.
X1A.TXT -> X0'U+E001'A.TXT
X13.TXT -> X0'U+E00D'.TXT
XAB.TXT -> XAB.TXT
# Norman Diamond on 5 Jan 2005 10:57 PM:
>> Since there are some APIs now that
>> recognize the numericity of various kinds
>> of characters
>
> Well, the function in Windows today only
> handles 0123456789 -- the "ASCII digit"
> collection.
Where did I read that Windows APIs were added that recognized Thai digits and various other code page digits. I'm sure I asked somewhere and received an answer that Windows Explorer is only designed to recognize the numericity of ASCII digits, but the function in Windows today is comparatively internationalized.
# Raymond Chen on 6 Jan 2005 12:24 AM:
Norman: Are you arguing that Explorer should sort files as follows:
two.txt
five.txt
sex.txt
elf.txt
three hundred.txt
?
# Michael Kaplan on 6 Jan 2005 1:13 AM:
Norman, re: Thai digits you are mistaken.
Speaking personally, I never claimed this API was interationalized, and I pointed out that doing so leads one to problems of non-intuitive behavior.
There are many functions in Windows that are well-internationalized. This not one of them, certainly when looking at other language digits.
# Michael Kaplan on 6 Jan 2005 1:18 AM:
Marcel -- interestng thought. Though one of the problems with that plan is that it puts digits in a wierd place compared to other characters (since the PUA has weight that puts it somewhere very unlike where numbers go).
The fact that these code points already have weight that needs to be primary between them will also lead to problems
And of course it still isn't an unbounded answer....
FWIW, I am not going to be hiring anyone based on their anwsers. But I am impressed at the quality of the answers. :-)
# Mike Williams on 6 Jan 2005 3:28 AM:
If you run a search on filenames via Explorer, the results always seem to come back in strict order (ie not using
StrCmpLogicalW) and also in lower-case for some reason. Can you shed any light on this? It's a little frustrating trying to filter a folder using a filename substring and having the results presenetd differently to a regular Explorer listing.
# Marcel on 6 Jan 2005 5:09 AM:
> Though one of the problems with
> that plan is that it puts digits
> in a wierd place compared to
> other characters
Therefore the prepended '0', which makes all those private values primarily sort into the same spot (of U+0030) and only secondarily use the PUA value.
But true, it's not unbound, though the range should be big enough to sort a few photos. Anyway, it was good enough for me considering I came up with it after a long night's work at 7am in my time zone ;-)
# Michael Kaplan on 6 Jan 2005 7:50 AM:
Hi Mike -- The Shell filtering functionality and the way help indexing works are both topics I will have to cover another day. I find them frustrating, too....
Marcel -- I understood that part, I was more interested in how you would handle the string that was engineered to look the same (i.e. sends those exact PUA values). I am not slsmming the idea though, I was just pointing out the problems -- so you wouldn't feel like it had not been thought of and discarded regretfully. :-)
# Norman Diamond on 6 Jan 2005 4:39 PM:
1/6/2005 12:24 AM Raymond Chen
> Norman: Are you arguing that Explorer should
> sort files as follows:
> two.txt
> five.txt
[...]
Good point. But I really thought I had read somewhere that Windows APIs had been augmented to recognize digits written in various linguistic character sets. And we both know that digits as written in Chinese characters can (and often are) strung together as digit sequences.
Full formal wording is a more difficult question since, even though each character is numeric, how do we say if the full formal construction is a numeral or a word phrase?
1/6/2005 1:13 AM Michael Kaplan
> Norman, re: Thai digits you are mistaken.
OK, I can't find where I read that Windows APIs had been augmented in that manner. On the other hand though, I saw a Thai use Thai digits and saw Windows accept them. And that was Windows XP Japanese (not Thai, not US with MUI, not anything else) with the input locale set to Thai.
# Michael Kaplan on 6 Jan 2005 4:59 PM:
Well digits are accepted. But there is no Windows API that readds them as numbers.
Either someone was mistaken when they spoke or you read them wrong. This applies to every version of Windows, no matter what the input locale is.
Perhaps you are talking about unrelated issues with GDI and rendering based on native digits in a locale, but if so that has nothing to do with collation.
# Larry Osterman on 10 Jan 2005 2:41 PM:
Btw, Valorie's been on me for literally years now about the fact that OE doesn't sort email messages with text in them.
She hates it that OE sorts:
This is message one of three
This is message two of three
This is message three of three
as:
This is message one of three
This is message three of three
This is message two of three
# Michael Kaplan on 10 Jan 2005 2:47 PM:
I just mentioned the notion of addressing this to the owner of the collation data and her look could have killed. :-)
Though I am inclined to agree with her -- its a dangerous road to go down for an API like
CompareString. Or even for higher level callers like OE (luckily most of them are sorted by date anyway).
# Peter Ibbotson on 14 Jan 2005 3:38 AM:
Thanks for that, I'd still like an option to turn it off though. One of the problems I have is that I have loads of file names with that kind of structure, and depending on what program I'm using to look at files, the file I want may be at either end of a list.
It's a real pain as you swap between source control and explorer.
Anyway thanks
# Michael Kaplan on 14 Jan 2005 7:00 AM:
Hi Peter,
It actually *can* be changed via a Group Policy setting. Precisely because not everyone will neccesarily like it....
You can even query this policy's settings with the
SHRestricted API and the REST_NOSTRCMPLOGICAL member of the
RESTRICTIONS enumeration.
# Mike Williams on 27 Jan 2005 2:55 AM:
The other sorting annoyance is this sort of thing:
The Sea II.mp3
The Sea.mp3
This is also seen with Windows Media Player's Album list.
# Michael Kaplan on 27 Jan 2005 2:59 AM:
Yes, though (to me) it actually does make sense since space comes before dot....
Though I would not mind if Explorer removed the extension from consideration for the sort, except to break ties. It seems like the results would be more intuitive more often.
# James on 12 Oct 2007 3:57 PM:
If the shell team is so "amazing", they should add an option to disable number sorting. Its very annoying when file names contain numbers and some of the numbers have leading zeros. I expect this sort order:
000.txt
003.txt
020.txt
100.txt
but now I get this:
000.txt
100.txt
020.txt
003.txt
Sometimes the reason we added leading zeros to the filename in the first place was to make the files sort correctly, now its broken again :-(
File names are text and should be able to be sorted as such.
Gee Thanks for this "new feature"!
# Michael S. Kaplan on 12 Oct 2007 4:17 PM:
James,
They do have such an option.
But FWIW I do not see the results you list here -- prefix zeroes still work fine for me.
# Maurits [MSFT] on 26 Oct 2007 4:36 PM:
> sex.txt
> elf.txt
(Um, ew.)
From personal experience, this number sorting thing is REALLY ANNOYING when you're looking at a huge list of subdirectories, all of which are GUIDs.
I agree with those that want an option to turn it off.
I might argue that the most important part of /any/ feature is the ability to turn it off.
Bury the off switch somewhere Mom can't find it, if it makes you feel better. But implement the off switch before implementing the feature.
# Maurits [MSFT] on 26 Oct 2007 4:38 PM:
(Now that I actually read the comments)
> It actually *can* be changed via a Group Policy setting
Ah, perfect.
# Maurits [MSFT] on 26 Oct 2007 4:41 PM:
# Maurits [MSFT] on 26 Oct 2007 5:19 PM:
Oy. I've dug through gpedit.msc on Vista RTM and I can't find anything that looks like it might be this option. Can I buy a vowel?
Please consider a
donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.
referenced by
go to newer or older post, or back to index or month or day