The sort may not be reproducible

by Michael S. Kaplan, published on 2007/02/05 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/05/1602498.aspx


Jeff asks:

I just met some problem when using FindFirstFile-FindNextFile:

It seems these 2 API think ~  > _ > [a-z] > [0-9] (so _dummy.doc will be returned before ~dummy.doc); while both lstrcmpi and CompareString think _ >  ~. So what string comparison function is used internally in these 2 API? When comparing 2 directories, it is critical to know the char order used by these 2 APIs.

Unfortunately for Jeff, the docs for FindFirstFile and FindNextFile both make it perfectly clear that if you care about the order of the results and need any kind of independently reproducible determinism that you are really somewhat screwed:

The order in which the search returns the files, such as alphabetical order, is not guaranteed, and is dependent on the file system. You cannot depend on any specific ordering behavior. If the data must be sorted, you must do the ordering yourself after obtaining all the results.

...

The order in which this function returns the file names is dependent on the file system type. With the NTFS file system and CDFS file systems, the names are returned in alphabetical order. With FAT file systems, the names are returned in the order the files were written to the disk, which may or may not be in alphabetical order.

And clearly the meaning of "alphabetical order" is subject to all kinds of random factors based on different functions that would be used to return something that would be called alphabetical order (and the actual results are being returned at a level that is well below any code that could access functions like CompareString or lstrcmpi anyway, even if that behavior was desired).

In any case, the simple fact that results vary with different types of volumes would suggest that the actual ordering is most likely based on an order returned by the driver of the file system -- another reason why one should not rely on the results never changing....

 

This post brought to you by ~ (U+007e, a.k.a. TILDE)


dm on 2 Jan 2008 5:01 PM:

Maybe it's because of a bug in MS implementation of STL. I just encountered some weird sort results with FindNextFile, and then the same results sorting a wstring vector.

Consider the following code:

bool r1 = wstring( L"_" ) < wstring( L"a" );

bool r2 = wstring( L"_" ) < wstring( L"A" );

You might think that r1 == r2, but it's not so. It happens that 'A' < '_' < 'a' in wstring.

P.S. I use VS 2005.

Michael S. Kaplan on 2 Jan 2008 5:43 PM:

It is not a bug -- that is appropriate BINARY ordering, which is the default here.

dm on 2 Jan 2008 5:46 PM:

I did a bit more investigating and it turned out that wstring comparison operators perform exactly like wcscomp (as it should be).

Actually, using lower case string comparison solved my problem. I could produce the sort order like in Windows Explorer.

The problem with FindNextFile seems to be that it yields case-sensitive sort order while Explorer uses lower-case sort.

Michael S. Kaplan on 2 Jan 2008 5:49 PM:

Ick, no -- use StrCmpLogicalW if you want results like Explorer -- that is what *it* uses....

Again, FindNextFile results are in no at guaranteed to be alphabetical, or binary, or anything really.

dm on 2 Jan 2008 6:10 PM:

Oh, thanks, I'll use that :)

As to FindNextFile, on my system at least it gives sorted results, but the sorting is case-sensitive. Ok, now I know that I need to resort stuff :)

Michael S. Kaplan on 3 Jan 2008 6:08 AM:

Try looking at letters with diacritics and you'll see -- it is in no way alphebetic -- it is the binary order, which happens to have the alphabet in A-Z order there....


Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

go to newer or older post, or back to index or month or day