What's the shape of the sort?

by Michael S. Kaplan, published on 2008/11/01 11:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/11/01/9028213.aspx


There is an old Marx Brothers routine that goes something like this:

Groucho: What's the shape of the world?
Harpo: It's terrible.
Groucho: No, I'm talking about the shape.
Harpo: Oh, that's different.
Groucho: So what's the shape of the world?
Harpo: I don't know.
Groucho: Well, what's the shape of my cuff links?
Harpo: Square.
Groucho {becoming exasperated}: Not these cuff links, the ones I wear on Sunday.
Harpo: Round.
Groucho: So, what's the shape of the world?
Harpo: Square in the weekdays, round in Sunday!

I was thinking about it the other day, when a question came up about whether one would prefer CompareStringA or _stricmp for string comparisons in a particular situation.

That could really be thought of as a trick question -- the preference is obviously to use Unicode strings and be worried about CompareStringW or _wcsicmp!

But in this case the component happened to be dealing with "ANSI" strings, so I'll temporarily reject the premise that the question is flawed. I'll get back to it in a moment.

The question is somewhat similar to the one between _stricmp and _stricoll, with the added bonus of confusion over which code page is used to interpret the 256 byte values of the ANSI code page.

In fact, to make the functions more analogous for comparison/contrast purposes, it might be better to ask about either CompareStringA with NORM_IGNORECASE vs. _stricmp_l (so you always get to specify the locale to use for determining the code page) or CompareStringA with the LOCALE_USE_CP_ACP flag vs. _stricmp (so you never get to do so).

Maybe I should explain the situation for both of these options, now that I have brought the matter up....

To start, when looking at CompareStringA with the LOCALE_USE_CP_ACP flag vs. _stricmp, in both cases the ANSI string is assumed to be in the "default code page". In the former case that is the default system code page, which is what GetLocaleInfoA returns for LOCALE_IDEFAULTANSICODEPAGE of the default system locale (aka language for non-Unicode programs). In the latter case it is the equivalent of the LOCALE_IDEFAULTANSICODEPAGE of the current CRT locale (retrievable via _get_current_locale).

In the CompareStringA with NORM_IGNORECASE vs. _stricmp_l case, you specify the locale to use rather than relying on one controlled by settings that exist prior to the call.

Once you know how to interpret the 256 bytes, you know what the characters are and you know how to determine the "case" of the characters, if they in fact have case.

For example, if the code page in question is Windows code page 1252, then in all of the above cases if you have the byte 0xC5 then you have the letter Å (U+00c5, aka LATIN CAPITAL LETTER A WITH RING ABOVE) which the case insensitivity of all of the functions will interpret as being the same as å (U+00e5, aka LATIN SMALL LETTER A WITH RING ABOVE), as I explain in an earlier blog (CompareString ignores case by lowercasing).

But if on the other hand the code page in question is Windows code page 1255, then in all the above cases if you have the byte 0xC5 then you have the point ֵ (U+05b5, aka HEBREW POINT TSERE), which will not lowercase to anything, whatsoever. Hebrew letters don't even have case, so of course its accents and points and marks and punctuation wouldn't!
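To make the 0xC5 example concrete, here is a minimal sketch in Python rather than Win32 or the CRT. It assumes Python's built-in cp1252 and cp1255 codecs as stand-ins for the Windows code pages, and Python's Unicode lowercasing as a stand-in for the case folding the real functions do (the actual casing tables in Windows may differ in places). The helper names decode_byte and equal_ignoring_case are made up for illustration.

```python
import unicodedata

def decode_byte(value, code_page):
    """Interpret a single byte under the named code page."""
    return bytes([value]).decode(code_page)

def equal_ignoring_case(a, b, code_page):
    """Decode both byte strings, lowercase them, then compare code points.

    Assumption: str.lower() only approximates what CompareStringA or
    _stricmp_l would do with their own casing tables.
    """
    return a.decode(code_page).lower() == b.decode(code_page).lower()

for code_page in ("cp1252", "cp1255"):
    ch = decode_byte(0xC5, code_page)
    cased = "has case" if ch.lower() != ch.upper() else "no case"
    print(code_page, f"U+{ord(ch):04X}", unicodedata.name(ch), cased)
    print("  0xC5 equals 0xE5 ignoring case:",
          equal_ignoring_case(b"\xC5", b"\xE5", code_page))
```

The same two bytes compare as equal under code page 1252 and as different under code page 1255, which is the whole reason the choice of code page matters before any "ignore case" logic can even start.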

Now this is not the same as "linguistic casing" which I first explained back in What does "linguistic casing" mean?, since in the CompareStringA case the flag is not being passed and in the _stricmp case it is never ever being passed.

This is why, even though _stricmp is not considered a function that uses locale-specific information (in contrast to _stricoll), it actually does use some locale information, albeit indirectly and (unless you parsed the first five of the previous six paragraphs in one reading) confusingly. Which gets back to the Unicode (CompareStringW or _wcsicmp) issue: if you keep it in Unicode then you can ignore the first five of the previous six paragraphs, which are really the most confusing parts anyway.

This then lets you look at the real issue that the CompareStringA vs. _stricmp (and also CompareStringW vs. _wcsicmp) question was always about -- it really is a linguistic comparison vs. binary/ordinal one, which has all the issues that the whole invariant versus ordinal question dredges up, plus actually specifying any locale for comparison rather than having the single INVARIANT (yet still linguistic) choice.

And the answer to that question is, and has always been, that depends. On what is being compared.

This particular time, the code happened to be looking at the name of a SQL Server, to see if it was the same as the name of the computer it was running on.

In most cases, this would suggest to me the need for a binary/ordinalignorecase type of semantic, though there might be some edge cases (since a SQL Server is involved) where you'd want SQL Server type comparisons, which would depend on the collation of the SQL Server: either linguistic or the SQL Server notion of binary, which has no case-ignoring facility.
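For what a binary/ordinalignorecase comparison might look like, here is a rough sketch in Python. It is an approximation under stated assumptions, not what any Windows API does internally: each character is uppercased on its own, any mapping that would change the length is skipped (to mimic simple, non-linguistic casing), and the results are compared code point by code point. The function names and the server and computer names are hypothetical.

```python
def simple_upper(text):
    """Uppercase one character at a time, leaving alone any character
    whose uppercase form is longer than one character (e.g. the sharp s)."""
    return "".join(c.upper() if len(c.upper()) == 1 else c for c in text)

def ordinal_ignore_case_equal(a, b):
    """Compare code points after simple uppercasing; no locale is consulted."""
    return simple_upper(a) == simple_upper(b)

server_name = "SqlBox01"    # hypothetical SQL Server name
computer_name = "SQLBOX01"  # hypothetical computer name

print(ordinal_ignore_case_equal(server_name, computer_name))
# Nothing linguistic happens here: the sharp s is not expanded to "SS".
print(ordinal_ignore_case_equal("stra\u00dfe", "STRASSE"))
```

The first comparison succeeds and the second does not, which is the kind of predictable, locale-independent behavior you usually want for identifiers like machine names.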

Am I the only one who sees a conversation about which function to use as a very geeky, extended mapping of that Marx Brothers bit? :-)

If I am ever at some random event and can find someone to play the Harpo role, I'd even try and script it out and perform it!


This blog brought to you by ֵ (aka U+05b5, aka HEBREW POINT TSERE)


Blake Handler on 1 Nov 2008 11:58 AM:

Huh? Harpo didn't talk?

Michael S. Kaplan on 1 Nov 2008 12:33 PM:

The original incarnation of that skit had Harpo; it was before he "lost" his voice in the act. I think Chico did that skit in later acts (I recall vaguely reading about this in a book that included Groucho's letters - ref: The Groucho Letters).

Andrew West on 3 Nov 2008 5:13 AM:

As long as I don't have to speak, I'll play the Harpo role.

Michael S. Kaplan on 3 Nov 2008 10:21 AM:

Well, I'll need someone to do a modified version of the role in the skit that I quoted, so some speaking would in fact be required, sorry! :-)

