Which comes first, 'a' or 'A' ?

by Michael S. Kaplan, published on 2005/11/13 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/13/492179.aspx


A wise man (well, I think it was the comedian Emo Philips, does he count?) once spoke the following little fable:

I had an argument with my father. I argued that Plato was the father of philosophy. My dad of course took the opposite position, that I should wax the kitchen floor.

I said: "Well, the kitchen floor doesn't exist! At least not in the permanent sense that the concept 'floor' does."

He said: "Do you think the concept 'your skull' exists?"

I said: 'Yes'. And then he surprised me by juxtaposing the two concepts.

Someone was trying to tell me about it the other day but I made it clear I had already heard it (my sources of knowledge are numerous but perhaps not impressive).

Later on, I decided I would juxtapose some things in a blog post. :-)

Here goes....

The concept of alphabetic case is interesting. And so is the concept of linguistic collation. So let's juxtapose those two concepts for a moment.

Which comes first -- uppercase or lowercase?

Well, in a binary sort, the answer is simple -- uppercase comes first. Every time. It is how code points are encoded in Unicode. Period.
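For the ASCII letters at least, this is easy to see for yourself. The snippet below is a minimal sketch that uses Python's built-in string comparison (which goes code point by code point) as a stand-in for a binary sort; the word list is made up for illustration.

```python
# Python compares str values code point by code point, which serves here
# as a stand-in for a binary (ordinal) sort.
words = ["apple", "Apple", "banana", "Banana"]
binary_sorted = sorted(words)
print(binary_sorted)  # ['Apple', 'Banana', 'apple', 'banana']

# The ASCII capitals occupy U+0041..U+005A; the small letters U+0061..U+007A.
print(hex(ord("A")), hex(ord("a")))  # 0x41 0x61
```

Note that every capitalized word lands ahead of every lowercase one, so "Banana" sorts before "apple". That is rarely what a human reader expects from a sorted list.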

In a dictionary, the uppercase also often does come first (or they are put together as multiple definitions in one entry).

In linguistic collations on Windows, in most locales[1], lowercase by convention comes first.

Like I said in the post "Why do the high surrogates have the low numbers?", however, it is simply a conceptual construct.

When you deal with collation in terms of weights, it is easy to take the uppercase letters as being somehow heavier since they are usually (bordering on always) bigger and taller.
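One way to picture the weight construct is a sort key with two levels: the letters decide first, and case only breaks ties, with the "heavier" case sorting later. The following is a toy sketch and not the actual Windows collation tables; `toy_sort_key` and its `lowercase_first` flag are hypothetical names invented for this illustration.

```python
def toy_sort_key(text: str, lowercase_first: bool = True):
    """Hypothetical two-level key: letters first, case only as a tiebreaker."""
    # Level 1: the letters themselves, with case folded away.
    letters = text.casefold()
    # Level 2: one case weight per character; the "heavier" case sorts later.
    heavy = str.isupper if lowercase_first else str.islower
    case_weights = tuple(1 if heavy(ch) else 0 for ch in text)
    return (letters, case_weights)


words = ["B", "a", "b", "A", "Ab", "aB", "ab", "AB"]

# Uppercase treated as heavier: lowercase comes first among equal letters.
lower_first = sorted(words, key=toy_sort_key)
print(lower_first)  # ['a', 'A', 'ab', 'aB', 'Ab', 'AB', 'b', 'B']

# The opposite construct: lowercase treated as heavier.
upper_first = sorted(
    words, key=lambda w: toy_sort_key(w, lowercase_first=False)
)
print(upper_first)  # ['A', 'a', 'AB', 'Ab', 'aB', 'ab', 'B', 'b']
```

Either way, both cases of a letter stay together ahead of the next letter. Flipping the flag only changes which member of each pair leads, which is why the choice is a convention rather than a matter of correctness.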

I have had people tell me that they think this is incorrect; they believe that it should always be the other way around. But for the most part that is simply rebelling against the construct we are using, and preferring a different one.

So, those of you out there who think uppercase should be sorted before lowercase, what is the conceptual construct you are using?

Just curious....

 

1 - Bonus points for anyone who knows which collation(s) under Windows break this rule without testing them first!

 

This post brought to you by "ṏ" (U+1E4F, a.k.a. LATIN SMALL LETTER O WITH TILDE AND DIAERESIS)


# Vorn on 13 Nov 2005 3:16 AM:

On the main blog page, the topic title is in all caps. Makes this particular post's title very strange.

Vorn

# Jerry Pisk on 13 Nov 2005 3:23 AM:

I think you meant that uppercase letters are usually bigger and taller, not lowercase ones.

I personally think uppercase should come first but that may be influenced by the years of using computers. Still, uppercase usually denotes two things - an abbreviation or a name. Should those be sorted before the lowercase variations, which would usually denote a generic term (Windows versus windows)? That'd be a matter of personal preference I guess.

# Norbert Lindenberg on 13 Nov 2005 4:26 AM:

If "uppercase comes first" is a rule in Unicode, then it is not without exceptions. ÿ comes before Ÿ in a comparison based on code points.

# Pavel Šrubař on 13 Nov 2005 4:50 AM:

People preferring AaBb will argue that uppercase letters indicate *proper names*, and that named things and people should have priority over generic words. It started from Adam, they say, and I bet there's a lot of women among them.

aAbB fiends, on the other hand, may object that everything grows from small to big. It seems illogical for a tall letter to precede its lower counterpart in an *ascending* sequence. When I was a kid, I was lower and thin; only much later did I ascend to my six feet uppercase.

# Baciu Valentin on 13 Nov 2005 6:47 AM:

Old printers had only uppercase letters. To keep compatibility, lowercase letters were simply added to the already existing set.
So, from this point of view, 'A' comes first.

# Dean Harding on 13 Nov 2005 5:47 PM:

The whole idea of sorting (at least for Latin-based scripts) is just convention anyway... I mean, you may well ask "why should 'A' come before 'a'?" but then why not ask "why should 'a' come before 'z'?" There doesn't seem to be any actual reasoning behind the 'sort order' of our alphabet at all!

Unless I'm missing something here...

# Michael S. Kaplan on 13 Nov 2005 6:13 PM:

Well, that is mostly true, Dean. Though individual languages and individual purposes often do have strong preferences for ordering that have to be respected. Even though it will at some level be arbitrary....

# Petr Kadlec on 14 Nov 2005 5:15 AM:

Note that someone might prefer AaBb for historical reasons -- Latin was originally written using majuscule only. (BTW, the current official Czech standard explicitly does not distinguish between uppercase and lowercase; this differs from its previous version, where lowercase was sorted before uppercase.)

# Jerry Pisk on 14 Nov 2005 2:04 PM:

OT: Is half of the readers of this blog Czech? Or is it that we just like to comment on things?

# Michael S. Kaplan on 14 Nov 2005 3:30 PM:

Hmmmm.... not sure. I might be really big in the Czech Republic after my visit there a few years ago!

# Jerry Pisk on 14 Nov 2005 5:27 PM:

Another thing that is not clear here: are both upper and lowercase A sorted before either case of B? That is, does an upper/lowercase letter weigh less than its counterpart but not less than the next letter in the alphabet, or do all letters of one case weigh less than all letters of the other case?

# Michael S. Kaplan on 14 Nov 2005 7:23 PM:

Yes, they both come before B.

'Ignore Case' here means literally ignore the case difference, do not consider there to be any difference.

# Centaur on 15 Nov 2005 3:10 PM:

Take any dictionary. Look at the page that lists the alphabet. In Russia, most sort the alphabet first by letter, then by case, with upper coming before lower.

referenced by

2010/03/09 Coloring outside the lines in the a-ness of the Hungarian Technical Sort

2010/03/06 Burn Windows Burn (aka If we want to unsay *this* one, we cannot say "Mu")

2007/12/06 In SQL Server, A-Z, A-z, a-Z, and a-z may not mean the same thing!

2006/11/01 If you add enough characters to a sort, intuitive distinction can suffer

2005/11/30 Expectations around collation

2005/11/26 Technically it *is* a hungarian sort

2005/11/18 Some sort of order to collation
