Technically it is a hungarian sort

by Michael S. Kaplan, published on 2005/11/26 04:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/26/495072.aspx

Back in August, in the post "Double compressions -- Hungarian goulash?", I described how double compressions worked in Windows and the .NET Framework.

And then a week ago, in "Hungarian is even more complicated than I thought", I talked about an additional interesting wrinkle in this particular language's collation.

There were some interesting comments in that post, like this one:

I tell you a story. I had a strange error on MS SQL Server. select ... where [Product Identifier] = '%SG%' did not find the product with the identifier of "KCSG01"

A friend suggested that maybe it treats "cs" as one letter. I said impossible, even MS can't be so crazy. And he was right - after setting collation to binary it worked!

It is completely amazing - who wanted this feature? Who needs it? Why did it have to be developed and hardcoded into Windows/MS SQL? I agree that a grammatical analyser function library might sometimes be useful to someone, but to hardcode it right into the OS!... Why?

When users search for "ddzs", they don't want to find "dzsdzs" - they are searching for LETTERS, you know, they don't want to keep all these grammatical rules in their heads. No one expects that their search input will be grammatically analysed!

So why has this feature been implemented?

That got me thinking about the fact that the Hungarian Technical Sort exists as an alternate sort for Hungarian (its LCID is 0x1040e). This sort has several characteristics that distinguish it from the standard Hungarian sort (0x040e):

None of the compressions that I have talked about previously
None of those Hungarian double compressions, either
The uppercase letters come before the lowercase ones, unlike most other language collations on Microsoft products

There is, in fact, nothing uniquely Hungarian about it, and anyone who wanted the uppercase/lowercase thing reversed might be happy with the ordering.

The perfect answer for those more technical situations when one does not want to be bogged down by those linguistic collation details, right? :-)
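A rough way to picture the difference, as a toy Python sketch and nothing like the real Windows sorting tables: the standard sort treats digraphs such as "cs" (and doubled forms such as "ddzs") as single letters before comparing, while a sort without compressions just sees the individual characters. The digraph list and the helpers `elements` and `contains` are made up for illustration.

```python
# Toy model only. The digraph list and both helpers are assumptions for
# illustration; they are not the tables or APIs that Windows uses.
DIGRAPHS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]  # longest first


def elements(text):
    """Split text into Hungarian-style letters, treating digraphs as one unit."""
    s = text.lower()
    out = []
    i = 0
    while i < len(s):
        # Double compression: "ddzs" stands for "dzs" + "dzs", "ccs" for "cs" + "cs".
        for d in DIGRAPHS:
            if s[i] == d[0] and s.startswith(d, i + 1):
                out += [d, d]
                i += 1 + len(d)
                break
        else:
            # Plain compression: "cs" is one letter, not "c" followed by "s".
            for d in DIGRAPHS:
                if s.startswith(d, i):
                    out.append(d)
                    i += len(d)
                    break
            else:
                out.append(s[i])
                i += 1
    return out


def contains(haystack, needle):
    """Substring test on whole letters rather than on raw characters."""
    h, n = elements(haystack), elements(needle)
    return any(h[i:i + len(n)] == n for i in range(len(h) - len(n) + 1))


print(elements("KCSG01"))                      # ['k', 'cs', 'g', '0', '1']
print(contains("KCSG01", "SG"))                # False: "s" is hidden inside "cs"
print("SG" in "KCSG01")                        # True: plain character matching
print(elements("ddzs") == elements("dzsdzs"))  # True: same two letters
```

The letter-level search misses "SG" inside "KCSG01" for the same reason the commenter's query did, and "ddzs" and "dzsdzs" come out as the same pair of letters. Drop the compressions and both surprises go away.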

This post brought to you by "ʥ" (U+02a5, a.k.a. LATIN SMALL LETTER DZ DIGRAPH WITH CURL)

# Nick Lamb on 26 Nov 2005 6:30 AM:

"There is, in fact, nothing uniquely Hungarian about it"

Yet no-one's surprised people like it. Did Microsoft ever do surveys of real users (not linguists, not people who have a nationalist propaganda mission) to find out how many of them actually want the supposed oddities of "their" language systemised in the computer? It's been mentioned before that a lot of people can't describe the collation rules for their language, but how many really /prefer/ the official collation rules, given an option? How many can look at a dozen lists and pick out the one that's "in the right order" ?

Unfortunately when something is done wrong in a million computers it may be a lot harder to fix than a mistake (or exaggeration) in a textbook or research paper. So we ought to be very careful before enshrining such rules in a computer program, even if implementing them is fun.

IMNSHO when you find yourself forcing people to stay true to someone's ideal of their culture or language, you've made a huge mistake somewhere. This applies to Welsh culture & language laws in the UK (forcing children to learn an essentially dead language to satisfy the ambitions of some nationalist politicians) even more than the Académie française (at least the Académie doesn't have force of law).

# Michael S. Kaplan on 26 Nov 2005 5:21 PM:

Nick,

When people are unhappy with behavior on Windows, we hear about it. And I am not talking about people who read this blog, I mean regular users who do not like the behavior.

I know you think what Windows does here is incorrect, but there are millions of users who use Windows and find the results to be intuitive, which implicitly suggests that you are in the minority.... :-)

# Nick Lamb on 27 Nov 2005 4:53 AM:

"I know you think what Windows does here is incorrect"

You've misconstrued my comment. I don't know whether this is incorrect, I was wondering (and you could have answered) whether it was ever put into a usability lab for comparison. I do know that several people in my circle were relieved when I showed them how to disable the "English" collation rules on their Unix systems, but my sample is biased, lots of technical and scientific types, not many warm fuzzy people.

By the way, how can they do that in Windows? They'd like to, but I couldn't find any way to help them.

# Michael S. Kaplan on 27 Nov 2005 6:32 AM:

Which English collation rules in particular?

# Nick Lamb on 27 Nov 2005 8:42 AM:

Particularly the case ordering, they want (THAT, THIS, the other) rather than (THAT, the other, THIS).
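The two orderings can be seen side by side in a small Python sketch. It assumes that plain code point order stands in for the ordering being asked for and that a case-insensitive key stands in for the linguistic one; neither is what Windows or any Unix locale actually runs.

```python
words = ["the other", "THIS", "THAT"]

# Code point (binary) order: every uppercase ASCII letter precedes every lowercase one.
binary_order = sorted(words)

# Case-insensitive order: compare the casefolded strings instead.
folded_order = sorted(words, key=str.casefold)

print(binary_order)  # ['THAT', 'THIS', 'the other']
print(folded_order)  # ['THAT', 'the other', 'THIS']
```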

# Michael S. Kaplan on 27 Nov 2005 11:52 AM:

We do not really get calls for binary ordering at all. There is no mode in Windows that gives you that currently.

Possible on UNIX? Well, okay. Perhaps that explains why UNIX isn't really a consumer OS? :-)

# Szajd on 29 Nov 2005 11:34 AM:

Nick,

Your thinking is bad, because you can't imagine a language that has, for example, letters consisting of two characters.

For us Hungarians, such a thing is not an "oddity", a "strange thing", or even a thing to think about; it's the most natural thing in the world.

I suppose your language is English, and that's bad, because it's hard for me to give you an example that lets you imagine yourself in such a situation.
But let's try it: imagine that the English language includes the letter 'ph': if your operating system handled that one as a letter P and H, THAT would be odd. Because it would break something that is actually natural to you.
