Double compressions -- Hungarian goulash?

by Michael S. Kaplan, published on 2005/08/10 10:38 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/10/449909.aspx

Different languages bring different expectations. And people who natively speak, write, and read languages have those different expectations.

Now that is fine, of course. It gets harder when those different expectations conflict with each other, or when they don't but a user who is trying out a language that is not their own is confused by the results.

This happens with genitive date forms, reverse diacritic sorting, and many other features specific to various languages supported by Windows. And there are more of these types of features to come in the future, of course.

I thought I would talk about another one today -- a feature that is seen in Hungarian, that of double compressions.

Now I have talked about compressions in the past, like the Traditional Spanish ch being treated as if it were a single element for the purposes of sorting. This feature I am talking about today is a little more complicated.

Basically, the feature is this: since dzs is a compression within Hungarian, if you see the doubled form ddzs it should be treated as equivalent to dzsdzs.

Now this is an important issue on two levels -- basically dealing with the fact that collation deals with both comparison and ordering. I'll talk about both aspects now.

COMPARISON -- In Hungarian comparisons, having ddzs treated as being equal to dzsdzs is simple enough. This may not be commonly needed though, in practice -- people likely use one or the other most of the time.

ORDERING -- In Hungarian ordering, having ddzs treated as being equal to dzsdzs is again simple enough (the cost is the same!), but in this case it may be more important in terms of day to day usage -- since if you are not aware of the feature and you expect the string in the wrong place then you may not find the data you are looking for.
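To make the equivalence concrete, here is a minimal Python sketch of the idea. It is not the Windows or .NET Framework implementation; the function names and the approach (rewriting the doubled form before splitting the string into sort elements) are assumptions made purely for illustration. The list of compressions is the standard set of Hungarian multi-letter consonants. The sketch also ignores real-world complications such as compound-word boundaries, where the same letter sequence may not be a doubled consonant at all.

```python
# Hungarian multi-letter consonants, longest first so "dzs" wins over "dz".
COMPRESSIONS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]


def expand_double_compressions(text):
    """Rewrite doubled forms such as 'ddzs' as 'dzsdzs' (hypothetical helper)."""
    s = text.lower()
    out = []
    i = 0
    while i < len(s):
        for comp in COMPRESSIONS:
            # The doubled spelling repeats only the first letter: d + dzs.
            if s.startswith(comp[0] + comp, i):
                out.append(comp + comp)
                i += len(comp) + 1
                break
        else:
            # No doubled compression starts here; copy the character as-is.
            out.append(s[i])
            i += 1
    return "".join(out)


def collation_elements(text):
    """Split text into sort elements, treating each compression as one unit."""
    s = expand_double_compressions(text)
    elements = []
    i = 0
    while i < len(s):
        for comp in COMPRESSIONS:
            if s.startswith(comp, i):
                elements.append(comp)
                i += len(comp)
                break
        else:
            elements.append(s[i])
            i += 1
    return elements


print(expand_double_compressions("ddzs"))
print(collation_elements("ddzs") == collation_elements("dzsdzs"))
```

Because both spellings produce the same list of elements, any comparison or sort key built on top of that list would treat ddzs and dzsdzs identically, which is the behavior described above for both comparison and ordering.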

This particular feature does its work on all of the various digraph/trigraph compressions that exist in Hungarian. It is noteworthy that when non-native speakers of the language notice the issue, they are not sure whether there is a bug or not. But that actually points back to the strength that exists in Windows and the .NET Framework by virtue of the solid connection between linguists and developers. And also the strength of strong international feature testers. :-)

In the end, attempting to reverse engineer an implementation is (in my opinion) a flawed approach, though the observations of others can be quite interesting and illuminating as they delve. :-) The real strength of an implementation is the people who reverse the dictionaries, the core language elements that give expected orderings. It is like the difference between drawing a map by looking at someone else's map as a source and drawing a map using the actual region as a source -- one is imitation, the other is art (in the cartography case, it is a bit more harsh -- one is theft of intellectual property and the other is not!).

The other reason why I think that attempting to reverse engineer an implementation is flawed is that it is unlikely that every case will be handled (there are too many of them!). If you base the support on actual languages then you can work to have complete or nearly complete feature sets; if, however, you base the support on someone's implementation then you never know what features you are missing -- and what language support might be thereby incomplete.

Does it make us better? Probably not the sort of thing subject to that kind of value judgment, so I won't say one way or the other. But I know that I feel much more like I am doing my level best when I handle things this way.... :-)

Thought you might find this amusing, in the international/Hungarian goulash sense. If you find it offensive I apologize.

Sung to the tune of Brahms' Hungarian Dance #5:
http://lyrics.rare-lyrics.com/A/Allan-Sherman/Hungarian-Goulash-No-5.html

Just nitpicking, but actually goulash consists of fewer components*. And we don't eat it that frequently at all :)

Off topic: Michael, have you noticed that you've been Joel'd? [http://www.joelonsoftware.com/items/2005/08/10.html]

* One typical version is onion, beef, hot paprika, green paprika, potato, tomato and spices. Even the famous bolognese sauce takes more components :)

I fully support the idea of another book, this time not bound to a language, but the Windows platform itself. Although I am not involved in i18n more than an average developer, I find this topic very interesting. Checking amazon.com, there are not so many books about this.

As for the gulyás, anytime you come here, I will be happy to cook for you :)

Less off topic, I tried to find any word used in everyday life which has a double 'dzs', but the ones I could come up with are rather forced, like 'briddzsentri' (bridge gentry). Virtually every word I know with 'dzs' is an 'imported' word, like 'nindzsa', 'dzsezz', 'dzsip' or 'dzsinn' (not so hard to find out the original ones), and often we use the original forms, except for words imported ages ago, like 'handzsár' (scimitar) and 'lándzsa' (spear), which most likely come from Turkish, judging by their sound.

I think we have given you some extra work also with the two sort orders (default and technical) of Hungarian :)

I don't think there is anyone who minds the "extra work", so no worries there. :-)

You could take a look at the 2nd edition of Developing International Software to see if that helps you as an intro to the topic....

Thanks for the tip on Developing International Software. It has been on my wish list since February, but other books always took its place in my orders. Now at least I know that it is worth buying.

Hi Archer --

Not much to do with Hungarian, huh? :-)

I am not familiar with the term "natural sorting" so I cannot say the best way to do it in SQLS. You may want to ask in the newsgroups or other product support venues, though....