by Michael S. Kaplan, published on 2005/08/10 10:38 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/10/449909.aspx
Different languages bring different expectations, and the people who natively speak, write, and read those languages carry those expectations with them.
Now that is fine, of course. It gets harder when those expectations conflict with each other, or when they don't conflict but a user who is trying out a language that is not their own is confused by the results.
This happens with genitive date forms, reverse diacritic sorting, and many other features specific to various languages supported by Windows. And there are more of these types of features to come in the future, of course.
I thought I would talk about another one today -- a feature that is seen in Hungarian, that of double compressions.
Now I have talked about compressions in the past, like the Traditional Spanish ch being treated as if it were a single element for the purposes of sorting. This feature I am talking about today is a little more complicated.
Basically, since dzs is a compression within Hungarian, the doubled form ddzs should be treated as equivalent to dzsdzs.
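To make the equivalence concrete, here is a minimal sketch in Python of the expansion step. It is not how Windows or the .NET Framework implements the feature; the function name and the table are hypothetical, and the table assumes the standard Hungarian multi-letter consonants (only dzs is named above). It is purely mechanical, so it knows nothing about morpheme boundaries.

```python
import re

# Doubled spellings mapped to their fully written-out forms.
# Assumption: the standard Hungarian digraphs/trigraph; illustrative only.
DOUBLED = {
    "ddzs": "dzsdzs",
    "ccs": "cscs",
    "ddz": "dzdz",
    "ggy": "gygy",
    "lly": "lyly",
    "nny": "nyny",
    "ssz": "szsz",
    "tty": "tyty",
    "zzs": "zszs",
}

# Longest alternatives first, so "ddzs" is tried before "ddz".
_PATTERN = re.compile("|".join(sorted(DOUBLED, key=len, reverse=True)))


def expand_double_compressions(text: str) -> str:
    """Rewrite doubled compressions in full (lowercased for simplicity)."""
    return _PATTERN.sub(lambda m: DOUBLED[m.group(0)], text.lower())


def equivalent(a: str, b: str) -> bool:
    """Compare two strings after expanding doubled compressions."""
    return expand_double_compressions(a) == expand_double_compressions(b)
```

With this sketch, `equivalent("briddzs", "bridzsdzs")` is true, while a plain string comparison of the two would of course say they differ.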
Now this is an important issue on two levels -- basically dealing with the fact that collation deals with both comparison and ordering. I'll talk about both aspects now.
COMPARISON -- In Hungarian comparisons, having ddzs treated as being equal to dzsdzs is simple enough. This may not be commonly needed though, in practice -- people likely use one or the other most of the time.
ORDERING -- In Hungarian ordering, having ddzs treated as being equal to dzsdzs is again simple enough (the cost is the same!), but in this case it may be more important in terms of day to day usage -- since if you are not aware of the feature and you expect the string in the wrong place then you may not find the data you are looking for.
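The effect on ordering is easier to see with a toy sort key. The following is a rough sketch in Python, not the Windows collation tables: the names are hypothetical, the list of compressions is my assumption of the standard Hungarian ones, and accented vowels are not handled at all. Each compression becomes one element that sorts after its first letter, and a doubled form contributes that element twice.

```python
# Multi-letter elements, longest first so "dzs" is tried before "dz".
# Assumption: standard Hungarian compressions; illustrative only.
COMPRESSIONS = ["dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs"]

# Doubled spelling -> element, e.g. "ddzs" -> "dzs" (emitted twice).
DOUBLED = {unit[0] + unit: unit for unit in COMPRESSIONS}


def collation_elements(word):
    """Split a word into sort elements, expanding doubled compressions."""
    word = word.lower()
    elements = []
    i = 0
    while i < len(word):
        for doubled, unit in DOUBLED.items():
            if word.startswith(doubled, i):
                elements += [unit, unit]
                i += len(doubled)
                break
        else:
            for unit in COMPRESSIONS:
                if word.startswith(unit, i):
                    elements.append(unit)
                    i += len(unit)
                    break
            else:
                elements.append(word[i])
                i += 1
    return elements


def sort_key(word):
    """A compression sorts after every plain use of its first letter."""
    return [(e[0], len(e) > 1, e) for e in collation_elements(word)]


words = ["bridzset", "briddzsel", "bridzs"]
print(sorted(words))                # code point order
print(sorted(words, key=sort_key))  # order under this sketch
```

In this sketch the code point sort puts briddzsel first, while the keyed sort gives bridzs, briddzsel, bridzset. That is exactly the kind of "string in the wrong place" surprise described above.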
This particular feature does its work on all of the various digraph/trigraph compressions that exist in Hungarian. It is noteworthy that when non-native speakers of the language notice the issue, they are not sure whether there is a bug or not. But that actually points back to the strength that exists in Windows and the .NET Framework by virtue of the solid connection between linguists and developers. And also the strength of strong international feature testers. :-)
In the end, attempting to reverse engineer an implementation is (in my opinion) a flawed approach, though the observations of others can be quite interesting and illuminating as they delve. :-) The real strength of an implementation is the people who research the dictionaries, the core language elements that give expected orderings. It is like the difference between drawing a map by looking at someone else's map as a source and drawing a map using the actual region as a source -- one is imitation, the other is art (in the cartography case, it is a bit more harsh -- one is theft of intellectual property and the other is not!).
The other reason why I think that attempting to reverse engineer an implementation is flawed is that it is unlikely that every case will be handled (there are too many of them!). If you base the support on actual languages then you can work to have complete or nearly complete feature sets; if, however, you base the support on someone's implementation then you never know what features you are missing -- and what language support might be thereby incomplete.
Does it make us better? Probably not the sort of thing subject to that kind of value judgment, so I won't say one way or the other. But I know that I feel much more like I am doing my level best when I handle things this way.... :-)
This post brought to you by "DZ" (U+01F1, a.k.a. LATIN CAPITAL LETTER DZ)
# Maurits [MSFT] on 10 Aug 2005 12:51 PM:
# Michael S. Kaplan on 10 Aug 2005 1:58 PM:
# Peter from Hungary on 11 Aug 2005 3:14 AM:
# Michael S. Kaplan on 11 Aug 2005 7:28 AM:
# Peter on 11 Aug 2005 8:58 AM:
# Michael S. Kaplan on 11 Aug 2005 11:17 AM:
# Michael S. Kaplan on 11 Aug 2005 11:18 AM:
# Peter on 12 Aug 2005 2:03 AM:
# Archer on 16 Aug 2005 6:29 PM:
# Michael S. Kaplan on 16 Aug 2005 8:13 PM: