Hungarian is even more complicated than I thought
by Michael S. Kaplan, published on 2005/11/13 01:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/11/13/491646.aspx
Back in August in the post Double compressions -- Hungarian goulash? I described how double compressions worked in Windows and the .NET Framework.
It can indeed be a complicated feature to support, and not just for the reasons I explicitly stated: you not only need to make text pieces like ddzs equivalent to dzsdzs, you also have to treat both strings as if they contained two sort elements (more on sort elements here and here).
It is not so hard to do though, and we have supported it for a long time without people complaining about the support.
It turns out that the truth is even more complicated than that, though!
You see in the language there are two behaviors that are supposed to be captured:
- As I said, ddzs should be treated as equivalent to dzsdzs when doing comparisons;
- Additionally, ddzs should be sorted as if it were d + dzs rather than dzsdzs.
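To make the two readings concrete, here is a minimal Python sketch that splits a string into sort elements either way. The letter list, the `sort_elements` name, and the `expand_doubles` flag are all illustrative assumptions; this is not how Windows or the .NET Framework actually implements it.

```python
# Multi-character Hungarian letters, longest first so "dzs" is tried before "dz".
LETTERS = sorted(["cs", "dz", "dzs", "gy", "ly", "ny", "sz", "ty", "zs"],
                 key=len, reverse=True)

# A double compression is the short spelling of a doubled letter:
# first character of the letter + the letter itself (ddzs = dzs + dzs).
DOUBLED = {letter[0] + letter: letter for letter in LETTERS}
DOUBLED_KEYS = sorted(DOUBLED, key=len, reverse=True)


def sort_elements(text, expand_doubles=True):
    """Split text into sort elements (hypothetical helper, toy rules only)."""
    elements = []
    i = 0
    while i < len(text):
        if expand_doubles:
            short = next((k for k in DOUBLED_KEYS if text.startswith(k, i)), None)
            if short:
                # Identity reading: ddzs counts as two dzs elements.
                elements += [DOUBLED[short]] * 2
                i += len(short)
                continue
        # Otherwise take the longest letter at this position, or one character.
        letter = next((l for l in LETTERS if text.startswith(l, i)), text[i])
        elements.append(letter)
        i += len(letter)
    return elements


print(sort_elements("ddzs"))                        # ['dzs', 'dzs']
print(sort_elements("dzsdzs"))                      # ['dzs', 'dzs']
print(sort_elements("ddzs", expand_doubles=False))  # ['d', 'dzs']
```

The first two calls give the same elements, which is the equivalence behavior; the third gives the d + dzs reading. A single comparison function has to pick one of the two.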
As I have said previously though, comparison is sorting on Windows. For linguistic purposes, both are done through the same basic functions, such as CompareString.
In order to support these two different operations, you would need to have an additional EqualString function to give the linguistic absolute equality question while still giving the different answer for collation. And the behavior of EqualString would almost always be identical to the behavior of CompareString returning CSTR_EQUAL, with the only exceptions being that:
- cases like the one above in Hungarian double compressions could in theory be supported;
- it could be a bit faster since it no longer has to detect which comes first; any difference of any weight level would cause the function to immediately return the result.
Note that there is not nearly as good a reason for there being both an RtlCompareUnicodeString and an RtlEqualString in ntdll.dll, because both of those functions return as soon as any difference is found. Although we could argue about the speed difference between "the difference of two numbers not being zero" and "two numbers not being equal" in compiled code, it is nowhere near the order-of-magnitude difference in speed you would see in a linguistic function that could stop on the first character and return FALSE when comparing Abcdefg and abcdefĝ, rather than needing to walk the whole string to know it should return CSTR_LESS_THAN.
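A rough Python sketch of that speed argument, using a toy three-level weighting (base letter, diacritic, case) derived from `unicodedata`. The `weights`, `compare`, and `equal` functions are hypothetical stand-ins rather than the real CompareString tables, and the sketch assumes precomposed input with no compressions.

```python
import unicodedata


def weights(ch):
    """Toy (primary, secondary, tertiary) weights for one precomposed character."""
    decomposed = unicodedata.normalize("NFD", ch)
    base = decomposed[0].lower()   # primary: the base letter
    marks = decomposed[1:]         # secondary: any combining diacritics
    return (base, marks, ch.isupper())  # tertiary: case


def compare(a, b):
    """Ordering: a whole level must tie before the next level is consulted."""
    for level in range(3):
        ka = [weights(c)[level] for c in a]
        kb = [weights(c)[level] for c in b]
        if ka != kb:
            return -1 if ka < kb else 1
    return 0


def equal(a, b):
    """Identity: any difference at any level ends the walk immediately.

    Returns (is_equal, characters_examined).
    """
    if len(a) != len(b):
        return (False, 0)
    for examined, (x, y) in enumerate(zip(a, b), start=1):
        if weights(x) != weights(y):
            return (False, examined)
    return (True, len(a))


print(compare("Abcdefg", "abcdef\u011d"))  # -1
print(equal("Abcdefg", "abcdef\u011d"))    # (False, 1)
```

The ordering function has to get all the way to the final character before it finds the diacritic difference that decides the order, while the identity function gives up on the A versus a case difference at the very first character.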
One unfortunate side effect of this post and talking about a theoretical EqualString is that the more I type, the more I think it might actually be a useful function to have available, given the large number of times that one might really prefer to answer the absolute identity question rather than a which one comes first question.
Though in principle, in most cases absolute identity is only important in binary/ordinal comparisons, not linguistic ones -- such as looking at filenames and other symbolic identifiers. This issue with Hungarian double compressions is a great example of an exception to that principle, of course.
It is interesting to speculate why complaints have never been escalated by the Hungarian users of Windows, since as our collation is sorting rather than absolute identity, the behavior is technically incorrect. Although I suspect that
- The number of real world situations where such string comparisons might return different results could be reasonably small;
- Hungarian customers may simply accept and enjoy the identity-type behavior in both situations;
- Those customers may actually be resigned to the situation;
- Some other reason I do not understand.
Truth be told I am hoping it is mostly the first and/or second of these four options. :-)
Though I must say that language issues such as this one fascinate me and sometimes frighten me, as I think about how the behavior on Windows can shape user experiences and expectations across a culture. It definitely encourages me to try hard to do right by the language and not make decisions that would negatively impact a language or its usage!
This post brought to you by "ĝ" (U+011d, a.k.a. LATIN SMALL LETTER G WITH CIRCUMFLEX)
# Szajd on Sunday, November 13, 2005 6:26 AM:
Yes, Hungarian is quite difficult and complex.
Let me illustrate you another one: we have letters that consist of two characters (dzs is the only letter in the Hungarian alphabet which is three characters long). We have these letters also: sz and zs. And we have this word here: egészségedre, which is not a rare word (it means: "bless you"). How should a machine know, that in the middle its [sz + s] or [s + zs]?
And also with dzs, you could imagine the following example now: you not only have the problem of ddzs being [dzsdzs] or [d + dzs], but dzs itself could be [d + zs]! Wow.
But I for myself can't tell you any real-world examples for these. I mean I really can't say anything for the problem you described. For the first problem I described, there are some other words in Hungarian where there could be problems with sz and zs. And in the third one, one might imagine a compound word, where the first part ends with a letter D, and the second starts with zs.
You wanna learn some Hungarian? It's easy. ;)
# denis bider on Sunday, November 13, 2005 9:31 AM:
I think people need to simplify their languages.
In my experience, the primary reason such complexities endure is destructive: nationalistic pride, arising out of a fear that the nation will otherwise be trampled. Fear of change, fear of foreigners.
In order to protect this sense of false security, people go to great lengths to maintain traditions which would have long become a useless burden otherwise. Examples are plenty : the French language board, a Don Quixotic attempt to "protect" against English; Japan and Kanji, taking children a perplexingly long time to learn, which they could spend learning many more things that would be more useful (including, perhaps, a foreign language?).
We are all human; only good can come from mixing freely. Maintaining disfunctional traditions just for the sake of it does not lead anywhere (except perhaps to war). Nationalism is inherently bad.
I'm not American, and not from an English speaking country either. But I think people should embrace the opportunity to unite (in a global language, and a global culture associated with it) rather than putting on brakes.
# Michael S. Kaplan on Sunday, November 13, 2005 9:56 AM:
I do feel that inconsistencies should be fixed so that things can be implemented and used, but I do not feel it is fair to 'simplify' language where 'simplify' translated to cultural imperialism.
I think it is unfortunate that it has taken as long as it had to support some languages, but I would not want to see language diversity destroyed....
# Michael S. Kaplan on Sunday, November 13, 2005 10:01 AM:
Hi Szajd --
Our implementation would find the first one and use it for sorting, which would be unfortunate if the second one was the intended one. And of course without a dictionary behind it there is no way to know the difference between dzs and d+zs.
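A minimal Python sketch of "find the first one and use it": a left-to-right, longest-match scan with no dictionary behind it. The letter list and the `tokenize` name are assumptions made for illustration only.

```python
# Multi-character Hungarian letters, longest first so "dzs" is tried before "dz".
LETTERS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")


def tokenize(word):
    """Greedy scan: at each position take the longest letter that matches."""
    tokens = []
    i = 0
    while i < len(word):
        match = next((l for l in LETTERS if word.startswith(l, i)), word[i])
        tokens.append(match)
        i += len(match)
    return tokens


# The scan reaches "sz" first, so the middle is read as sz + s, never s + zs.
print(tokenize("egészségedre"))
# Without a dictionary, dzs is always one letter, never d + zs.
print(tokenize("dzs"))
```

The scan is consistent because it never looks ahead to see whether a later split would have been the intended one.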
As I just wrote to Denis, I am in favor of simplifications that eliminate inconsistencies or rules that cannot otherwise be implemented in software, but NOT if such changes are just due to laziness (thus vertical support or bidi is not in the same category for me as linguistic compressions).
I wish I had time to learn Hungarian!
# denis bider on Sunday, November 13, 2005 10:02 AM:
Diversity is not destroyed, it is replaced. All diversity requires maintenance, if by "diversity" you mean things that are alive (as opposed to a dead, but well-documented language such as Latin). The amount of diversity that is maintained in the world is therefore a function of the number of people living and the freedom they have. Language boards like the French actually restrict the freedom people have in expressing in diverse ways, so these attempts actually decrease the total amount of diversity.
# Michael S. Kaplan on Sunday, November 13, 2005 10:09 AM:
I hear you, but I do not think that people outside a culture have the right to choose what is simply cruft and what is valuable. If change comes from within (something that has happened many times in many places) then that is fine. But if it comes from without then it is imperialism, and if software enforces it then the software is an agent of a destructive policy....
# denis bider on Sunday, November 13, 2005 11:04 AM:
I agree, certainly the freedom of the people to choose is theirs. But software that fails to implement local idiosyncrasies does not enforce cultural imperialism. It just fails to implement a locally important feature, and it's going to get a bad reputation for this, and if it's not improved it can be locally replaced by a competitor that does implement the feature. :)
(If you're saying Microsoft's software cannot be replaced by a local competitor, you're admitting a monopoly. ;) )
# Michael S. Kaplan on Sunday, November 13, 2005 11:31 AM:
Actually, what I notice is how hard other software tries to emulate (or at least be compatible with) Microsoft software.
If there are two dictionaries with two different collations, and Microsoft after working with government and linguistic experts chooses which one to use, and then other implementations follow, then have these companies affected language policy, creating language policy?
It definitely makes a difference, one I have actually lost sleep over on occasion!
# Szajd on Sunday, November 13, 2005 4:44 PM:
Also, is there a way to find out the human-readable algorithm for a specific language's sorting? I'm pretty interested in Hungarian's. Or is that a secret?
But otherwise, I don't think it's a secret, but I couldn't find the difference about the so-called traditional and technical Hungarian sorting. What are these two things?
# Michael S. Kaplan on Sunday, November 13, 2005 5:12 PM:
Microsoft does not publish its tables or its algorithm, beyond the sorts of hints given in the first edition of Developing International Software and in posts on this blog.
It is not an ownership issue (since we do not own a language's sort any more than a map company owns the land they sell maps for), but it is an IP issue, given the time, effort, and expense to develop (the same way a map company would not want people copying their maps and selling the copies).
The Hungarian technical sort does not have any of those compressions (the combinations of letters such as dz that both of us were talking about earlier) and it puts uppercase characters before lowercase ones.
# Peter on Wednesday, November 16, 2005 7:05 AM:
The top reason is Reason #1 -- like Szajd, I cannot think of any real-world examples when it would make a difference.
The only case I remember when someone complained was due to the difference between the default and the technical sort order of Hungarian: using default order, a case-insensitive comparison of "PicSize" and "picsize" will report that the two texts are different, because "cs" is a stand-alone letter in Hungarian. As Szajd pointed out, the obvious problem with this is that no software can tell what is "c" + "s" and what is "cs", or what is "sz" + "s" and what is "s" + "zs".
# Michael S. Kaplan on Wednesday, November 16, 2005 7:17 AM:
Ah, the reason for the first problem you mention is that when you use the default Hungarian we treat CS, Cs, and cs specially, but not cS (see http://blogs.msdn.com/michkap/archive/2005/07/17/439742.aspx for more information on that one). And then in the technical sort there are no compressions, so they are definitely related. But abbreviations like that can definitely make things confusing!
For the second issue, we will see the "sz" before we ever get to the last "s" so we have behavior that is consistent, if not correct....
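The PicSize case can be sketched in a few lines of Python. This toy models only the cs letter, assumes (per the reply above) that CS, Cs, and cs are treated as one letter while cS is not, and uses a hypothetical `case_insensitive_elements` helper rather than anything from the real implementation.

```python
# Assumption: only these three casings of "cs" are treated as the single letter.
CS_FORMS = {"CS", "Cs", "cs"}


def case_insensitive_elements(text):
    """Split text into lowercase sort elements, honoring only the cs letter."""
    elements = []
    i = 0
    while i < len(text):
        if text[i:i + 2] in CS_FORMS:
            elements.append("cs")             # one sort element
            i += 2
        else:
            elements.append(text[i].lower())  # ordinary character, case folded
            i += 1
    return elements


print(case_insensitive_elements("PicSize"))  # ['p', 'i', 'c', 's', 'i', 'z', 'e']
print(case_insensitive_elements("picsize"))  # ['p', 'i', 'cs', 'i', 'z', 'e']
```

The two element lists differ, so even a comparison that ignores case sees two different strings, which matches the surprise described above.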
# Peter on Friday, November 18, 2005 5:47 AM:
Thanks Michael, that makes sense and appears to be the right approach from the comparison's point of view.
I fully understand, however, the surprise of the developer who was faced with this PicSize vs. Picsize problem -- as humans, we expect that the OS will recognize that we used English and ignore the locale-based comparison :)
# Miklos Hollender on Friday, November 18, 2005 9:14 AM:
The IDEA of collation itself is a completely crazy thing.
I tell you a story. I had a strange error on MS SQL Server. select ... where [Product Identifier] LIKE '%SG%' did not find the product with the identifier of "KCSG01".
A friend suggested that maybe it treats "cs" as one letter. I said impossible, even MS can't be so crazy. And he was right - after setting collation to binary it worked!
It is completely amazing - who wanted this feature? Who needs it? Why did it have to be developed and hardcoded into Windows/MS SQL? I agree that a grammatical analyser function library might sometimes be useful to someone, but to hardcode it right into the OS!... Why?
When users search for "ddzs", they don't want to find "dzsdzs" - they are searching for LETTERS, you know, they don't want to keep all these grammatical rules in their heads. No one expects that their search input will be grammatically analysed!
So why has this feature been implemented?
# Michael S. Kaplan on Friday, November 18, 2005 9:26 AM:
Hi Miklos --
You are in the minority here, not the majority. Most users do want the collation to use the conventions of the language. This feature has existed for over a decade and is actually highly valued by a lot more people than it is considered 'crazy.'
# Zka on Saturday, November 19, 2005 3:05 AM:
I agree Miklós. I've never seen anyone missing this crazy collation... But I've seen lots of software errors caused by it... However it's not hardcoded into MSSQL servers - thankfully :)
# Peter on Monday, November 21, 2005 2:37 AM:
I agree with Michael -- collation is the foundation of sorting and most casual computer users expect their programs to sort their data in the natural language order. Indeed, it is often overlooked that the default Hungarian collation takes compressions into consideration and that string comparison is affected by this. At least developers and DBAs should be aware of this, however.
MSSQL 2000 (maybe earlier versions as well) and a bunch of other DBMS's like Interbase/Firebird allow you to define the collation per column, so it might be a good idea to use the default Hungarian collation only on fields that absolutely need it. Alternatively, switching the entire DB to technical/binary collation could help (Hungarian_Technical instead of Hungarian).
Another option is to specify a different collation for search, e.g.
SELECT ... WHERE [Product Identifier] LIKE '%SG%' COLLATE Hungarian_Technical_CI_AS