My own personal thoughts about collation in the Mono project

by Michael S. Kaplan, published on 2005/11/03 18:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/03/484974.aspx


(IMPORTANT NOTE: this post is just my personal opinions about the topic; I am certainly speaking for neither Microsoft nor the Mono project. If there was ever a time to read this blog's disclaimer, it is right now!)

(IMPORTANT NOTE #2: I am not getting into the philosophical issues of proprietary versus open source, really trying to stick to technical issues. I hope the comments, if any, will do the same. Thank you for your consideration.)

Back in the beginning of August, in the post "The further you look into it, the further things stick out", I worked through a few of the problems that Atsushi Eno had reported from his own investigation of collation for the Mono implementation, when looking at the results in ours.

Shortly after that, this post went up, which describes the experiences, as well as the conclusions, after working hard to get the same results as Microsoft.

Of course long time readers of this blog know that Microsoft does not use the Unicode Collation Algorithm, and if you read that post you might even know why. :-)

Of course Atsushi Eno is right: this does make it harder for a project like Mono to emulate the results, especially since there is no easy or even reasonably difficult way to reverse engineer the results.

I cannot tell from the brief description, but it looks like he has figured out the DEFAULT table but has not picked up all of the different collations for the various cultures. That would make sense, although the result would probably still be missing information about Kana, about Old Hangul, and possibly even about nonspacing marks, depending (I am not running Mono to test any of that, so I really cannot say for sure).
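To make the distinction between a default table and per-culture collations concrete, here is a minimal Python sketch. It is a toy model only: the weights, the `sort_key` helper and the `SWEDISH_TAILORING` table are hypothetical names invented for illustration, and none of it reflects the actual data or algorithm used by Microsoft or by Mono. The default behavior gives a nonspacing mark only a secondary difference, while the tailoring treats "ä" as a separate letter that sorts after "z", as Swedish does.

```python
import unicodedata

# Hypothetical tailoring: Swedish treats these as distinct letters after "z".
# Each value is a (base weight, tie-breaker) primary weight.
SWEDISH_TAILORING = {
    "\u00e5": (ord("z"), 1),  # å
    "\u00e4": (ord("z"), 2),  # ä
    "\u00f6": (ord("z"), 3),  # ö
}

def sort_key(text, tailoring=None):
    """Toy two-level key: primary weights first, nonspacing marks second."""
    tailoring = tailoring or {}
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFC", text.casefold()):
        if ch in tailoring:
            # Culture-specific override: the whole letter gets its own weight.
            primary.append(tailoring[ch])
            continue
        for part in unicodedata.normalize("NFD", ch):
            if unicodedata.combining(part):
                # Nonspacing mark: no primary weight, secondary difference only.
                secondary.append(ord(part))
            else:
                primary.append((ord(part), 0))
    return (primary, secondary)

words = ["zebra", "\u00e4pple", "apple"]
default_order = sorted(words, key=sort_key)
swedish_order = sorted(words, key=lambda w: sort_key(w, SWEDISH_TAILORING))
print(default_order)  # "äpple" sorts next to "apple"
print(swedish_order)  # "äpple" sorts after "zebra"
```

An implementation that only has the default behavior gets the first ordering for every culture, which is exactly the kind of gap described above.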

So now I will give my opinions on the matter. :-)

These are just my opinions, and not those of Microsoft. Just in case I was not clear about that earlier....

I noticed on the pages the heavy concern about performance, comparing ICU to Microsoft to Mono; on that point I would say as Gene Apperson said so many years ago: "It doesn’t matter how fast your code is if it doesn’t work" (this was later attributed to the fictional "Joe Hacker" in Bruce McKinney's Hardcore Visual Basic, though I think Bruce correctly attributed it somewhere in the book, too). This is nowhere more true than in collation, where correct results are much more important than results that are fast but may or may not meet customer needs. If they do not, then why bother writing the code in the first place?
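As a small illustration of that trade-off, the Python sketch below compares a fast code point sort with a slower key that ignores case and nonspacing marks at the first level. Both `ordinal_key` and `linguistic_key` are hypothetical helpers written for this example, and the second one is a deliberately simplified stand-in, not a real collation implementation.

```python
import unicodedata

def ordinal_key(text):
    # Fast: compares raw code points, with no linguistic knowledge at all.
    return [ord(ch) for ch in text]

def linguistic_key(text):
    # Slower toy key: ignore case and nonspacing marks at the first level,
    # then fall back to the raw text as a tie-breaker.
    decomposed = unicodedata.normalize("NFD", text.casefold())
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base, text)

words = ["Zebra", "apple", "\u00c4pfel"]
by_code_point = sorted(words, key=ordinal_key)
by_language = sorted(words, key=linguistic_key)
print(by_code_point)  # uppercase "Z" comes before lowercase "a"
print(by_language)    # "Äpfel" and "apple" come before "Zebra"
```

The first ordering is cheaper to compute, but it is not one that a user looking at a sorted list would expect.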

If you ask me, Mono's collation work would be better off doing one of two things, either

Now for (B) I am not

And similarly I do not know enough about the Mono project to understand the philosophical/technical issues on that side, either. It may be impossible to get something like that done, on either or both sides of the equation.

I just do know that it is a nightmare to try and emulate complex algorithms built over more than a decade (the major effort to improve the tables as we are doing in Vista alone is a nightmare to try and handle from a change management standpoint if you have to try to make educated guesses as to how to obtain the results -- what will they do in future versions?).

Which is why I am suggesting that the current technique just seems way too sloppy to me -- it is much better to either follow path (B) and match the implementation more directly/conventionally or follow path (A) and match the conceptual goals by working to fill in appropriate linguistic support off of a good solid base. I think they would do better in the long run, and if all of us are trying to support the customers then in the long run we will both be giving comparable results any time we support the same customers.

Anyway, just a few random thoughts, feel free to disregard....

 

This post brought to you by "〇" (U+3007, a.k.a. IDEOGRAPHIC NUMBER ZERO)
One of those no weight characters that is fixed in Vista....


# Feral Boy on 3 Nov 2005 10:32 PM:

I know I'm extremely late on this, but I actually wanted to respond to your spoof article on the Matrix Reloaded Architect speech you did back in May (absolutely hilarious, by the way). Obviously, there's no longer a comment option on that old thread, so I'll do it here.

I'm not a programmer, but I'm trying to figure out some of the more obscure comments made by the Architect, and I was hoping you could help me out. I tried to find out what a "prime program" is, but I couldn't find much on it. My fellow philosophical types on the Matrix forums insist that a prime program is simply another way of saying the "main" program (i.e. the Matrix itself), but I don't buy it.

I found out that the phrase actually was coined by some IBM employee named Roy Maddux back in 1975. As far as I can tell, the concept relates to a single-entry, single-exit path that can't be broken down into any smaller steps. If this is the case, then I would say that the prime program is referring to the Path of the One. I'm hoping that the programmers on this site will have mercy on me and explain this stuff in laymen's terms so I can understand it. Anything that is shared would be GREATLY appreciated!

# CornedBee on 4 Nov 2005 4:49 PM:

I think I'm involved enough in the open source movement to answer at least some of this.

> what we try to do -- give appropriate results for the various languages based on doing the research

While certainly a possibility, you have to consider who develops Mono: like any other open source project, it is driven mostly by volunteers, and coders at that. These people most likely know very little about languages. And an open source project does not have the resources necessary to hire experts, like MS does. (Mono gets some money from Novell, I think, but that's probably used for development architecture.)
Perhaps it would be possible to get people from different countries help out with their own native knowledge. However, such an attempt might also just confirm what I suspect: that most people couldn't formulate collation rules for their own native language even if they tried. I certainly couldn't for German.

Regarding B, this is just impossible for various reasons.
First is the legal problem: whether the license is for actual code or just the collation algorithm and tables in an abstract form, it would affect the code of the Mono project. The Mono class libraries, however, are distributed under the LGPL, a license that aims to ensure that everyone is free to reuse the code in any way they see fit, free of charge, while not interfering too greatly with the ability to use the code in non-free software (as opposed to the GPL, which makes this impossible). As such, code that is hindered by any kind of restrictive license cannot be put under the LGPL (or the MIT X11 license, which is somewhat less strict but still incompatible with restricted code): an LGPL project cannot contain code that is under a license, except if the license grants unlimited use to everyone and everything. The Ogg Theora codec stems from such a license.
The second issue is one of philosophy. You will undoubtedly encounter many people who are simply unwilling to license anything from Microsoft, no matter how beneficial such action would be, out of general principle. While the cooler-headed people would probably prevail in the end, the Mono project would carry a stigma from that day on, which they would never get rid of.
The third issue is monetary. Given the nature of the license that is required to be LGPL-compatible, buying such a license is equivalent to paying MS to make their own code public and put it under the LGPL (with an exception for their own code). In other words, it would be _very expensive_. Even with Novell's support, it's not likely that the project could afford this.
And that's just if Microsoft would even put a price to it, which is the fourth issue. Somehow, I don't consider it likely.

In conclusion, the only way Mono is ever going to contain MS's algorithm and tables without reverse engineering them is if Microsoft decides that having an easily portable version of the .Net environment available to everyone is in their interest, and thus give their stuff to Mono for free.
Who knows? They might decide it is. It is, after all, one big advantage of Java, the most direct contender, that it will run anywhere. Anyone wishing to use any UNIX for their enterprise servers (for reasons of cost, stability, scalability, or whatever) has to use J2EE instead of .Net. Anyone wishing to write their programs so that they run anywhere has to use Java (or C++ with a cross-platform toolkit) instead of .Net. Anyone wishing to write little applications that run within the browser, with reasonable portability, has to use Java or Flash, not .Net or ActiveX - even among IE users, the trust in ActiveX is very low. IE users on Windows, that is. Mac IE doesn't support ActiveX.

IMHO, portability would serve MS well. But then, I'm writing this in a Mozilla Firefox running on Gentoo Linux on an AMD Athlon64 in native 64-bit mode. I actually have to launch a 32-bit Mozilla to run Flash movies, so I can't really claim impartiality when it comes to portability to minor platforms.

# Michael S. Kaplan on 4 Nov 2005 7:48 PM:

Hi CornedBee,

Thanks for weighing in here. I suspect that you may be correct about many of the issues that may serve to block progress here, on both sides.

Though I really do believe that the current plan is not necessarily in the best interests of Mono, either. And I truly would hope that the people who make those decisions would consider that, and perhaps try and come up with a better plan.

I have ill will for very few people in this world, and none of those people are involved with the Mono efforts.... :-)

# CornedBee on 6 Nov 2005 7:06 PM:

Well, one thing I wanted to add but forgot is that reverse engineering is what these people know. They might have sniffed a network protocol, shot masses of bytes at a graphics card, jumped into unknown areas of undisclosed libraries. Or they know people who did.
It's what they're comfortable with, so I don't think it's gonna change.

