You must have heard wrong, Jesse\ I don't know about tailoring\ But about the algorithm Jesse\ That is used by Microsoft...

by Michael S. Kaplan, published on 2008/07/06 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/07/06/8694239.aspx

Apologies for the misuse of the Zero Mostel/Chaim Topol Tevye's Dream in the title...

Over in the ever-shrinking Suggestion Box, Jesse Hallam asks:

Hi Michael,

I've been scouring the net for some discussion of how one goes about tailoring a default collation table. Specifically, how does one correctly re-weight the table? ICU does it, but doesn't do a particularly good job of describing how. The UCA talks about it, but mentions very few particulars.

In my case, I'm interested in tailoring the DUCET, but I wondered if perhaps you could share some insight into how Microsoft generates the resulting weights from their default table.

As a Microsoft employee, I can't claim any real knowledge or understanding of ICU, the International Components for Unicode. I've never looked at their code to be able to tell you exactly how tailorings work under it.

The UCA itself doesn't really give the kind of implementation details that Jesse would be looking for either. Though my A&P of Sort Keys series may help, particularly A&P of Sort Keys, part 4 (aka It isn't a race but let's make an EXCEPTION and cross the Finnish line).

And if you need real detail, the Windows Protocols documentation, particularly the section on Unicode String Comparison, probably goes much deeper than most humans would ever want to go on the subject.

Though it is hard to know how much any of this will help, since the answer for how to tailor is so dependent on how one is implementing collation in the first place....
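To give a feel for why the answer depends so much on the implementation, here is a minimal sketch of just one possible approach: a table whose primary weights are spaced out so that a tailored character can be slotted into the gap after its anchor. Everything in it is an assumption made for illustration. The toy table, the 0x10 spacing, and the names tailor_after and sort_key are hypothetical; this is not the DUCET, not what ICU does, and not how Windows generates its weights.

```python
# Toy default table with primary weights only. The values and the 0x10
# spacing are invented for illustration; this is NOT the DUCET and NOT
# the Windows default table.
DEFAULT_PRIMARY = {
    ch: 0x100 + 0x10 * i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz")
}


def tailor_after(table, anchor, new_char):
    """Return a copy of table in which new_char sorts just after anchor.

    Hypothetical helper: it places new_char halfway between the anchor's
    weight and the next higher weight already in the table.
    """
    tailored = dict(table)
    tailored.pop(new_char, None)  # drop any existing weight for new_char
    anchor_weight = tailored[anchor]
    higher = [w for w in tailored.values() if w > anchor_weight]
    upper = min(higher) if higher else anchor_weight + 0x10
    if upper - anchor_weight < 2:
        # A gap-based scheme runs out of room here; a real implementation
        # would have to re-weight the table or use a different encoding.
        raise ValueError("no gap left after anchor")
    tailored[new_char] = (anchor_weight + upper) // 2
    return tailored


def sort_key(table, text):
    """Build a primary-only sort key: one weight per character."""
    return tuple(table[ch] for ch in text)


# Tailoring in the spirit of "& z < å": put å after z instead of near a.
table = tailor_after(DEFAULT_PRIMARY, "z", "å")
words = ["zebra", "åra", "apple"]
sorted_words = sorted(words, key=lambda w: sort_key(table, w))
print(sorted_words)
```

An implementation that packs its weights with no gaps, or that uses variable-length weights, would have to do something quite different at the point where this sketch raises an error, which is exactly why there is no single answer to the question.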

This blog brought to you by ං (U+0d82, aka SINHALA SIGN ANUSVARAYA)
