Not all stroke counts are created equal

by Michael S. Kaplan, published on 2005/10/16 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/16/481447.aspx

One of the problems with sort weights that had to be fixed in Vista is the various stroke count sorts.

You see, everyone wanted to add more ideographs, but the original weights left no room for additions. So there was no way to add them and still have them be valid stroke count sorts! But we couldn't just do nothing, not only because of those customer requests but because the bulk of the ideographs people wanted to add had default weights -- so what customers perceived was broken sorts (several bugs describing such problems came in over the years). We obviously had to do something for people.

Ok, so we have to declare a new major version and then make sure to assign weights in a way that leaves room for the future. This is not a trap to get caught in again, believe me.

But what was fascinating was looking at the source data from each of the subsidiary contacts -- each one of them had a different method of doing the stroke counts:

Looking to the future when new characters might need to be added, room has to be left for such additions in each case. Which is what we ended up doing.
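
To make the "leave room" idea concrete, here is a minimal sketch in Python. It is purely illustrative and is not the actual Windows weight tables or assignment scheme: the block size, the gap size, the function names, and the ordering of characters within a stroke count are all assumptions made up for the example. The point is only that spacing weights out lets a new character be slotted in later without renumbering everything else.

```python
# Hypothetical illustration only: NOT the real Windows sort weight data.
BLOCK = 4096  # assumed range of weights reserved per stroke count
GAP = 16      # assumed spacing between neighbouring characters


def assign_weights(chars_by_strokes, block=BLOCK, gap=GAP):
    """Give each stroke-count group its own block, spacing entries by `gap`."""
    weights = {}
    for strokes, chars in sorted(chars_by_strokes.items()):
        base = strokes * block
        for i, ch in enumerate(chars):
            weights[ch] = base + (i + 1) * gap
    return weights


def insert_between(weights, new_char, before, after):
    """Place new_char between two existing neighbours, if any room is left."""
    lo, hi = weights[before], weights[after]
    if hi - lo < 2:
        # This is the situation the original weights were in: no gap to use.
        raise ValueError("no room left between neighbours")
    weights[new_char] = (lo + hi) // 2


# Order within each stroke count is an assumption for the example.
table = assign_weights({1: ["一", "乙"], 2: ["二", "十", "人"]})
insert_between(table, "丁", "二", "十")  # 丁 is a two-stroke character
ordered = sorted(table, key=table.get)
```

With densely packed weights (a gap of 1), `insert_between` has nowhere to put the new character, which is roughly the corner the original tables were in.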

Of course inevitably the question came up -- would we expose this information in some way? Or even should we? I mean who is the customer who needs that level of detail? It obviously does not make sense to give it to competitors, any more than a map maker allows people to copy their maps and resell them. And we do not even have the luxury that the map makers have of adding fictional streets to the map to catch the copiers!

Well, maybe there is some form in which the information would be useful on the platform. It is always worthwhile to think about such things.

Of course even if it made sense to do, the problems in doing so are obviously not trivial -- what generic function could be written that would provide all of that information, when you consider the vast array of different sources and methods?

And doesn't the same issue exist in other languages? I mean obviously the issues are different, but there might still be a commonality, a reason to expose more than we do today to support some type of functionality.

This is a conversation that spans versions and definitely is one for beyond Vista. It might fit into a sensible area for providing functionality in the future. What makes sense to expose, exactly? Perhaps knowing the answer to the WHAT question might lead to sensible answers for the HOW and WHEN questions....

This post brought to you by "𠂉" (U+20089, a CJK Extension B ideograph that looks a bit like a circle)

Just curious, how do you handle characters where the number of strokes can vary depending on how people write them? U+4E16 comes to mind, I learned it as 4 strokes but I've read that it can be written in 5 as well (by writing the "U" shaped part as 3 strokes instead of 2).

The data that is given to us assumes official ways of writing the characters, especially the data for China, which actually has an explicit stroke order attached to it.

If you look at any ideograph, there may be many ways to write it but there is generally just one 'official' way to do so according to educators/academics. Where it can get interesting is when those 'official' sources vary, of course.... :-)
