See that version there? It is going down, man! #2 (aka Everybody WYNNs)

by Michael S. Kaplan, published on 2007/07/31 19:12 -04:00, original URI:

When I posted part one of this two-part series, I should have guessed that Dean Harding would have a good answer:

Well, I'd say if the major version changes *either way* you should re-index.

If a minor version changes backwards, you probably have to re-index as well. Because otherwise, you basically have to re-index strings that *now* return FALSE for IsNLSDefinedString and you couldn't do that without scanning them all -- so you may as well just re-index everything.

To be honest, though, I don't see how you could have done anything differently. If you wanted to skip the "re-index on minor version decrement", I believe you would have to change IsNLSDefinedString so that it returned the minimum version that the string would have been defined in, and I can't imagine how big the data file would have to be to make that possible :-)

This is spot on -- either a major or a minor version change would require you to re-index, and there really is no way around that without adding a whole bunch of metadata about the nature of the changes and/or Dean's suggestion (information about character age with the same weight). The only alternative would be if one had actual knowledge of the changes because one happened to be working there, or one happened to hear about the change.
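To make Dean's rule concrete, here is a minimal sketch of the decision, assuming the sort version has already been split into a major and a minor part. The `SortVersion` type, the `Action` names, and `reindex_action` are all hypothetical names made up for illustration, not part of any NLS API. The "minor went forward" branch assumes you kept track of which strings were undefined when you indexed them, which is my reading of what the discussion implies.

```python
from enum import Enum
from typing import NamedTuple


class SortVersion(NamedTuple):
    # Hypothetical split of a sort version into two parts;
    # the real API's packing of these values is not modeled here.
    major: int
    minor: int


class Action(Enum):
    NONE = "nothing to do"
    REINDEX_FLAGGED = "re-index only strings flagged as undefined at index time"
    REINDEX_ALL = "re-index everything"


def reindex_action(stored: SortVersion, current: SortVersion) -> Action:
    """Decide what to do when the version stored with an index
    differs from the version the system reports now."""
    if stored == current:
        return Action.NONE
    # A major change in either direction means existing weights may have moved.
    if stored.major != current.major:
        return Action.REINDEX_ALL
    # A minor decrement: strings that used to be defined may no longer be,
    # and finding them means scanning everything anyway.
    if current.minor < stored.minor:
        return Action.REINDEX_ALL
    # A minor increment (assumption): only previously undefined strings
    # can be affected, provided they were flagged when indexed.
    return Action.REINDEX_FLAGGED
```

If the flags were never stored, the last branch collapses into `REINDEX_ALL` as well, which is exactly the "re-index on any change" position described above.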

For example -- and this is a not entirely contrived example as it may happen one day -- let's say that the prevailing practice in Sweden were to change along the lines I discussed in Why do we call w 'double u' -- doesn't it look more like a 'double v'? and the Swedish Academy's change to split W and V and give them both primary weights were picked up in some future version of Windows.

To be perfectly honest I expect it to happen eventually, though we could even be looking multiple versions into the future before it was generally expected by the public.

(I am actually curious how fast it will be in these modern times, and whether the inertia surrounding such updates will hold it back and if so for how long!)

Okay, so that is a major version change for sv-SE (0x041d) that may or may not apply to sv-FI (0x081d) and which definitely wouldn't apply to fi-FI (0x040b). But only a major version change in the NLS version, not the defined version -- since the repertoire is unchanged.
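A rough sketch of how one might look at that per locale, assuming you have recorded both version values for each locale before and after an upgrade. The two fields are modeled on the `dwNLSVersion` and `dwDefinedVersion` members of `NLSVERSIONINFO`; the `LocaleVersions` type, the `classify_changes` helper, and every version number below are invented for illustration.

```python
from typing import Dict, NamedTuple


class LocaleVersions(NamedTuple):
    nls_version: int      # models NLSVERSIONINFO.dwNLSVersion
    defined_version: int  # models NLSVERSIONINFO.dwDefinedVersion


def classify_changes(
    before: Dict[str, LocaleVersions],
    after: Dict[str, LocaleVersions],
) -> Dict[str, Dict[str, bool]]:
    """Report, for each locale whose versions differ, which kind of change it saw.
    Locales missing from `after` are skipped."""
    report = {}
    for locale, old in before.items():
        new = after.get(locale)
        if new is None or new == old:
            continue
        report[locale] = {
            # Sorting behavior for this locale changed.
            "sort_changed": new.nls_version != old.nls_version,
            # The set of defined code points (the repertoire) changed.
            "repertoire_changed": new.defined_version != old.defined_version,
        }
    return report


# Made-up version numbers: only sv-SE picks up the hypothetical W/V split.
before = {
    "sv-SE": LocaleVersions(1, 1),
    "sv-FI": LocaleVersions(1, 1),
    "fi-FI": LocaleVersions(1, 1),
}
after = dict(before)
after["sv-SE"] = LocaleVersions(2, 1)
changes = classify_changes(before, after)
```

In this made-up scenario only sv-SE shows up in `changes`, with the sort changed and the repertoire untouched, so indexes built for fi-FI can be left alone.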

We have done it before (ref: The disunification of Norwegian and Danish sorting), so that part is no problem.

I am sure I'll even talk about it here; I may even know a bit of Swedish by then!

In this particular case though, even if that one specific locale (or two specific locales!) changes, there is no point in worrying about IsNLSDefinedString results, since none of them would have changed (this kind of underscores the reason why IsNLSDefinedString results would not be useful!). It may, however, be worth scanning and only re-indexing strings that contain one of the following characters (these are the ones that move for this change in Swedish/Finnish; I don't know whether the WYNN would actually move in this case or not):

and not touching the rest. Glancing through a few Swedish corpora, I think from a performance standpoint it might be cheaper to approach it that way in such a case....
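The scan-and-selectively-re-index idea can be sketched as follows. This assumes the indexed strings are available as plain Python strings and that you know the set of characters whose weights moved. The `PLACEHOLDER_AFFECTED` set below is a stand-in I made up for the example, not the actual list of characters for the Swedish change, and `split_for_reindex` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple

# Placeholder only: NOT the real list of characters that would move.
PLACEHOLDER_AFFECTED = frozenset("VvWw")


def split_for_reindex(
    strings: Iterable[str],
    affected: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Partition strings into those containing at least one affected
    character (re-index these) and those containing none (leave alone)."""
    affected_set = frozenset(affected)
    touched: List[str] = []
    untouched: List[str] = []
    for s in strings:
        # intersection() iterates over the characters of s.
        if affected_set.intersection(s):
            touched.append(s)
        else:
            untouched.append(s)
    return touched, untouched


touched, untouched = split_for_reindex(
    ["Wallander", "Anna", "växel"], PLACEHOLDER_AFFECTED
)
```

Whether this beats a blanket re-index depends on how many strings land in `touched`; the scan is a single pass over the data, so the savings come entirely from the strings that do not need their sort keys rebuilt.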

Now clearly this might not be typical. But on the other hand it might be, to strike the proper correctness/compatibility balance!


This post brought to you by ƿ (U+01bf, a.k.a. LATIN LETTER WYNN)

Dean Harding on 31 Jul 2007 8:01 PM:

That hypothetical Swedish change is interesting. It's a bit of a chicken-and-egg problem really, isn't it? I mean, computers being as prevalent as they are, such a change could not be considered "mainstream" until computers sorted in the "new" way, but at the same time, Microsoft wouldn't want to update the Windows sort tables until the change becomes "widespread" on its own...

Michael S. Kaplan on 31 Jul 2007 8:11 PM:

Well, people start feeling more and more like the computers have it wrong, like happened with Norwegian/Danish -- how long that might take is a really interesting issue (one can compare it with spelling reform in France or other countries, or ORNL changes in Tamil, or other orthography and other changes that have happened over the last thirty years).

So not entirely chicken vs. egg, I think there is a way to move forward....


