Collation data -- must be stable, but it must not stand still

by Michael S. Kaplan, published on 2005/05/04 02:28 -04:00, original URI:

(Apologies to Roscoe Pound, he was talking about the law here)

We add new locales to Windows with a new Windows version (or, if you count the ELK locales in XPSP2, a new service pack). Or even in hotfixes in extreme situations.

And that means adding new sorts for many of those new locales. And sometimes it even means adding new code points to the default table, if they were not there already.

But how do we do that? Or more to the point, how do developers and applications deal with the repercussions of that happening between versions?

The key is that an application cannot assume that the collation data will never change. Because the truth is that it may, and it has.

But an application has a choice here -- rather than just forcing a re-index automatically with a new version, it can make use of the collation version APIs (IsNLSDefinedString and GetNLSVersion), in a method I first described when I talked about what makes a string meaningful.

I thought it was worth repeating the process again, just to get in people's systems....

Now GetNLSVersion is used by major databases like Active Directory in order to know when they need to re-index their data. Basically, looking at the NLSVERSIONINFO struct, the dwDefinedVersion member will be incremented any time a major version sort of change happens, and the dwNLSVersion member will be incremented any time a minor version sort of change happens.

Now looking at IsNLSDefinedString, if you have a database and create indexes based on sort keys from LCMapString or B-Trees built from CompareString calls:

  1. Any time the major version is incremented, you should re-index no matter what, and
  2. Any time the minor version is incremented, you should re-index for any entry where IsNLSDefinedString used to return FALSE (in case it now returns TRUE or different results due to part of the string now being defined)
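
The two rules above can be sketched in a few lines. This is only an illustration of the decision logic, not the Windows API itself: SortVersion, IndexEntry, reindex_plan and entries_to_reindex are hypothetical names, and the sketch assumes the application stored the major and minor version counters (as reported by GetNLSVersion) along with the IsNLSDefinedString result for each string at the time it was indexed.

```python
from dataclasses import dataclass
from enum import Enum


class Reindex(Enum):
    NONE = "none"                      # versions match, index is still valid
    UNDEFINED_ONLY = "undefined-only"  # minor change: revisit flagged entries
    FULL = "full"                      # major change: rebuild everything


@dataclass(frozen=True)
class SortVersion:
    # Hypothetical stand-in for the two version counters; on Windows the
    # values would be read from the struct filled in by GetNLSVersion.
    major: int
    minor: int


@dataclass
class IndexEntry:
    text: str
    # Assumed to be the IsNLSDefinedString result recorded at index time.
    was_fully_defined: bool


def reindex_plan(stored: SortVersion, current: SortVersion) -> Reindex:
    """Compare the version saved with the index against the current one."""
    # "!=" rather than ">" so that an index moved to a machine with older
    # data is also caught (a design choice of this sketch).
    if current.major != stored.major:
        return Reindex.FULL
    if current.minor != stored.minor:
        return Reindex.UNDEFINED_ONLY
    return Reindex.NONE


def entries_to_reindex(entries, stored, current):
    """Return the entries whose sort keys have to be regenerated."""
    plan = reindex_plan(stored, current)
    if plan is Reindex.FULL:
        return list(entries)
    if plan is Reindex.UNDEFINED_ONLY:
        return [e for e in entries if not e.was_fully_defined]
    return []
```

After a re-index, the application would save the current version next to the index, and refresh the per-entry flag, so that the next comparison starts from the right baseline.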

Obviously, major version changes are expensive and would be expected to be rare -- not even every major release of Windows requires a new major version.

Why is that? Well, usually a new version would just mean a whole bunch of new characters added, and thus there is no need to re-index strings that are already indexed -- which suggests a minor version. Minor version changes would be much more common. With them you can trust all existing index values, and only need to re-index strings that previously contained one or more unsortable elements.

I have mentioned that the Whidbey release of the .NET Framework includes a method analogous to IsNLSDefinedString (CompareInfo.IsSortable, now there in Beta 2). And it is the first step. The collation data is actually the same as the data that shipped in Windows Server 2003 (the first version of the sorting data that is reported by these APIs), so for all versions of the .NET Framework so far you can assume it is version 0x0001 and a GetNLSVersion equivalent is not needed.

Obviously not even Windows technically needed the GetNLSVersion API when there was only one version being tracked, but this does allow you to contrast that version with all of the earlier ones.

Those kinds of careful backcompat schemes can really help an application to re-index when appropriate and only when it is appropriate....


This post brought to you by "Э" (U+042d, a.k.a. CYRILLIC CAPITAL LETTER E)


