What makes a string meaningful?

by Michael S. Kaplan, published on 2005/02/03 03:12 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/03/366145.aspx


Yesterday, I said that CompareString prefers meaningful strings, and that while the (rare) inconsistencies are always bugs, we have to prioritize such bugs based on whether or not the data is actually valid/meaningful.

Many people stopped and wondered how one defines the word 'meaningful' here. Is it a definition that is useful for developers?

I'll ignore the cross-script strings that have little clear semantic or pragmatic meaning and focus on the strings that have code points not defined by the MS collation tables (sometimes not even by Unicode!), as discussed in "The jury will give this string no weight".

Some developers might think they could use the CompareString API and compare characters to a zero-length string. Others think about using LCMapString to look for a "no weight" sort key. But both of these ideas share two problems that keep them from acting as practical solutions:

  1. Checking one character at a time is unwieldy, and checking more than one at a time can miss individual characters with no weight.
  2. Some Unicode code points intentionally have no weight and are valid as they are, such as U+2060 WORD JOINER.

So, what can you do? You can use the IsNLSDefinedString API! You pass it a string and it will tell you if every character in a string has a defined result (which in this case is exactly what you may need).
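As a rough illustration of that call, here is a minimal sketch that reaches IsNLSDefinedString from Python through ctypes. The wrapper name is_nls_defined is hypothetical, and the argument choices (zero for the flags, a NULL version pointer to mean "the current version", and -1 for a null-terminated string) are my reading of the API documentation rather than anything stated in this post. Off Windows the wrapper simply returns None.

```python
import ctypes
import sys

COMPARE_STRING = 0x0001  # NLS_FUNCTION value from winnls.h


def is_nls_defined(text):
    """Hypothetical wrapper: True/False from IsNLSDefinedString, None off Windows."""
    if sys.platform != "win32":
        return None  # the API lives in kernel32.dll, so there is nothing to call
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.IsNLSDefinedString
    fn.argtypes = [
        ctypes.c_uint,     # Function (NLS_FUNCTION)
        ctypes.c_uint32,   # dwFlags
        ctypes.c_void_p,   # lpVersionInformation
        ctypes.c_wchar_p,  # lpString
        ctypes.c_int,      # cchStr
    ]
    fn.restype = ctypes.c_int
    # Assumptions: flags are 0, NULL asks about the current version,
    # and -1 tells the API that the string is null-terminated.
    return bool(fn(COMPARE_STRING, 0, None, text, -1))
```

Note that a FALSE result can also mean the call itself failed, so a more careful version would inspect the last error code before treating the string as unsortable.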

It is intimately related to the GetNLSVersion API, which also helps out with the question of stability in collation.

Both APIs were added in Windows Server 2003, and the Whidbey release of the .NET Framework includes a method analogous to IsNLSDefinedString (CompareInfo.IsSortable; you will see it starting in Beta 2).

GetNLSVersion is used by major databases like Active Directory to know when they need to re-index their data. Basically, looking at the NLSVERSIONINFO struct, the dwDefinedVersion member will be incremented any time a major version sort of change happens, and the dwNLSVersion member will be incremented any time a minor version sort of change happens.
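To make the two members concrete, here is a minimal sketch that reads them with GetNLSVersion through ctypes. The helper name get_nls_version is hypothetical. The structure uses the original three-DWORD layout, and I am assuming the API still accepts that size on newer systems where the SDK declares additional members. Off Windows the helper returns None.

```python
import ctypes
import sys

COMPARE_STRING = 0x0001       # NLS_FUNCTION value from winnls.h
LOCALE_USER_DEFAULT = 0x0400  # LCID for the current user's locale


class NLSVERSIONINFO(ctypes.Structure):
    # Original three-DWORD layout (assumption: still accepted by newer systems).
    _fields_ = [
        ("dwNLSVersionInfoSize", ctypes.c_uint32),
        ("dwNLSVersion", ctypes.c_uint32),
        ("dwDefinedVersion", ctypes.c_uint32),
    ]


def get_nls_version(lcid=LOCALE_USER_DEFAULT):
    """Hypothetical helper: both version members as a dict, or None off Windows."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.GetNLSVersion
    fn.argtypes = [ctypes.c_uint, ctypes.c_uint32, ctypes.POINTER(NLSVERSIONINFO)]
    fn.restype = ctypes.c_int
    info = NLSVERSIONINFO()
    info.dwNLSVersionInfoSize = ctypes.sizeof(info)  # the API checks this size
    if not fn(COMPARE_STRING, lcid, ctypes.byref(info)):
        raise ctypes.WinError(ctypes.get_last_error())
    return {
        "dwDefinedVersion": info.dwDefinedVersion,  # "major" as described above
        "dwNLSVersion": info.dwNLSVersion,          # "minor" as described above
    }
```

A database would store both values alongside its index and compare them with a fresh call at startup.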

Now looking at IsNLSDefinedString, if you have a database and create indexes based on sort keys from LCMapString or B-Trees built from CompareString calls:

  1. Any time the major version is incremented, you should re-index no matter what, and
  2. Any time the minor version is incremented, you should re-index for any entry where IsNLSDefinedString used to return FALSE (in case it now returns TRUE or different results due to part of the string now being defined)
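The two rules above can be sketched as a small, platform-independent function. Everything here is illustrative: IndexEntry, was_fully_defined, and entries_to_reindex are hypothetical names, and versions are modeled as plain (major, minor) tuples rather than read from NLSVERSIONINFO. The sketch compares versions with "not equal" instead of "greater than", which is my own cautious choice so that any change at all triggers the rule.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    text: str
    # Hypothetical field: what IsNLSDefinedString returned when this was indexed.
    was_fully_defined: bool


def entries_to_reindex(entries, old_version, new_version):
    """Return the entries that need new sort keys after a version change.

    Versions are (major, minor) tuples, an assumption made for illustration.
    """
    old_major, old_minor = old_version
    new_major, new_minor = new_version
    if new_major != old_major:
        # Rule 1: a major change invalidates every stored key.
        return list(entries)
    if new_minor != old_minor:
        # Rule 2: only strings that had undefined code points can change.
        return [entry for entry in entries if not entry.was_fully_defined]
    return []  # same version: the index is still trustworthy
```

For example, with one fully defined entry and one that was not, a minor bump returns only the second entry, while a major bump returns both. This only works if the index records the IsNLSDefinedString result for each string at the time it was indexed.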

Obviously, major version changes are expensive and would be expected to be rare -- not even every major release of Windows requires a new major version.

Why is that? Well, usually a new version would just mean a whole bunch of new characters added, and thus there is no need to re-index strings that are already indexed -- which suggests a minor version. Minor version changes would be much more common. With them you can trust all existing index values, and only need to re-index strings that previously contained one or more unsortable elements.

If you follow principles (1) and (2) above and always store information about unsortable strings, you can use these APIs to get the most out of the support for collation of meaningful strings on Windows.

 

This post brought to you by "ქ" (U+10e5, a.k.a. GEORGIAN LETTER KHAR)



