Keeping out the undesirables?

by Michael S. Kaplan, published on 2006/05/31 23:45 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/31/612544.aspx


The NLS API function IsNLSDefinedString is an exercise in social engineering within software.

Perhaps I should explain what the hell I am talking about. :-)

This function takes a string and essentially gives you a judgment about whether this string is one you can pass to the collation functions in the NLS API and expect to have something along the lines of reasonable, supportable results.

The process is simple. It enumerates every UTF-16 code unit in the string, and uses the following tests to make its decision.

Clearly, this is not a linguistic judgment, since the conditions are easily stated. Every UTF-16 code unit in the string:

  1. Has weight, or is on a small list of valid weightless code units;
  2. Is not in the PUA;
  3. Is not an unpaired surrogate.

Calling a string that does not pass this test INVALID has interesting consequences, since it means that IsNLSDefinedString is not just returning whether to expect determinism in collation function results. If that were the case, then only point #1 would be needed.
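To make the three conditions concrete, here is a rough Python sketch of the same per-code-unit walk. It is not the NLS implementation, just an illustration under some loud assumptions: the names looks_nls_defined, utf16_units and has_weight are made up for this example, the has_weight callback stands in for the real sort tables (which are not available here, so by default every code unit is treated as weighted), and only the BMP private use range U+E000..U+F8FF is checked.

```python
import struct

def utf16_units(s):
    """Return the UTF-16 code units of s, keeping lone surrogates intact."""
    raw = s.encode("utf-16-le", "surrogatepass")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))

def looks_nls_defined(s, has_weight=lambda unit: True):
    """Hypothetical approximation of the three tests described above.

    has_weight is an assumed stand-in for test #1 (weight, or on the list
    of valid weightless code units); the default accepts everything.
    """
    units = utf16_units(s)
    i = 0
    while i < len(units):
        u = units[i]
        # Test #3: surrogates must come in high/low pairs.
        if 0xD800 <= u <= 0xDBFF:
            paired = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
            if not paired:
                return False
            # Test #1 on both halves of a well-formed pair.
            if not (has_weight(u) and has_weight(units[i + 1])):
                return False
            i += 2
            continue
        if 0xDC00 <= u <= 0xDFFF:
            return False  # low surrogate with no preceding high surrogate
        # Test #2: BMP private use area only (an assumption of this sketch).
        if 0xE000 <= u <= 0xF8FF:
            return False
        # Test #1: delegated to the caller-supplied predicate.
        if not has_weight(u):
            return False
        i += 1
    return True
```

With the default predicate, looks_nls_defined("a\ue000") and looks_nls_defined("a\ud800") are both rejected, which is exactly the part of the behavior that goes beyond determinism.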

Two questions come up at this point:

Question #1: Why judge the PUA so harshly, if NLS collation functions will return deterministic results?

The issue here is that the private use area has no real context or meaning beyond that created by private agreement. Therefore, there is no way that NLS collation functions can treat such a string as being valid, since its meaning is unknown to the operating system.

So IsNLSDefinedString makes sure that situations that require an answer to the question of determinism are not given false answers based on strings that do not have a known, valid value.

Question #2: Why judge unpaired surrogate code points so harshly, if NLS collation functions will return deterministic results?

The issue here is that an unpaired surrogate is given the same status in Unicode as an undefined code point, so IsNLSDefinedString returns FALSE here just as it would for any other undefined code point.

So if you use IsNLSDefinedString, you are being influenced to do certain things with your application to make sure that these "undesirable" code units are not treated as being valid.

A very geeky form of social engineering, as NLS tries to make the character "neighborhood" a nicer place for the other characters to live!

Could this be expanded in the future to take care of other sequences such as too many diacritics and other potential undesirables? Well, perhaps -- in a new major version only though, of course -- but the line so far has been drawn to differentiate between what has clear meaning in Unicode vs. what does not; it is unclear whether it makes sense in the long run to extend the coverage to handle implementation-specific limits....

 

This post brought to you by U+00ad, a.k.a. SOFT HYPHEN


# Maurits [MSFT] on 2 Jun 2006 12:10 PM:

If I want to limit my Active Directory to (say) Unicode 3.1, can I pass that in the NLSVERSIONINFO object?  Do the NLS version numbers correspond to the Unicode releases?

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_Versioning.asp

referenced by

2011/08/12 There's no "I" in IDN part 8: Punycode don't do the PUA

2011/06/24 An irresistible force walks into an immovable object (aka the Thai that binds us)

2007/12/08 Social engineering in Windows Explorer....

2006/11/12 Maybe it is the name that is 'Undesirable' ?

2006/11/11 Keeping out more of the undesirables

2006/10/22 It is Clear[Type] how the quality is being managed

2006/07/22 Behind the return of the Unicode IME

2006/07/10 The PUA isn't complex enough
