Keeping out the undesirables?

by Michael S. Kaplan, published on 2006/05/31 23:45 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/05/31/612544.aspx

The NLS API function IsNLSDefinedString is an exercise in social engineering within software.

Perhaps I should explain what the hell I am talking about. :-)

This function takes a string and essentially gives you a judgment about whether this string is one you can pass to the collation functions in the NLS API and expect to have something along the lines of reasonable, supportable results.

The process is simple. It enumerates every UTF-16 code unit in the string, and uses the following tests to make its decision.

Does the code unit have weight?
- If the answer is NO, then is it on the small list of code units that are considered valid despite their weightlessness, like U+00ad (SOFT HYPHEN)?
  - If the answer to that question is NO, then return FALSE -- this is an undefined code unit as far as the operating system knows.
  - If the answer to that question is YES, then continue the test.
- If the answer is YES, then continue the test.
Is the code unit in the PUA (Private Use Area) of Unicode?
- If the answer is YES, then return FALSE.
- If the answer is NO, then continue the test.
Is the code unit a low surrogate?
- If the answer is YES, then return FALSE -- an unpaired surrogate code unit was found.
- If the answer is NO, then continue the test.
Is the code unit a high surrogate?
- If the answer is YES, then is the next code unit a low surrogate?
  - If the answer is YES, then skip one additional code unit and continue the test.
  - If the answer is NO, then return FALSE -- an unpaired surrogate code unit was found.
- If the answer is NO, then continue the test.
If you made it to this point, then proceed to the next code unit. If you are at the end of the string then return TRUE.
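
The walk above can be sketched in a few lines of Python. This is only an illustration of the steps as described in this post, not the actual NLS implementation: the has_weight callback, the VALID_WEIGHTLESS set, and toy_has_weight are hypothetical stand-ins for the internal sort weight tables, and the PUA test assumes the BMP range U+E000 through U+F8FF.

```python
from typing import Callable, List, Sequence

SOFT_HYPHEN = 0x00AD
# Assumption: the real exception list is internal to NLS; the post names only SOFT HYPHEN.
VALID_WEIGHTLESS = frozenset({SOFT_HYPHEN})


def to_utf16_units(text: str) -> List[int]:
    """Split a Python string into UTF-16 code units, keeping lone surrogates."""
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def is_defined_string(units: Sequence[int], has_weight: Callable[[int], bool]) -> bool:
    """Apply the per-code-unit tests from the post, in the order given."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        # Test 1: weightless and not on the exception list means undefined.
        if not has_weight(unit) and unit not in VALID_WEIGHTLESS:
            return False
        # Test 2: Private Use Area (BMP range assumed here).
        if 0xE000 <= unit <= 0xF8FF:
            return False
        # Test 3: a low surrogate reached on its own is unpaired.
        if 0xDC00 <= unit <= 0xDFFF:
            return False
        # Test 4: a high surrogate must be followed by a low surrogate.
        if 0xD800 <= unit <= 0xDBFF:
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 1  # skip the low half of the pair
            else:
                return False
        i += 1
    return True


def toy_has_weight(unit: int) -> bool:
    """Hypothetical weight table: everything is weighted except U+00AD and U+FFFF."""
    return unit not in (0x00AD, 0xFFFF)
```

With this toy table, is_defined_string(to_utf16_units("co\u00adop"), toy_has_weight) passes because SOFT HYPHEN is on the exception list, while a string containing U+E000 or a lone U+D83D fails.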

Clearly, this is not a linguistic judgment, since the conditions are easily stated. Every UTF-16 code unit in the string:

1. Has weight or is on a small list of valid weightless code units;
2. Is not in the PUA;
3. Is not an unpaired surrogate.

Calling a string that does not pass this test INVALID has interesting consequences, since it means that IsNLSDefinedString is not just reporting whether to expect determinism in collation function results. If that were the case, then only point #1 would be needed.

Two questions come up at this point:

Question #1: Why judge the PUA so harshly, if NLS collation functions will return deterministic results?

The issue here is that the private use area has no real context or meaning beyond that created by private agreement. Therefore, there is no way that NLS collation functions can treat such a string as being valid, since its meaning is unknown to the operating system.

So IsNLSDefinedString makes sure that situations that require an answer to the question of determinism are not given false answers based on strings that do not have a known, valid value.

Question #2: Why judge unpaired surrogate code points so harshly, if NLS collation functions will return deterministic results?

The issue here is that an unpaired surrogate is given the same status in Unicode as an undefined code point, so IsNLSDefinedString returns FALSE here just as it would for any other undefined code point.
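
As a side illustration of both answers, Python's unicodedata module shows how little the Unicode Character Database says about these code points: a private-use or surrogate code point has a general category but no character name, while SOFT HYPHEN is fully specified. The describe helper is a hypothetical name, and nothing here reflects how NLS stores its own data.

```python
import unicodedata
from typing import Optional, Tuple


def describe(code_point: int) -> Tuple[str, Optional[str]]:
    """Return (general category, character name or None) for a code point."""
    ch = chr(code_point)
    return unicodedata.category(ch), unicodedata.name(ch, None)


print(describe(0x00AD))  # ('Cf', 'SOFT HYPHEN')
print(describe(0xE000))  # ('Co', None): private use, no name
print(describe(0xD800))  # ('Cs', None): surrogate, no name
```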

So if you use IsNLSDefinedString, you are being influenced to do certain things with your application to make sure that these "undesirable" code units are not treated as being valid.
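
For anyone who wants to try the function from a script, here is a minimal sketch that calls it through ctypes. It assumes the documented Win32 signature (NLS_FUNCTION, flags, version info pointer, string, length), passes NULL for the version information to ask for the current version, and uses -1 as the length of a null-terminated string. The nls_defined wrapper is a hypothetical name, and on anything other than Windows it simply returns None.

```python
import sys
from typing import Optional

COMPARE_STRING = 0x0001  # NLS_FUNCTION value from the Windows SDK headers


def nls_defined(text: str) -> Optional[bool]:
    """Ask Windows whether text is a defined string for collation.

    Returns None when not running on Windows. Note that FALSE can also
    mean the call itself failed; this sketch does not check GetLastError.
    """
    if sys.platform != "win32":
        return None
    import ctypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    fn = kernel32.IsNLSDefinedString
    fn.argtypes = [
        ctypes.c_uint32,   # Function (NLS_FUNCTION)
        ctypes.c_uint32,   # dwFlags, must be 0
        ctypes.c_void_p,   # lpVersionInformation, NULL for the current version
        ctypes.c_wchar_p,  # lpString
        ctypes.c_int,      # cchStr, -1 for a null-terminated string
    ]
    fn.restype = ctypes.c_int  # BOOL
    return bool(fn(COMPARE_STRING, 0, None, text, -1))
```

An application that takes the hint would call something like this before storing or indexing a user-supplied name, and reject the name when the answer is False.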

A very geeky form of social engineering, as NLS tries to make the character "neighborhood" a nicer place for the other characters to live!

Could this be expanded in the future to take care of other sequences such as too many diacritics and other potential undesirables? Well, perhaps -- in a new major version only though, of course -- but the line so far has been drawn to differentiate between what has clear meaning in Unicode vs. what does not; it is unclear whether it makes sense in the long run to extend the coverage to handle implementation-specific limits....

This post brought to you by U+00ad, a.k.a. SOFT HYPHEN

# Maurits [MSFT] on 2 Jun 2006 12:10 PM:

If I want to limit my Active Directory to (say) Unicode 3.1, can I pass that in the NLSVERSIONINFO object? Do the NLS version numbers correspond to the Unicode releases?

http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/nls_Versioning.asp

