Keeping out more of the undesirables

by Michael S. Kaplan, published on 2006/11/12 00:36 -05:00, original URI:

Remember my posts about stripping diacritics using normalization?

Well, Feroze does. Just the other day he asked me:

I had spoken to you a couple of months ago about normalizing a string to remove diacritics and to expand some ligatures. You had pointed me to some code on your blog which shows how to do this.

Anyway all that is working fine for us. However, we have a problem – in that our test team is passing a string to our function, and that is causing String.Normalize(NormalizationForm.FormD) to throw an ArgumentException, saying that the Unicode code point at a certain index is invalid.

So I had a look at the code, and it seems that our test team is doing the following to generate a random string: First they generate a random character, and add that to a string using append.

char c = (char)rand.Next(32, 65535);
s = s + c;

Now, I am not at all familiar with Unicode, but do you think that these two lines might not always give a valid string? Are all values in the range [32,65535] valid ordinals for Unicode characters?

Sorry if this is a silly question – I looked up the Unicode standard but could not figure out from it whether this was valid or not.


Of course the answer is that the range of code units from 0x0020 (32) to 0xffff (65535) contains plenty of code units that are not valid in Unicode. Some are permanently reserved and will never be valid, some are roadmapped for future assignment, and some are just in the small pool of "nothing there yet but one day something could end up being assigned to the spot". Not all of those cases cause errors in normalization, but depending on what the code is trying to do, it could be leading to invalid test cases.
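As a rough illustration of how a test team might avoid the problem, here is a minimal sketch of a random string generator that rejects the troublesome code units before appending them. The class and method names (RandomTestStrings, IsUsableCodeUnit, Generate) are hypothetical and not from the original exchange, and the filter leans on the Unicode category data that ships with the runtime, so what counts as "unassigned" will vary with the .NET version.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class RandomTestStrings
{
    // Hypothetical helper: true when a single UTF-16 code unit can stand
    // on its own in a test string.
    public static bool IsUsableCodeUnit(char c)
    {
        UnicodeCategory category = char.GetUnicodeCategory(c);
        return category != UnicodeCategory.Surrogate          // only valid as half of a pair
            && category != UnicodeCategory.OtherNotAssigned;  // unassigned or permanently reserved
    }

    // Builds a string of exactly 'length' code units drawn from U+0020..U+FFFF,
    // skipping anything IsUsableCodeUnit rejects.
    public static string Generate(Random rand, int length)
    {
        var sb = new StringBuilder(length);
        while (sb.Length < length)
        {
            // Random.Next excludes its upper bound, so 0x10000 lets U+FFFF
            // be drawn (and then rejected by the filter).
            char c = (char)rand.Next(0x20, 0x10000);
            if (IsUsableCodeUnit(c))
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

A call such as RandomTestStrings.Generate(new Random(), 20) would then stand in for the append loop. This filter is stricter than normalization itself requires, since it also drops the "not yet assigned" values, and it still lets Private Use characters through; whether either choice is right depends on what the test is meant to cover.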

All of this led to an obvious anticipatory follow-up question from Feroze:

Thanks for your reply. The only followup question I have is: why isn't the cast to Char from int throwing, or the append to string of an invalid Unicode character?

Is that a bug?

This is an excellent question, actually. In my opinion this is not a bug at all. Some might classify it as a design limitation in the datatype, and others may feel it borders more on a design flaw. But it does fall somewhere on that particular axis of interpretation.... :-)

But let me explain why I do not feel that this is a bug.

The problem here is that the System.String datatype (like its WCHAR/LPWSTR ancestors), which is documented as one that "represents text as a series of Unicode characters", is forgiving about any and all code units that are not valid Unicode code points, but the System.String.Normalize method and its unmanaged counterpart, the NormalizeString function, which implement Unicode normalization, are not quite so forgiving.

Though in order to maximize their usefulness, these methods do allow all of those "not yet assigned" code units; it is only the ones that are reserved and which never represent valid Unicode data that are blocked. Which is why the code Feroze asked about will work just fine until it randomly doesn't.
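For code that has to accept strings of unknown origin, one option is to treat the exception as an expected outcome rather than a crash. The following is a minimal sketch, assuming a hypothetical wrapper named TryNormalizeFormD; only String.Normalize, NormalizationForm.FormD and ArgumentException come from the framework itself.

```csharp
using System;
using System.Text;

public static class SafeNormalization
{
    // Hypothetical wrapper: returns false instead of throwing when the
    // input contains something the normalizer refuses to process.
    public static bool TryNormalizeFormD(string input, out string normalized)
    {
        try
        {
            normalized = input.Normalize(NormalizationForm.FormD);
            return true;
        }
        catch (ArgumentException)
        {
            // Raised when the string holds data that normalization rejects,
            // such as the values discussed above.
            normalized = null;
            return false;
        }
    }
}
```

Exactly which inputs are rejected depends on the normalization implementation underneath the runtime, so a test that relies on a specific value throwing should be checked against the platform it actually runs on.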

The Unicode terminology here has them described as noncharacters in section 16.7 of Unicode 5.0 (more on 5.0 soon!):

Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data. See Section 3.4, Characters and Encoding, for the formal definition of noncharacters and conformance requirements related to their use.
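The set being described is small and fixed: the 32 code points U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF), for 66 in total. A minimal sketch of a check for them follows; the Noncharacters class and its method names are hypothetical, and nothing here is taken from Microsoft's implementation.

```csharp
using System;

public static class Noncharacters
{
    // True for the 66 code points Unicode permanently reserves as noncharacters.
    public static bool IsNoncharacter(int codePoint)
    {
        if (codePoint < 0 || codePoint > 0x10FFFF)
        {
            throw new ArgumentOutOfRangeException(nameof(codePoint));
        }

        // The contiguous block U+FDD0..U+FDEF.
        if (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
        {
            return true;
        }

        // The last two code points of every plane: low 16 bits are FFFE or FFFF.
        return (codePoint & 0xFFFE) == 0xFFFE;
    }

    // Counts noncharacters across the whole code space as a sanity check.
    public static int CountAll()
    {
        int count = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++)
        {
            if (IsNoncharacter(cp))
            {
                count++;
            }
        }
        return count;
    }
}
```

Because the check works on code points rather than UTF-16 code units, values above U+FFFF would need to be assembled from their surrogate pairs first (for example with char.ConvertToUtf32) before being passed in.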

Since these code points cannot ever be valid in Unicode data, they make a potentially interesting set of sentinel or other such values that an implementation can use for a particular algorithm (kind of like what the console was doing with 0x0100, as I described in "Ā was unexpected at this time", only having it not be a bug since the values are actually reserved and cannot ever be used).

Now of course System.String can be used for both publicly interchanged strings and ones used only internally, so the use of noncharacters in the latter is not necessarily a bug. The decision to make it illegal to pass them through Microsoft's implementation of the Unicode normalization algorithm (see Unicode Standard Annex #15: Unicode Normalization Forms) is a choice that is not prescribed, but it is also not forbidden -- think of it as another nice bit of social engineering to help keep these noncharacters from being publicly interchanged. :-)


this post brought to you by U+FFFE, one of the most famous of all of the noncharacters

Doug on 12 Nov 2006 2:11 AM:

You'll need to check for more than just invalid code points. You also need to check for incorrect usage of chars in the surrogate range.

Michael S. Kaplan on 12 Nov 2006 2:43 AM:

Very true, Doug. Though the rules about intercharacter invalidity are different than code points with noncharacter status....

This is also an issue that comes up in other contexts, such as code pages. I'll be covering this in a post coming up very soon (perhaps even tomorrow some time!).

Michael S. Kaplan on 12 Nov 2006 2:44 AM:

(Note that the original "keeping out the undesirables" post featured unpaired surrogates quite prominently!)

Alan McFarlane on 13 Nov 2006 8:43 AM: talks of a tool of his/hers which:

"will generate random strings of Unicode characters between the ranges of U+0020 and U+FFFF up to 65,535 characters in length either as a fixed length string or a random length string. The ranges of Unicode code points that are not assigned to a language script, and special areas such as Private use and surrogate areas are excluded from the generated strings."

Haven't looked at it further...


Michael S. Kaplan on 13 Nov 2006 11:03 AM:

Yeah, this one may or may not have a problem here -- kind of eerie, all the similarities, huh? :-)

