Quite a [non]character, it is

by Michael S. Kaplan, published on 2007/02/14 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/02/14/1674482.aspx


Mini (no, not that mini!) asks:

I am on the ##### ####### ######## team. Hoping you can help me with a question about String.Normalize.

We got a few Watson dumps where String.Normalize threw the following exception: System.ArgumentException: Invalid Unicode code point found at index <n>. Looking at the dump, the character (0xfde2) is the one that it complains about.

I tried writing a standalone program to try this out and it does throw the exception for this character.

I tried using Encoding.Unicode.GetChars on a byte array with the value (0xfde2) and it works. 0xfde2 appears to be a valid Unicode char. Do you know why String.Normalize says it’s invalid?

Well, the problem is essentially the one I pointed out in Keeping out more of the undesirables.

And U+fde2 is actually an ideal Unicode code value to show off what is happening here, as it is part of the range of 32 code values from U+fdd0 to U+fdef in the Arabic Presentation Forms-A block, all of which are documented noncharacters with a defined suggested usage: "These codes are intended for process-internal uses, but are not permitted for interchange."
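To make the range concrete, here is a minimal sketch in Python (not the .NET code that String.Normalize runs, which this post does not show) of a check for noncharacters. The function name is_noncharacter is made up for illustration. Beyond the 32 code values mentioned above, it also covers the other noncharacters defined by the Unicode Standard, which are the last two code values of each of the 17 planes (U+fffe, U+ffff, U+1fffe, U+1ffff, and so on), for 66 in total.

```python
def is_noncharacter(code_point: int) -> bool:
    """Return True if code_point is one of the 66 Unicode noncharacters.

    Hypothetical helper for illustration; not part of any library.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))

    # The contiguous run of 32 noncharacters in Arabic Presentation Forms-A.
    if 0xFDD0 <= code_point <= 0xFDEF:
        return True

    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (code_point & 0xFFFE) == 0xFFFE


if __name__ == "__main__":
    for cp in (0xFDCF, 0xFDD0, 0xFDE2, 0xFDEF, 0xFDF0, 0xFFFE, 0x10FFFF):
        print("U+%04X noncharacter? %s" % (cp, is_noncharacter(cp)))
```

A caller could run a check like this over a string before handing it to a normalizer, and decide for itself whether to reject the string or strip the offending values.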

And like I said in that other post, one of the things that happens when you use String.Normalize is that it will throw on invalid Unicode characters, and noncharacters are really not an appropriate normalization target. Mini checked the source of the data, and it appeared to come from a test case that was generating random strings, which is a bad idea given all of the invalid values that could be generated.
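If a test really does need random strings, one option is to filter the candidates as they are generated. The following is a rough sketch in Python under stated assumptions: the names is_interchange_safe and random_test_string are hypothetical, and the filter only excludes surrogate code points and noncharacters. It can still produce unassigned code values, and it makes no claim about exactly which values String.Normalize accepts.

```python
import random


def is_interchange_safe(code_point: int) -> bool:
    """Hypothetical filter: reject surrogates and noncharacters only."""
    if 0xD800 <= code_point <= 0xDFFF:       # surrogate code points
        return False
    if 0xFDD0 <= code_point <= 0xFDEF:       # noncharacters, U+FDD0..U+FDEF
        return False
    if (code_point & 0xFFFE) == 0xFFFE:      # U+nFFFE and U+nFFFF per plane
        return False
    return True


def random_test_string(length: int, rng=None) -> str:
    """Build a random string of `length` code points that pass the filter."""
    rng = rng or random.Random()
    chars = []
    while len(chars) < length:
        candidate = rng.randrange(0x110000)  # any code point up to U+10FFFF
        if is_interchange_safe(candidate):
            chars.append(chr(candidate))
    return "".join(chars)


if __name__ == "__main__":
    sample = random_test_string(8, random.Random(2007))
    print(" ".join("U+%04X" % ord(ch) for ch in sample))
```

Passing a seeded random.Random makes a failing string reproducible, which is handy when the failure only shows up later in a crash dump.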

 

This post brought to you by U+fde2, a noncharacter.

