by Michael S. Kaplan, published on 2006/11/12 17:53 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/11/12/1064717.aspx
This past May, in Keeping out the undesirables, I talked about how the IsNLSDefinedString function took some extra conditions that it required in order to consider a string to be defined by its standards (in that case, no PUA and no unpaired surrogates, in addition to being unknown to the collation data). The same conditions apply to the managed CompareInfo.IsSortable method, by the way....
And then yesterday in Keeping out more of the undesirables, I talked about how the System.String.Normalize method and the NormalizeString function take the law (well, the Unicode conformance rules) into their own hands by reporting an error in the case of noncharacters (I believe they also reject unpaired surrogates, which should help with Doug's concerns here!).
Although I believe the operations and conditions that these various functions and methods perform and work under are justified, one could feel that they are a bit heavy handed in their judgment. After all, PUA characters are people too and such, right? :-)
Now let's move into the encoding space for a moment, with the MultiByteToWideChar/WideCharToMultiByte functions and the System.Text.Encoding class.
All of these methods have pretty much had the notion of doing something special in the case of text that is improper, and in prior versions the rule was simple:
This may seem like the same thing, but usually it isn't; it is simply a matter of deciding how to proceed when one is told to convert using rules that consider some of the text to be undefined.
Now there are several "code pages" used by the Encoding class that cover the entire Unicode space:
In these cases, one could argue that the "undefined" rule does not apply, since anything you can express in one of them can be expressed in any of the others.
The older behavior of all of these methods was to either let text pass through unchanged or consider it an error, depending on how the function was instructed to deal with that case.
The newer behavior, however, is to replace these invalid pieces with U+fffd (REPLACEMENT CHARACTER) rather than letting them pass through as is. A bit more social engineering here -- an effort to have a say in what one wishes to pass through here, whether one likes it or not.
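The difference between "consider it an error" and "replace it with U+fffd" is easy to see outside of Windows, too. What follows is just a minimal sketch that uses Python's built-in codecs as a stand-in; it is not the MultiByteToWideChar or System.Text.Encoding implementation, and the sample bytes and the helper names (decode_strict, decode_replace) are made up for illustration.

```python
# Python's codecs stand in here for the Windows/.NET converters (an assumption
# made for illustration; the real APIs are not being called).

# Sample input: the byte 0xFF can never appear in well-formed UTF-8.
BAD_UTF8 = b"abc\xffdef"


def decode_strict(data: bytes):
    """'Consider it an error' style: return None when the input is invalid."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def decode_replace(data: bytes) -> str:
    """'Replace' style: each invalid piece becomes U+FFFD."""
    return data.decode("utf-8", errors="replace")


strict_result = decode_strict(BAD_UTF8)
replaced_result = decode_replace(BAD_UTF8)

print(strict_result)           # the strict decoder refused the input
print(ascii(replaced_result))  # escaped form, so the U+FFFD is visible
```

Note that neither mode lets the bad byte through untouched, which is exactly the "pass through unchanged" option that the older behavior allowed.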
Now this is a behavior change, for invalid text. And definitely worth taking note of (it is not necessarily a change for the better, in my opinion -- though that is simply a matter of opinion, and considering how easy the old way made it to emit bad text I can understand the argument, even if it does seem a little heavy-handed).
Since it is the only operation that a "UTF-16 encoding" does in its "UTF-16 LE to UTF-16 LE" conversion, it really is the only good purpose for the encoding at all in that case. Kind of a text colonic of some sort? :-)
Anyway, in looking to future versions of Windows, the question of whether to support those last four operations comes up, and then inevitably how to do so also comes up. It is kind of a good-natured argument I have with Shawn, that "not a Klingon" across the hall from me.
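To make the "UTF-16 LE to UTF-16 LE" idea concrete, here is a rough sketch of such a cleanup pass, again leaning on Python's codecs rather than anything in Windows or .NET. The function name scrub_utf16le and the sample strings are hypothetical; the only claim is that a decode-with-replacement followed by a re-encode leaves well-formed text alone and turns an unpaired surrogate into U+fffd.

```python
def scrub_utf16le(raw: bytes) -> bytes:
    """Round-trip UTF-16 LE bytes, replacing ill-formed pieces with U+FFFD.

    Hypothetical helper for illustration only; not a Windows or .NET API.
    """
    text = raw.decode("utf-16-le", errors="replace")
    return text.encode("utf-16-le")


# Well-formed input: 'a', U+1D11E (a proper surrogate pair in UTF-16), 'b'.
valid = "a\U0001d11eb".encode("utf-16-le")

# Ill-formed input: a lone high surrogate (U+D800) sitting between 'a' and 'b'.
# "surrogatepass" is needed just to get the bad code unit into the bytes.
lone = "a\ud800b".encode("utf-16-le", errors="surrogatepass")

scrubbed_valid = scrub_utf16le(valid)
scrubbed_lone = scrub_utf16le(lone)

print(scrubbed_valid == valid)                    # well-formed text is untouched
print(ascii(scrubbed_lone.decode("utf-16-le")))   # the lone surrogate is gone
```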
Both of us believe it makes sense to have some function that will act as a "colonic" for these various forms and schemes for Unicode text to clean out this garbage. No disagreement there. :-)
I'm just not sure whether treating all of them like "multibyte" encodings (which is the semantic that MultiByteToWideChar/WideCharToMultiByte calls would lend to them) is the best way to go.
I mean, since UTF-32 and Big Endian UTF-32 are the least multibyte of all the encodings, it kind of feels like a perversion of the English language to consider a UTF-16 --> UTF-32 conversion to be one of "wide char to multi byte", since if anything UTF-16 is multibyte in the case of supplementary characters when UTF-32 is not.
Perhaps I am getting hung up on the language semantic here, but it just feels wrong from a descriptive perspective, if nothing else....
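To put numbers on the UTF-16 versus UTF-32 point, here is a small sketch that counts code units per character in each form. It uses Python's codecs purely as a calculator; the helper name code_units and the choice of sample characters are assumptions for illustration.

```python
def code_units(text: str, encoding: str, unit_size: int) -> int:
    """Number of code units needed to encode text (unit_size is in bytes)."""
    return len(text.encode(encoding)) // unit_size


BMP = "A"                     # U+0041, in the Basic Multilingual Plane
SUPPLEMENTARY = "\U0001d11e"  # U+1D11E, a supplementary character

# UTF-16 uses 2-byte code units; UTF-32 uses 4-byte code units.
utf16_units = {ch: code_units(ch, "utf-16-le", 2) for ch in (BMP, SUPPLEMENTARY)}
utf32_units = {ch: code_units(ch, "utf-32-le", 4) for ch in (BMP, SUPPLEMENTARY)}

for ch in (BMP, SUPPLEMENTARY):
    print(f"U+{ord(ch):04X}: UTF-16 units={utf16_units[ch]}, UTF-32 units={utf32_units[ch]}")
```

The supplementary character is the only case where the counts differ, and it is UTF-16 (the supposed "wide char" side) that needs more than one unit.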
One could argue that UTF7 and UTF8 are indeed multibyte encodings so that the longstanding decision to make them "code pages" makes some sense -- though they sort of sent us down this path of weird language that was doing conversions from Unicode to Unicode (albeit different forms). One could maybe think of UTF-16 and UTF-32 "code pages" as being a proof by reductio ad absurdum that even the original notion of doing any sort of Unicode conversion that simply goes to Unicode and treating it like a code page was flawed.
Anyway, lazy thoughts for a Sunday afternoon. It's not like any of this would be shipping any time soon, but if you have opinions for or against any of this then please feel free to make them known; information like that can always help with planning, after all!
(And I would humbly suggest that the word colonic not appear in any suggested function names!)
This post brought to you by � (U+fffd, a.k.a. REPLACEMENT CHARACTER)
Xenia Tchoumitcheva on 13 Nov 2006 11:11 AM:
How about the following name for the home edition of the "Longhorn Server":
Windows Home Server
Microsoft has just registered all international domains
and so on. So finally a home version of the upcoming new server OS.
Michael S. Kaplan on 13 Nov 2006 12:10 PM:
Hmmm...... not sure if that counts as a desirable product name or not. :-)
2006/12/05 Validation of Unicode text is growing up