How to NOT Parse Unicode Digits, or How to: Parse Unicode Digits... NOT!
by Michael S. Kaplan, published on 2006/04/26 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/04/26/583390.aspx
I have talked about digit substitution many times in the past.
I was reminded of it recently when developer Kollen pointed out a pretty lame article:
I have a bug on parsing fullwidth Unicode digits and I noticed the (poorly named) "How to: Parse Unicode Digits" section of the .NET docs that indicates this behavior is by design. Are there any alternatives? Why doesn't this API support more than just the ASCII equivalent digits in Unicode?
Kollen is right, this article is poorly named. But beyond that, what is the benefit of putting together a huge function with no other purpose than to not work? How many times will people not read the surrounding text and copy/paste that code into their applications?
Yuck!
Well, at least the answer to Kollen's other question can be found here.... :-)
This post brought to you by "９" (U+ff19, FULLWIDTH DIGIT NINE)
(a Unicode character that is dressed to the nines....)
# Maurits [MSFT] on 26 Apr 2006 6:01 PM:
> what is the benefit of putting together a huge function with no other purpose than to not work?
Well, it sort of works:
"The attempts to parse ASCII digits and ASCII digits specified as Unicode code values succeed."
So, in other words, Decimal.Parse can handle 16-bit characters.
But it can't (yet?) parse digits at code points above U+00FF.
It's good that this behavior is documented. This is the kind of grey area which customers swear is a bug: "If you can ACCEPT U+0660, why can't you PARSE it??"
... but which developers swear is creeping featurism*: "It's simple for customers to modify the input string... just subtract (U+0660 - U+0030) for every character in the range U+0660 to U+0669, etc."
It doesn't seem technically difficult to write a Decimal.ParseUnicode that could handle all of these:
http://www.fileformat.info/info/unicode/category/Nd/list.htm
but it may be more trouble than it's worth!
*
http://en.wikipedia.org/wiki/Creeping_featurism
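The "subtract an offset" workaround and the imagined Decimal.ParseUnicode can be sketched together. What follows is Python rather than .NET, purely for illustration: `to_ascii_digits` and `parse_unicode_decimal` are made-up names, and the sketch assumes that the standard `unicodedata` module's notion of [Nd] is the one wanted. Instead of hardcoding one offset per script, it asks the character database for each digit's value.

```python
import unicodedata
from decimal import Decimal


def to_ascii_digits(text: str) -> str:
    """Replace every [Nd] character with its ASCII equivalent; leave the rest alone."""
    out = []
    for ch in text:
        if unicodedata.category(ch) == "Nd":
            # unicodedata.decimal gives the 0-9 value of a decimal digit character.
            out.append(chr(ord("0") + unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)


def parse_unicode_decimal(text: str) -> Decimal:
    """Hypothetical stand-in for a Decimal.ParseUnicode: normalize, then parse."""
    return Decimal(to_ascii_digits(text))
```

Normalizing first and parsing second keeps the real parser untouched, which is roughly the trade-off the "modify the input string" camp is arguing for.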
# Maurits [MSFT] on 26 Apr 2006 6:30 PM:
Hmmm... is there even a way to know that U+01d7ff is a digit, short of hardcoding it? GetStringTypeW(...) doesn't work for supplemental code points. Is there a new character-class-looker-upper-thing in Vista?
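For comparison, here is what such a lookup looks like in an environment that does ship the character database. This is a Python sketch using the standard `unicodedata` module, not a Win32 or Vista API, and it says nothing about what GetStringTypeW does internally.

```python
import unicodedata

ch = chr(0x1D7FF)  # a supplementary-plane code point, outside the BMP

name = unicodedata.name(ch)          # the character's formal Unicode name
category = unicodedata.category(ch)  # the general category, such as "Nd" or "Lu"
value = unicodedata.decimal(ch)      # the decimal digit value, if it has one

# In UTF-16 the same character is a surrogate pair, so a lookup that works
# one 16-bit unit at a time would see two surrogates rather than a digit.
utf16_unit_count = len(ch.encode("utf-16-be")) // 2

print(name, category, value, utf16_unit_count)
```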
# Maurits [MSFT] on 26 Apr 2006 6:39 PM:
# Maurits [MSFT] on 26 Apr 2006 7:20 PM:
Hey all those digits are assigned in consecutive blocks, zero to nine! I could write a completely [Nd]-driven Decimal.Parse as follows:
int Decimal.Parse(string) {
    blow up if string is null or "";
    int i = 0;
    for each Unicode character (not UTF-16 character) in the string {
        i *= 10;
        i += DecimalValueOf(that character);
    }
    return i;
}
int DecimalValueOf(Unicode character) { // not UTF-16 character
    blow up unless character is in [Nd] class;
    int i = 1;
    for (;;)
    {
        check the character class of (character - i);
        if it's not Nd, break; // stepped off the front of the block
        i++;
    }
    return (i - 1) % 10; // allow for consecutive blocks of 10
}
I worry about (little-endian vs. big-endian)/(LTR vs. RTL)... that should be tested.
Also this would break if a future block of digits was nonconsecutive. :(
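A runnable version of the same idea, as a sketch in Python. The names `is_nd`, `decimal_value_of` and `parse_nd` are illustrative only; the sketch leans on the assumption stated above that every run of [Nd] code points is made of whole blocks of ten with zero first.

```python
import sys
import unicodedata


def is_nd(code_point: int) -> bool:
    """True if the code point exists and is in general category Nd."""
    if not 0 <= code_point <= sys.maxunicode:
        return False
    return unicodedata.category(chr(code_point)) == "Nd"


def decimal_value_of(ch: str) -> int:
    """Digit value inferred only from position within a run of Nd code points."""
    cp = ord(ch)
    if not is_nd(cp):
        raise ValueError(f"U+{cp:04X} is not in [Nd]")
    i = 1
    # Walk backwards until we step off the front of the run.
    while is_nd(cp - i):
        i += 1
    return (i - 1) % 10  # runs are assumed to be whole blocks of ten


def parse_nd(text: str) -> int:
    if not text:
        raise ValueError("empty input")
    total = 0
    for ch in text:  # Python iterates by code point, not by UTF-16 unit
        total = total * 10 + decimal_value_of(ch)
    return total
```

Comparing `decimal_value_of` against `unicodedata.decimal` over every [Nd] character is one way to test whether the consecutive-blocks assumption holds for a given version of the character database.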
# Michael S. Kaplan on 26 Apr 2006 11:02 PM:
There is no unmanaged equivalent for full UCD information that includes supplementary characters, though it is the sort of thing that is being considered for the future....
# Maurits [MSFT] on 27 Apr 2006 11:23 AM:
I can understand why native full support for supplementary characters would be a relatively low priority, given that
* Hardly anybody uses them
* They're hard (especially since natively everything's UTF-16)
* Any app that really cares could include
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt
and a parsing routine
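The "parsing routine" in the last bullet could be quite small. The Python sketch below is one guess at its shape: `load_numeric_values` is a hypothetical helper, and the embedded sample lines only stand in for the real file so that the block runs on its own. It relies on the file's semicolon-separated layout, where the first field is the code point, the third is the general category, and the ninth is the numeric value.

```python
from fractions import Fraction

# A few lines in UnicodeData.txt's semicolon-separated format, embedded
# here only so the sketch runs without the real file.
SAMPLE = """\
0030;DIGIT ZERO;Nd;0;EN;;0;0;0;N;;;;;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0669;ARABIC-INDIC DIGIT NINE;Nd;0;AN;;9;9;9;N;;;;;
0F33;TIBETAN DIGIT HALF ZERO;No;0;L;;;;-1/2;N;;;;;
"""


def load_numeric_values(lines):
    """Map code point -> (general category, numeric value) for entries that have one."""
    table = {}
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 9 or not fields[8]:
            continue  # no numeric value recorded for this character
        code_point = int(fields[0], 16)
        # Field 2 is the general category; field 8 is the numeric value,
        # written as an integer or as a fraction such as "1/2" or "-1/2".
        table[code_point] = (fields[2], Fraction(fields[8]))
    return table


table = load_numeric_values(SAMPLE.splitlines())
```

With the real file, the same function could be fed an open file handle instead of `SAMPLE.splitlines()`.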
# Maurits [MSFT] on 27 Apr 2006 12:09 PM:
# Michael S. Kaplan on 27 Apr 2006 12:10 PM:
Well, I am not saying that it isn't important -- but how to support it (and properties in general) given the function we have (GetStringTypeW) is not an easy task to do.
Though I have my own opinions, and I may blog about them some day....
# Maurits [MSFT] on 27 Apr 2006 2:30 PM:
> UnicodeData.txt even includes the numeric value of the N* characters
Except for two...
U+09F8 BENGALI CURRENCY NUMERATOR ONE LESS THAN THE DENOMINATOR
U+2183 ROMAN NUMERAL REVERSED ONE HUNDRED
# Maurits [MSFT] on 27 Apr 2006 2:54 PM:
# Maurits [MSFT] on 27 Apr 2006 8:19 PM:
This is some good documentation, right here:
http://msdn2.microsoft.com/en-us/library/fw9t1kbk(VS.80).aspx
I especially like that it points out the U+0F33 TIBETAN DIGIT HALF ZERO case. Important because it illustrates the necessity of checking the return for EXACTLY the sentinel value -1, and not just broadly checking < 0.
if (GetDigitValue(ch) == -1) // OK
if (GetDigitValue(ch) < 0) // ERROR: misses U+0F33 !!
One of the very rare instances where using == on a double is the Right Thing... though I suppose
if (abs(GetDigitValue(ch) + 1) < 0.005) // digit value is -1ish
would also work
# Maurits [MSFT] on 27 Apr 2006 8:28 PM:
Er, I meant GetNumericValue(ch), of course.
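The sentinel pitfall is easy to reproduce. In the Python sketch below, `get_numeric_value` is a hypothetical analogue that returns -1.0 for "not numeric", as described above; it is not the .NET method itself.

```python
import unicodedata


def get_numeric_value(ch: str) -> float:
    """Hypothetical analogue: the character's numeric value, or -1.0 if it has none."""
    return unicodedata.numeric(ch, -1.0)


def is_numeric_exact(ch: str) -> bool:
    # Compare against the sentinel itself.
    return get_numeric_value(ch) != -1.0


def is_numeric_sloppy(ch: str) -> bool:
    # Treats every negative result as "not numeric", so it wrongly
    # rejects characters whose real numeric value is below zero.
    return get_numeric_value(ch) >= 0


half_zero = "\u0f33"  # TIBETAN DIGIT HALF ZERO
print(get_numeric_value(half_zero))
print(is_numeric_exact(half_zero), is_numeric_sloppy(half_zero))
```

The exact comparison is safe here because the sentinel is a value the function returns literally rather than the result of arithmetic.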