The non-ASCII solution to the .NET Unicode Puzzle

by Michael S. Kaplan, published on 2007/03/04 03:10 -05:00, original URI:

So anyway, I was pointed to Chris Mullins' .NET Unicode Puzzle and was struck by the irony of the use of the ASCII code page rather than the CharUnicodeInfo class (which I used for my own solution to the problem in Stripping Diacritics).

I don't mean the irony of how he went on and on about discovering the use of normalization in the solution. I mean, that isn't ironic, that just means he didn't see the article. But even regular readers can miss a post, let alone folks who don't read the blog. So that isn't ironic.

The irony for me was the way Chris went on in the end:

Whenever I drop into doing Unicode-related tasks, I'm always amazed at the sheer breadth of the Unicode standard. There is so much information in there, and so many powerful features that it's easy to quickly become overwhelmed.

It's easy too to forget that everything we do these days on a computer is leveraging Unicode. Pretty much everything is encoded in either UTF-8 or UTF-16 - all web pages, all XML documents, all text files stored on your hard disk. Unicode is at the heart of Windows, Linux, .Net & Java. Despite this, very few developers have any real understanding of what Unicode is, or how it works. I've been asking "What does that UTF-8 or UTF-16 mean that you've typed in a zillion times?" during interviews now for years, and have yet to ever get back the right answer (although I've sure had some creative responses!).

Isn't it just a little bit ironic that he says so much about the power of Unicode and how no one understands it, while the solution to the problem pivots through the ASCII encoding, which allows almost nothing in Unicode through?

For an example of the kind of character that his solution won't work for, see the rather irked sponsoring character, below! :-)
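To make the difference concrete, here is a minimal sketch in Python rather than .NET, so treat it as an analogue and not as either author's actual code. `strip_by_category` mirrors the general idea of the category-based approach (decompose, then drop nonspacing marks), with Python's `unicodedata.category` standing in for `CharUnicodeInfo`. `strip_via_ascii` assumes the ASCII pivot silently discards anything it cannot encode; the exact fallback used in the original puzzle solution is not reproduced here. Both function names are made up for illustration.

```python
import unicodedata


def strip_by_category(text):
    """Decompose, then drop only the nonspacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def strip_via_ascii(text):
    """Decompose, then pivot through ASCII (assumption: unencodable
    characters are discarded rather than replaced)."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", errors="ignore").decode("ascii")


# A Latin letter with a diacritic survives both approaches.
print(strip_by_category("caf\u00e9"))  # cafe
print(strip_via_ascii("caf\u00e9"))    # cafe

# U+04E2 decomposes to U+0418 plus U+0304 (combining macron).
# The category-based version keeps the Cyrillic base letter;
# the ASCII pivot throws the base letter away along with the mark.
print(strip_by_category("\u04e2") == "\u0418")  # True
print(strip_via_ascii("\u04e2") == "")          # True
```

The ASCII pivot only looks correct when the base letters happen to be ASCII, which is exactly why a Cyrillic letter has reason to be irked.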


This post brought to you by Ӣ (U+04e2, a.k.a. CYRILLIC CAPITAL LETTER I WITH MACRON)

# Dean Harding on 4 Mar 2007 5:51 PM:

Yeah, I saw that post on the unicode list, and I was following along right up until the point he decided to use the ASCII encoder...

At least this time I DID recall those posts you made about stripping diacritics :p~

# Michael S. Kaplan on 4 Mar 2007 9:53 PM:

I figured that one would stick. :-)

# Mihai on 5 Mar 2007 12:11 PM:

Because the understanding of Unicode is so fuzzy, and the terminology is often improperly used, I avoid asking questions like "what does that X mean" or "how do you define Y".

I tend to ask for stories on *how* stuff works (utf, surrogates, normalization, whatever).

# Chris Mullins on 5 Mar 2007 2:16 PM:

In all fairness, your solution using the CharUnicodeInfo class is certainly the better answer. If you don't mind, I'll link to it from my (now obsolete) blog entry...

I looked into this class a while back when I was working with string prep, and wasn't able to get the information I needed - specifically the Bidi information relevant to a codepoint. Since then, I've never looked at it again - to my loss, obviously.

I'm modifying my blog now... so that, at least for the historical record, I don't look like quite such a fool.

# Michael S. Kaplan on 5 Mar 2007 2:45 PM:

I don't mind, Chris. :-)

FWIW, the class has Bidi info, it just got made internal for some reason (see this post for an example of getting at it)....

# Chris Mullins on 5 Mar 2007 3:06 PM:

I don't think I can use that BiDi solution, as it requires significant user privileges in order to call private methods via reflection. I've currently got all the BiDi info from RFC 3454 in tables, and am doing table-based lookups to determine BiDi information. I guess I'll have to stick with that.

Any chance that .NET 2.0 SP1 will publicly expose these methods?

# Michael S. Kaplan on 5 Mar 2007 3:28 PM:

There are no current plans for that (it is hard to add features in service packs), but I'll push to see if this could be added at the next appropriate point....

referenced by

2007/08/17 Normalize Wide Shut
