Those letters are stripping off their diacritics in public again, the sluts!

by Michael S. Kaplan, published on 2006/09/22 09:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/09/22/765618.aspx


Have you ever noticed how the bigger fan someone seems to be of your blog, the more likely they are to ask a question about something that is actually already covered in some prior blog post? :-)

Like the question Charles Bocock asked the other day, which I covered in They speak English in other places, too.

Like the other day when Feroze asked me:

...I have been reading your excellent blog on glob/loc issues and use it as a resource to clarify any questions that I have.

In our product, we need to generate... a map of users in the corporate directory. This map is used to lookup users by punching their lastnames on the telephone keypad. For eg, to lookup GATES, a caller would punch in 42837 on the telephone keypad.

The issue that we are facing is that there are some names which have letters with Accents/Circumflex/Diaeresis etc. For such letters we want to use the equivalent character from the ASCII charset. For eg, for Latin letter A with Acute, we want to use ‘A’.

Is there a simple framework API that will do this transformation for us, or do we have to write a mapping table for all Unicode codepoints?

Thanks for your valuable time,

Charles Bocock then actually asked another question in the Suggestion Box that was hauntingly similar:

Thanks for your excellent reply to my last question about an English version of Windows ;)

OK, here is something else I could never find a solution for. I used to work with SMS (text messages) a lot, and I tried to find a solution to get the base character from a combination (character with diacritic).

The reason for this is because text messages are costly, can only contain 160 characters, and the character set is missing a bunch of useful diacritic combinations (no circumflex for example).

GSM Character Set:
http://www.csoft.co.uk/sms/character_sets/gsm.htm

You can send messages in Unicode, but then you're down to only 80 characters per message.

An option we wanted was to strip diacritics from some characters (e.g. ê -> e).

Is there a way of doing this in .NET?

Of course both Feroze and Charles may have just been imitating Dean Harding, who did the very same thing and inspired Stripping out diacritics, redux in August of last year.

Dean's was even funnier since he mentioned how he had been reading my blog for years (it was not really that old at the time!). I think everyone was just trying to be polite.

And all three of them are basically answered by Stripping Diacritics from February of last year, which points out that normalizing to form D and then removing the combining marks (the characters whose Unicode category is non-spacing mark) will do a great deal of what they want here.
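For anyone who would rather see it than go digging for the old post, here is a minimal sketch of that idea in Python (not the code from the original post, and not .NET). The function names strip_diacritics and keypad_digits and the keypad table are my own illustrations, and the table assumes the usual telephone letter layout (2 = ABC through 9 = WXYZ). Note that it only does "a great deal" of the job: letters with no canonical decomposition, such as ø, come through untouched.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Normalize to form D, then drop the non-spacing marks (category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Hypothetical keypad table, assuming the common 2=ABC ... 9=WXYZ layout.
_KEYPAD_GROUPS = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
_KEYPAD = {
    letter: digit
    for digit, letters in _KEYPAD_GROUPS.items()
    for letter in letters
}

def keypad_digits(name: str) -> str:
    """Map a name to keypad digits, silently skipping anything not in A-Z."""
    base = strip_diacritics(name).upper()
    return "".join(_KEYPAD[ch] for ch in base if ch in _KEYPAD)

if __name__ == "__main__":
    print(strip_diacritics("ê"))
    print(keypad_digits("GATES"))
```

Skipping unmapped characters is just one possible choice; a real directory lookup might prefer to flag such names for a hand-written mapping instead.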

Bob only knows I never read this blog, and I never search in it unless I am looking for a post I know is there. So I can see where folks are coming from in terms of not knowing about the post with the answer.

But then on the other hand I never claimed I liked myself, either. :-)

 

This post brought to you by ᾇ (U+1f87, a.k.a. GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI)


# Dean Harding on 24 Sep 2006 7:34 PM:

It just felt like years :p~

# Michael S. Kaplan on 24 Sep 2006 7:50 PM:

I've had relationships like that from time to time....

# Charles Bocock on 25 Sep 2006 6:46 AM:

Damn, there are no emoticons on here with little embarrassed red cheeks for me to use :)

