Stripping out diacritics, redux

by Michael S. Kaplan, published on 2005/08/02 02:56 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/01/446486.aspx


This last week, Dean Harding asked in the suggestion box:

Hey Michael, after all these years of reading your blog I finally got a question for you (your topics have always been so well-covered that I never needed to suggest anything before, but now I got a specific problem which I hope you can help me with :)

Anyway, one of the most interesting parts of my job is that I get to do a lot of interfacing with SMS (and I use the term "interesting" as in that old Chinese proverb ;) and one of the things about SMS is that it has a very limited character set.

Now, one of the new applications we're setting up is essentially a database of live events (shows, bands, DJs, etc) which you can access via the web and via SMS. Occasionally, however, you'll get a band with non-US-ASCII characters in their name (most popular are of course the famous Umlauts!) but of course accented characters and so forth cannot be displayed in SMS.

Now, we don't want to miss out the Umlauts and such for the web interface, but for SMS we don't have much choice. So there's a couple of solutions. First is we have two fields in the database, the "web" name and the "sms" name - but that's no good cause it means we have to keep both up-to-date.

The solution I was hoping to go for was to do a simple run through an Encoding.GetBytes followed by Encoding.GetString with a US-ASCII encoding. My hope was that this would be equivalent to WideCharToMultiByte followed by MultiByteToWideChar /without/ the WC_NO_BEST_FIT_CHARS flag which would convert all the accented characters to their non-accented equivalents. But that doesn't seem to be the case - they get converted to ?'s which is no good.

I was hoping for a .NET-only solution, but it looks like I'll have to p/invoke the WideCharToMultiByte/MultiByteToWideChar calls. Unless you've got some good news for me :)

Well, I did start posting in November of last year, so technically it has been "years", but it really has been less than a full year that I have been blogging, Dean. :-)

But I can definitely speak against using encoding support directly to support the plan here -- mainly because the "best fit" support in the Win32 encoding API is not really a completely firm way to take out all of the diacritics!

Offhand, I would say the best way is the approach in the Stripping Diacritics... post I did this last February, which will handle this case quite well and quite a bit more completely than the Win32 encoding APIs in concert will do.
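For anyone who wants the general shape of that kind of technique, here is a minimal sketch in Python (not .NET, and not necessarily the code from the February post). It assumes the usual normalization-based approach: decompose the text to Normalization Form D, drop the nonspacing marks (Unicode category Mn), and recompose what is left. The name strip_diacritics is made up for this illustration; in .NET the analogous building blocks would be String.Normalize and CharUnicodeInfo.GetUnicodeCategory.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks by decomposing, filtering, and recomposing."""
    # Form D splits a precomposed letter such as U+00E8 into its base
    # letter "e" followed by U+0300 COMBINING GRAVE ACCENT.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep everything except nonspacing marks (category "Mn").
    kept = [ch for ch in decomposed if unicodedata.category(ch) != "Mn"]
    # Recompose so any remaining sequences are back in Form C.
    return unicodedata.normalize("NFC", "".join(kept))


if __name__ == "__main__":
    print(strip_diacritics("Mot\u00f6rhead"))
```

Note that letters with no canonical decomposition, such as "Ø" or "ß", pass through this unchanged, so a pure-ASCII target would still need some extra handling for them.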

Or if you really wanted to do it through encoding, you could use the .NET Framework 2.0 support for custom encoding fallbacks with the ASCII encoding to simply drop anything you wanted to and replace it with whatever you like, including the ASCII-fied version of text sans diacritics....
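The fallback idea can be sketched outside .NET as well. The Python example below is only an analogy, not the .NET 2.0 API: Python's codecs.register_error plays roughly the role that a custom EncoderFallback would play in .NET, getting called for each run of characters the ASCII encoder cannot handle. The handler name fold_to_ascii and the helper to_sms_ascii are hypothetical, and substituting "?" for letters that have no ASCII base is an assumption made for this sketch.

```python
import codecs
import unicodedata


def fold_to_ascii(error):
    """Encode-error handler: swap unencodable characters for ASCII bases."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    for ch in error.object[error.start:error.end]:
        # Decompose the character and throw away its combining marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if unicodedata.category(c) != "Mn"
        )
        # A lone combining mark leaves an empty base and is dropped;
        # anything that is still non-ASCII becomes "?" (sketch assumption).
        pieces.append(base if base.isascii() else "?")
    # Return the replacement text and the position to resume encoding at.
    return "".join(pieces), error.end


codecs.register_error("fold_to_ascii", fold_to_ascii)


def to_sms_ascii(text: str) -> str:
    """Round-trip text through ASCII using the custom handler."""
    return text.encode("ascii", errors="fold_to_ascii").decode("ascii")


if __name__ == "__main__":
    print(to_sms_ascii("M\u00f6tley Cr\u00fce"))
```

The appeal of doing it at the encoding layer is that the stored name stays untouched and the folding happens only at the point where the text is forced into ASCII.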

Is that close enough to good news? At least since I include the warning about using the Win32 functions? :-)

 

This post brought to you by "è" (U+00e8, a.k.a. LATIN SMALL LETTER E WITH GRAVE)
A character that might have resented having its grave stripped from it, but then realized that it meant in your application it might be a little further from the grave due to the joy of pronunciational ambiguities!


# Dean Harding on 2 Aug 2005 3:40 AM:

Heh, OK fair enough, it hasn't been years :)

Hmm, I must've missed (or at least forgot about) that other post of yours. I was thinking about this a bit over the weekend, and I was thinking that I could probably do something like that as well.

The other thing is, of course, that SMS does allow for *some* diacritics on *some* characters, but only half a dozen or so. It's a bit unfortunate that we're not using .NET 2.0 yet, cause it's got some really nice features that I could have made use of!

Ah well, perhaps what I'll have to do is the FoldString method with a bit of custom logic to allow for the couple of "extended" characters allowed by the SMS protocol...

Thanks for the advice, anyway - much obliged!
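The "strip, but with exceptions" idea in this comment might look something like the following Python sketch. Everything here is an assumption for illustration: SMS_EXTRA is a hand-written subset of the accented letters in the GSM 03.38 default alphabet and should be checked against the specification, all of ASCII is treated as safe for simplicity (which is not strictly true for SMS), and strip_for_sms is a made-up name.

```python
import unicodedata

# Assumption: an illustrative subset of the accented letters in the
# GSM 03.38 default alphabet; verify against the spec before relying on it.
SMS_EXTRA = frozenset("àèéìòùÄÖÜäöüÅåÆæÇÉÑñØøß")


def strip_for_sms(text: str) -> str:
    """Keep SMS-safe characters and strip diacritics from everything else."""
    out = []
    # Compose first so "e" + combining acute is seen as the single letter é.
    for ch in unicodedata.normalize("NFC", text):
        # Simplification: treat every ASCII character as SMS-safe.
        if ch in SMS_EXTRA or ord(ch) < 128:
            out.append(ch)
            continue
        # Otherwise fall back to the base letter without combining marks.
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if unicodedata.category(c) != "Mn"
        )
        # "?" for letters with no ASCII base is a choice made for this sketch.
        out.append(base if base.isascii() else "?")
    return "".join(out)


if __name__ == "__main__":
    print(strip_for_sms("Sigur R\u00f3s"))
```

With this table, "ö" survives because it is on the allowed list, while "ó" is reduced to "o" because it is not.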

# Michael S. Kaplan on 2 Aug 2005 5:28 AM:

Think of my blog as being like the world of film, where you can make ten minutes seem like eight! :-)

# Crissov on 26 Aug 2005 6:28 PM:

Um, doesn’t GSM-SMS support UCS-2? Of course with a decreased max length of 70 characters (1120/16 = 70, 1120/7 = 160). Actual implementation in phones might be a problem, though.

