by Michael S. Kaplan, published on 2005/08/02 02:56 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/01/446486.aspx
This last week, Dean Harding asked in the suggestion box:
Hey Michael, after all these years of reading your blog I finally got a question for you (your topics have always been so well-covered that I never needed to suggest anything before, but now I got a specific problem which I hope you can help me with :)
Anyway, one of the most interesting parts of my job is that I get to do a lot of interfacing with SMS (and I use the term "interesting" as in that old Chinese proverb ;) and one of the things about SMS is that it has a very limited character set.
Now, one of the new applications we're setting up is essentially a database of live events (shows, bands, DJs, etc) which you can access via the web and via SMS. Occasionally, however, you'll get a band with non-US-ASCII characters in their name (most popular are of course the famous Umlauts!) but of course accented characters and so forth cannot be displayed in SMS.
Now, we don't want to miss out the Umlauts and such for the web interface, but for SMS we don't have much choice. So there's a couple of solutions. First is we have two fields in the database, the "web" name and the "sms" name - but that's no good cause it means we have to keep both up-to-date.
The solution I was hoping to go for was to do a simple run through an Encoding.GetBytes followed by Encoding.GetString with a US-ASCII encoding. My hope was that this would be equivalent to WideCharToMultiByte followed by MultiByteToWideChar /without/ the WC_NO_BEST_FIT_CHARS flag which would convert all the accented characters to their non-accented equivalents. But that doesn't seem to be the case - they get converted to ?'s which is no good.
I was hoping for a .NET-only solution, but it looks like I'll have to p/invoke the WideCharToMultiByte/MultiByteToWideChar calls. Unless you've got some good news for me :)
Well, I did start posting in November of last year, so technically it has been "years", but it really has been less than a full year that I have been blogging, Dean. :-)
But I can definitely speak against using encoding support directly to support the plan here -- mainly because the "best fit" support in the Win32 encoding API is not really a completely firm way to take out all of the diacritics!
Offhand, I would say the best way is the approach from the Stripping Diacritics... post I did this last February, which will handle this case quite well and quite a bit more completely than the Win32 encoding APIs in concert will do.
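For anyone who wants to see the general shape of that idea, here is a minimal sketch in Python rather than .NET. It is not the code from that post; it just assumes the usual technique of normalizing to a decomposed form (NFD) and dropping the nonspacing marks. The function name strip_diacritics is made up for the illustration.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Sketch: decompose the text, then drop combining (nonspacing) marks."""
    # NFD splits a precomposed letter such as U+00E8 into 'e' + U+0300.
    decomposed = unicodedata.normalize("NFD", text)
    # General category "Mn" is Mark, Nonspacing: the accents themselves.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
```

Note that letters with no canonical decomposition (for example "ø", U+00F8) pass through unchanged, so the result is not guaranteed to be pure ASCII and still needs an encoding step or a lookup table after it.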
Or if you really wanted to do it through encoding, you could use the .NET Framework 2.0 support for custom encoding fallbacks with the ASCII encoding to simply drop anything you wanted to and replace it with whatever you like, including the ASCII-fied version of the text sans diacritics....
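The fallback idea is not specific to .NET. As a rough analogy only (this is Python's codec error-handler mechanism, not the .NET fallback classes), the sketch below registers a handler that the ASCII encoder calls whenever it hits a character it cannot encode. The handler name "ascii_fold" and the helper to_sms_ascii are hypothetical, and the choice to fall back to "?" when nothing ASCII survives is an assumption, not a recommendation.

```python
import codecs
import unicodedata

def _ascii_fold(error):
    """Error handler: replace each unencodable character with its ASCII base."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    pieces = []
    # The encoder may hand over a run of several unencodable characters.
    for ch in error.object[error.start:error.end]:
        # Keep only the ASCII part of the decomposed character, if any.
        base = "".join(c for c in unicodedata.normalize("NFD", ch) if ord(c) < 128)
        pieces.append(base or "?")  # assumption: "?" when nothing is left
    # Return the replacement text and the position at which to resume encoding.
    return ("".join(pieces), error.end)

codecs.register_error("ascii_fold", _ascii_fold)

def to_sms_ascii(text: str) -> str:
    """Round-trip through ASCII using the custom handler (hypothetical helper)."""
    return text.encode("ascii", errors="ascii_fold").decode("ascii")
```

With this in place, a name like "Mötley Crüe" comes back without its umlauts instead of with question marks, while a character that has no ASCII base at all still ends up as "?".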
Is that close enough to good news? At least since I include the warning about using the Win32 functions? :-)
This post brought to you by "è" (U+00e8, a.k.a. LATIN SMALL LETTER E WITH GRAVE)
A character that might have resented having its grave stripped from it, but then realized that it meant that in your application it might be a little further from the grave, due to the joy of pronunciational ambiguities!