Stripping diacritics....

by Michael S. Kaplan, published on 2005/02/19 09:10 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/02/19/376617.aspx


Well, Jochen Neyens asked:

What's the easiest way to remove diacritic marks from characters using C#? I would like to have following function:

string RemoveDiacriticMark(string c)

Sample use:

RemoveDiacriticMark("é") -> "e"

RemoveDiacriticMark("ü") -> "u"

RemoveDiacriticMark("à") -> "a"

Well, there is not really an easy way to do it until Whidbey. With Whidbey, though, you can use normalization and Unicode character properties (discussed previously in "FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)" and "A little bit about the new CharUnicodeInfo class") to build something simple that does it all!

namespace Remove {
  using System;
  using System.Text;
  using System.Globalization;
  class Remove {
    [STAThread]
    static void Main(string[] args) {
      foreach(string st in args) {
        Console.WriteLine(RemoveDiacritics(st));
      }
    }

    static string RemoveDiacritics(string stIn) {
      string stFormD = stIn.Normalize(NormalizationForm.FormD);
      StringBuilder sb = new StringBuilder();

      for(int ich = 0; ich < stFormD.Length; ich++) {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if(uc != UnicodeCategory.NonSpacingMark) {
          sb.Append(stFormD[ich]);
        }
      }

      return(sb.ToString());
    }
  }
}

Just put it in a file (remove.cs), compile it in Whidbey:

c:\temp\samples>csc remove.cs

and then run it!

c:\temp\samples>remove âãäåçèéêë ìíîïðñòó ôõöùúûüý
aaaaceeee
iiiidnoo
ooouuuuy

Now in prior versions your options are more limited, though a p/invoke to the FoldString API with the MAP_COMPOSITE flag will do the decomposition. There is also no CharUnicodeInfo class for information on Unicode properties, but you could use a regular expression instead (\p{Mn} will give you the equivalent category). I will leave doing the regular expression as an exercise for the reader....
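For readers who want to check their answer to the exercise, here is a minimal sketch of the regular-expression variant. It assumes the string is decomposed with Normalize purely to keep the example self-contained; on earlier versions the decomposed string would have to come from somewhere else, such as FoldString. The class name RegexStrip is made up for illustration.

```csharp
using System.Text;
using System.Text.RegularExpressions;

static class RegexStrip {
  // Decompose to Form D, then delete every run of characters in
  // Unicode category Mn (NonSpacingMark).
  public static string RemoveDiacritics(string stIn) {
    string stFormD = stIn.Normalize(NormalizationForm.FormD);
    return Regex.Replace(stFormD, @"\p{Mn}+", "");
  }
}
```

This should behave like the loop over CharUnicodeInfo.GetUnicodeCategory above, since both drop exactly the characters whose category is Mn.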

Enjoy!

This post brought to you by "û" (U+00fb, a.k.a. LATIN SMALL LETTER U WITH CIRCUMFLEX) 


# Johan Petersson on Saturday, February 19, 2005 11:29 AM:

In what context would such a transformation ever be useful? I can understand the need for lossy remapping, e.g. from a Unicode encoding to ASCII, but not blindly stripping diacritics.

Sorry if this is a stupid question; I'm just curious.

# Michael Kaplan on Saturday, February 19, 2005 11:55 AM:

Not stupid at all -- in most contexts you would probably be correct. And as I have said previously (cf: http://blogs.msdn.com/michkap/archive/2005/02/05/367666.aspx) sometimes doing so would be destructive of language content.

This was kind of code to spec, based on a question. I had a moment to kill and I realized it would show off new features, so.... :-)

# Jochen Neyens on Sunday, February 20, 2005 12:35 AM:

I've a webform that displays a company list with A B C ... Z paging instead of 1 2 3 ... N. After adding a new company name using the admin interface, I would like to jump to the corresponding letter page. Suppose someone enters "één" as the company name; then I'd have to jump to letter page E. So, I need something like:

JumpToPage(RemoveDiacritics(Left(cmpname,1)));

The application is written for .NET 1.1, so I cannot use the new Whidbey feature (although it's nice to know of its existence)!

I'll have a go at the FoldString API and see how far I can get. This would actually be the first time I'd use p/invoke to call a Win32 DLL :-)

Thanks for answering my question Michael!

PS: If someone reading this blog already has a .NET 1.1 compliant RemoveDiacritics function, it would be nice to have it posted here...

# Michael Kaplan on Sunday, February 20, 2005 1:23 AM:

Well, you will want to be careful -- since for some people those "letters with diacritics" are actually letters in their own right....

# Johan Petersson on Sunday, February 20, 2005 10:51 AM:

Thanks for explaining Jochen. There are company names starting with non-letters too (like digits or @), which might be a problem. Basing the page index on first characters actually in use for company names might be an alternative.
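To make the paging idea and the non-letter concern concrete, here is a rough sketch of picking a page key from a company name. It assumes Whidbey's Normalize is available (so it would not run on .NET 1.1 as written), and both the PageKey helper and the "#" catch-all page are hypothetical names, not anything from the application described above.

```csharp
using System.Text;

static class CompanyPaging {
  // Hypothetical helper: returns "A".."Z" for the letter page a name belongs on,
  // or "#" (an assumed catch-all page) when the name does not start with a
  // character that reduces to an ASCII letter.
  public static string PageKey(string name) {
    if (name == null) return "#";
    name = name.TrimStart();
    // Empty names and supplementary-plane characters go to the catch-all page.
    if (name.Length == 0 || char.IsSurrogate(name[0])) return "#";

    // Decompose only the first character; the base letter comes first in Form D.
    string stFormD = name.Substring(0, 1).Normalize(NormalizationForm.FormD);
    char first = char.ToUpperInvariant(stFormD[0]);
    return (first >= 'A' && first <= 'Z') ? first.ToString() : "#";
  }
}
```

Letters that have no decomposition at all, such as "ð" or "ø", land on the catch-all page with this approach, which is one more reason to consider building the index from the first characters actually in use.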

# Alejandro Lapeyre on Thursday, March 17, 2005 12:07 AM:

I would use two strings:

string1="áéíóúñ"
string2="aeioun"

and replace the occurrences of the characters in the first string with the characters of the second string.
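A minimal sketch of that two-string idea, for what it is worth. The class and method names are made up for illustration; the two strings are the ones given above. It needs neither normalization nor CharUnicodeInfo, so it should also suit .NET 1.1.

```csharp
using System.Text;

static class TableStrip {
  // Parallel strings: the character at position i in From maps to
  // the character at position i in To.
  const string From = "áéíóúñ";
  const string To   = "aeioun";

  public static string Replace(string stIn) {
    StringBuilder sb = new StringBuilder(stIn.Length);
    foreach (char ch in stIn) {
      int i = From.IndexOf(ch);
      // Characters not listed in From are copied through unchanged.
      sb.Append(i >= 0 ? To[i] : ch);
    }
    return sb.ToString();
  }
}
```

The trade-off is coverage: only the precomposed characters listed in the first string are handled, so uppercase letters, other accents, and already-decomposed input pass through untouched unless the strings are extended.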

# Evan Stein on Thursday, March 17, 2005 12:28 PM:

First, there are hugely good reasons for stripping diacritics, mainly for the purpose of searching in plain ASCII. This is lossy stuff, but when you're going down to seven bits, that's the idea. The best method is to keep the full string for display and manipulation, and to maintain a search column alongside the original.

I have a function (actually, a class) that does this in the current C#. It's about 1000 lines, and a bit inconvenient to put in a blog, but if someone can tell me where to post the thing (or sends me a mail), I'd be delighted to share the code.

Here's how I did it (I’m in verbose mode here -- apologies):

1) Went to the Unicode database, which is actually a series of text files. Got the code ranges for the Latin set [Scripts.txt]. Got the character values from UnicodeData.txt. There are about 933 characters that qualify as Latin, in all its extensions and modifications.

2) Field 6 of UnicodeData.txt (starting from 1) has a decomposition map, and can be recursed until no more decomposition is possible. Wrote a routine to do this and write the results to a file.

3) This took care of the vast majority of values, except some of the IPA characters and the really outlandish ones, such as Anglo-Saxon characters descended from runes. There were about 100 cases where the Unicode Consortium played it safe and didn't suggest any decomposition values, and I don't blame them at all. Armed with the descriptions in UnicodeData.txt ("open, upside-down, backwards, small capital O with a ring, 2 tattoos and a piercing"), as well as a PDF showing what the characters look like, I did my best. I'd like to repeat that some of these are pretty outlandish, in case you skipped when I said that the first 4 times. I'll be shocked if anyone even notices what the choices were.

4) Originally I held the Unicode database in Oracle, since putting data into program code goes so much against the grain. I also did the recursion at runtime. But, for speed and distribution, you can't beat code, and you also can't beat pre-finished data. I got the finished data into a file, performed step 3, did some word processing ... and dropped it into C#.

5) Created a class with a static constructor, which loads the data only once. The data is in the form of a Hashtable, so that the Unicode character itself is the key to the fully normalized ASCII character. The speed seems reasonably good.

6) Wrote the StringStrip( ) function, which copies an input string to an output string character by character in a "for loop". If it encounters a character it doesn't have (i.e., non-Latin or punctuation), it simply copies that character to the output string without altering it. The one caveat is that some of the characters are digraphs (e.g., "Dz"), so if you're in a tight spot you'll have to measure the string you get back before using it.
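Steps 5 and 6 might look roughly like the sketch below. This is not the actual class described here, only a guess at its shape: the class name LatinStripper is invented, and the four table entries are illustrative stand-ins for the roughly 933 entries the real table would hold.

```csharp
using System.Collections;
using System.Text;

class LatinStripper {
  // Maps a single Unicode character to its ASCII replacement string.
  static readonly Hashtable map = new Hashtable();

  // Static constructor: runs once, before the class is first used.
  static LatinStripper() {
    // Illustrative entries only (assumed, not taken from the real table).
    map['\u00E9'] = "e";   // e with acute
    map['\u00FC'] = "u";   // u with diaeresis
    map['\u00F8'] = "o";   // o with stroke: no decomposition in UnicodeData.txt, hand-picked
    map['\u01F2'] = "Dz";  // Dz digraph: the output grows by one character
  }

  public static string StringStrip(string stIn) {
    StringBuilder sb = new StringBuilder(stIn.Length);
    for (int ich = 0; ich < stIn.Length; ich++) {
      object mapped = map[stIn[ich]];
      // Characters with no entry (non-Latin, punctuation) are copied unchanged.
      if (mapped != null) sb.Append((string)mapped);
      else sb.Append(stIn[ich]);
    }
    return sb.ToString();
  }
}
```

Storing strings rather than characters as the values is what lets a digraph such as "Dz" expand to two characters, which is also why the result can be longer than the input.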

The ink is still wet, but reading this blog spurred me to get it done. As I said, I'd be delighted to share this, especially if I get some tips back on my ham-handed coding. If you like, you can reach me at Evan@travelogues/DOT/net.

Regards,
Evan

# Francois Beauchemin on Thursday, April 07, 2005 1:59 PM:

Hi,

I'm trying to strip diacritics with the FoldString API. My code seems OK, but the regex does not work with the Mn category, only with the Sk category. What am I doing wrong?

[Flags]
private enum MapFlags
{
  MAP_FOLDCZONE   = 0x00000010, // fold compatibility zone chars
  MAP_PRECOMPOSED = 0x00000020, // convert to precomposed chars
  MAP_COMPOSITE   = 0x00000040, // convert to composite chars
  MAP_FOLDDIGITS  = 0x00000080  // all digits to ASCII 0-9
}

[DllImport("kernel32.dll", SetLastError=true)]
static extern int FoldString(MapFlags dwMapFlags, string lpSrcStr, int cchSrc,
  [Out] StringBuilder lpDestStr, int cchDest);

public static string RemoveDiacritics(string stIn) {
  StringBuilder sb = new StringBuilder();
  int ret = FoldString(MapFlags.MAP_COMPOSITE, stIn, stIn.Length, sb, stIn.Length * 2);
  return Regex.Replace(sb.ToString(), @"\p{Sk}", "");
}

# Michael S. Kaplan on Thursday, April 07, 2005 2:51 PM:

I do not know what characters you are referring to, but you can see the code -- it is only stripping UnicodeCategory.NonSpacingMark -- you would have to strip the other category too if you wanted it gone....

# Francois Beauchemin on Thursday, April 07, 2005 3:47 PM:

My problem is that FoldString with MAP_COMPOSITE returns a string with UnicodeCategory.ModifierSymbol instead of NonSpacingMark.

For example, the character û (0x00FB) is expanded to 0x0015 0x005E instead of 0x0015 0x0302.

Anyway, I think it's OK for my case (removing accented characters from French country names).

# Michael S. Kaplan on Thursday, April 07, 2005 7:28 PM:

So you could modify the code to look for both UnicodeCategory.ModifierSymbol and UnicodeCategory.NonSpacingMark, rather than just UnicodeCategory.NonSpacingMark as it does now, right?

FoldString is of course not based on normalization, as I explain at http://blogs.msdn.com/michkap/archive/2005/01/31/363701.aspx . :-)
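A sketch of that modification, assuming the string has already been decomposed by whatever means (FoldString, in this thread). The class and method names are invented for illustration.

```csharp
using System.Globalization;
using System.Text;

static class TwoCategoryStrip {
  // Drops characters in category Mn (NonSpacingMark) and category Sk
  // (ModifierSymbol) from an already-decomposed string.
  public static string StripMarks(string stDecomposed) {
    StringBuilder sb = new StringBuilder(stDecomposed.Length);
    foreach (char ch in stDecomposed) {
      UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(ch);
      if (uc != UnicodeCategory.NonSpacingMark && uc != UnicodeCategory.ModifierSymbol) {
        sb.Append(ch);
      }
    }
    return sb.ToString();
  }
}
```

Without CharUnicodeInfo, the regular expression @"[\p{Mn}\p{Sk}]" should select the same two categories. Either way there is a cost: category Sk includes ordinary characters such as "^" (U+005E) and "`" (U+0060), so any that were typed literally in the input are removed as well.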

# Quan Nguyen on Friday, April 15, 2005 2:17 AM:

Until Whidbey, another way to strip diacritics is to decompose the input string to NFD and then use a regex to delete all the Combining Diacritical Marks, such as:

Normalizer decomposer = new Normalizer(Normalizer.D, false);
string result = decomposer.normalize(inputString);
result = Regex.Replace(result, "\\p{IsCombiningDiacriticalMarks}+", "");

This is how it is done in VietPad.NET (http://vietpad.sf.net).

# Michael S. Kaplan on Friday, April 15, 2005 2:35 AM:

The only thing I know of called "Normalizer" out of Microsoft is the internal name for the Microsoft Access wizard that is officially called the Table Analyzer. It has nothing to do with Unicode normalization.

Since there is no class that will do normalization in .NET until Whidbey, I am not sure where this code would work....

# Quan Nguyen on Tuesday, April 19, 2005 11:32 PM:

I'm sorry, Normalizer is one of Unicode (ICU) Java classes that I ported to C#. It performs Unicode Normalization Forms like those that are going to be supported in Whidbey.

The point is you can strip the diacritics simply by deleting them using Regexp, rather than checking the UnicodeCategory of every character.

