Stripping diacritics....

by Michael S. Kaplan, published on 2005/02/19 12:10 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/19/376617.aspx


Well, Jochen Neyens asked:

What's the easiest way to remove diacritic marks from characters using C#? I would like to have following function:

string RemoveDiacriticMark(string c)

Sample use:

RemoveDiacriticMark("é") -> "e"

RemoveDiacriticMark("ü") -> "u"

RemoveDiacriticMark("à") -> "a"

Well, there is not really an easy way to do it until Whidbey, but with Whidbey you can use normalization and Unicode character properties (discussed previously in "FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)" and "A little bit about the new CharUnicodeInfo class") to build something simple to do it all!

WARNING: This code has been improved! Get the improved version from this other post.

namespace Remove {
  using System;
  using System.Text;
  using System.Globalization;
  class Remove {
    [STAThread]
    static void Main(string[] args) {
      foreach(string st in args) {
        Console.WriteLine(RemoveDiacritics(st));
      }
    }

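    // Decompose to Form D so each accent becomes a separate combining
    // character, then copy everything that is not a non-spacing mark (Mn).
    static string RemoveDiacritics(string stIn) {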
      string stFormD = stIn.Normalize(NormalizationForm.FormD);
      StringBuilder sb = new StringBuilder();

      for(int ich = 0; ich < stFormD.Length; ich++) {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if(uc != UnicodeCategory.NonSpacingMark) {
          sb.Append(stFormD[ich]);
        }
      }

      return(sb.ToString());
    }
  }
}

Just put it in a file (remove.cs), compile it in Whidbey:

c:\temp\samples>csc remove.cs

and then run it!

c:\temp\samples>remove âãäåçèéêë ìíîïðñòó ôõöùúûüý
aaaaceeee
iiiiðnoo
ooouuuuy

Now in prior versions your options are more limited, though a p/invoke to the FoldString API with the MAP_COMPOSITE flag is one way to decompose the string. There is also no CharUnicodeInfo class for information on Unicode properties, but you could use a regular expression instead (using \p{Mn} will give you the equivalent category). I will leave doing the regular expression as an exercise for the reader....
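One possible take on that exercise, as a sketch rather than the definitive answer: the \p{Mn} class stands in for the UnicodeCategory.NonSpacingMark check. To keep the sample self-contained it still calls String.Normalize (so it needs Whidbey or later); on an older framework the decomposed string would have to come from somewhere else. The class name RegexStrip is made up for illustration.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

static class RegexStrip {
    // \p{Mn} matches the Unicode general category Mn (NonSpacingMark).
    static readonly Regex NonSpacingMarks = new Regex(@"\p{Mn}+");

    public static string RemoveDiacritics(string input) {
        // Assumption: Normalize is available. The regex only replaces the
        // per-character category loop, not the decomposition step.
        string decomposed = input.Normalize(NormalizationForm.FormD);
        return NonSpacingMarks.Replace(decomposed, "");
    }
}
```

Like the loop version, this leaves alone any character that has no canonical decomposition.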

Enjoy!

This post brought to you by "û" (U+00fb, a.k.a. LATIN SMALL LETTER U WITH CIRCUMFLEX) 


# Johan Petersson on 19 Feb 2005 11:29 AM:

In what context would such a transformation ever be useful? I can understand the need for lossy remapping, e.g. from a Unicode encoding to ASCII, but not blindly stripping diacritics.

Sorry if this is a stupid question; I'm just curious.

# Michael Kaplan on 19 Feb 2005 11:55 AM:

Not stupid at all -- in most contexts you would probably be correct. And as I have said previously (cf: http://blogs.msdn.com/michkap/archive/2005/02/05/367666.aspx) sometimes doing so would be destructive of language content.

This was kind of code to spec, based on a question. I had a moment to kill and I realized it would show off new features, so.... :-)

# Jochen Neyens on 20 Feb 2005 12:35 AM:

I've a webform that displays a company list with A B C .. Z paging instead of 1 2 3 ... N. After adding a new company name, using the admin interface, I would like to jump to the corresponding letter page. Suppose someone enters "één" as company name then I'd have to jump to letter page E. So, I need something like:

JumpToPage(RemoveDiacritics(Left(cmpname,1)));

The application is written for .NET 1.1 so I cannot use the new Whidbey feature (although it's nice to know of its existence)!

I'll have a go at the FoldString API and see how far I can get. This would actually be the first time I'd use p/invoke to call a Win32 DLL :-)

Thanks for answering my question Michael!

PS: If someone reading this blog already has a .NET 1.1 compliant RemoveDiacritics function it would be nice to have it posted here...

# Michael Kaplan on 20 Feb 2005 1:23 AM:

Well, you will want to be careful -- since for some people those "letters with diacritics" are actually letters in their own right....

# Johan Petersson on 20 Feb 2005 10:51 AM:

Thanks for explaining Jochen. There are company names starting with non-letters too (like digits or @), which might be a problem. Basing the page index on first characters actually in use for company names might be an alternative.

# Alejandro Lapeyre on 17 Mar 2005 12:07 AM:

I would use two strings:

string1="áéíóúñ"
string2="aeioun"

and replace the occurrences of the characters in the first string with the characters of the second string.
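A minimal sketch of that two-string idea, assuming the input is precomposed (a decomposed "e" followed by a combining accent would pass through untouched). The class and method names are hypothetical, and the table covers only the six characters listed above.

```csharp
using System;
using System.Text;

static class TableStrip {
    // The two parallel strings from the comment: position i in Accented
    // maps to position i in Plain.
    const string Accented = "áéíóúñ";
    const string Plain    = "aeioun";

    public static string Replace(string input) {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input) {
            int i = Accented.IndexOf(c);        // ordinal search for one char
            sb.Append(i >= 0 ? Plain[i] : c);   // unknown characters are kept
        }
        return sb.ToString();
    }
}
```

The obvious cost is that every character to be handled has to be listed by hand.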

# Evan Stein on 17 Mar 2005 12:28 PM:

First, there are hugely good reasons for stripping diacritics, mainly for the purpose of searching in plain ASCII. This is lossy stuff, but when you're going down to seven bits, that's the idea. The best method is to keep the full string for display and manipulation, and to maintain a search column alongside the original.

I have a function (actually, a class) that does this in the current C#. It's about 1000 lines, and a bit inconvenient to put in a blog, but if someone can tell me where to post the thing (or sends me a mail), I'd be delighted to share the code.

Here's how I did it (I’m in verbose mode here -- apologies):

1) Went to the Unicode database, which is actually a series of text files. Got the code ranges for the Latin set [Scripts.txt]. Got the character values from UnicodeData.txt. There are about 933 characters that qualify as Latin, in all its extensions and modifications.

2) Field 6 of UnicodeData.txt (starting from 1) has a decomposition map, and can be recursed until no more decomposition is possible. Wrote a routine to do this and write the results to a file.

3) This took care of the vast majority of values, except some of the IPA characters and the really outlandish ones, such as Anglo-Saxon characters descended from runes. There were about 100 cases where the Unicode Consortium played it safe and didn't suggest any decomposition values, and I don't blame them at all. Armed with the descriptions in UnicodeData.txt ("open, upside-down, backwards, small capital O with a ring, 2 tattoos and a piercing"), as well as a PDF showing what the characters look like, I did my best. I'd like to repeat that some of these are pretty outlandish, in case you skipped when I said that the first 4 times. I'll be shocked if anyone even notices what the choices were.

4) Originally I held the Unicode database in Oracle, since putting data into program code goes so much against the grain. I also did the recursion at runtime. But, for speed and distribution, you can't beat code, and you also can't beat pre-finished data. I got the finished data into a file, performed step 3, did some word processing ... and dropped it into C#.

5) Created a class with a static constructor, which loads the data only once. The data is in the form of a Hashtable, so that the Unicode character itself is the key to the fully normalized ASCII character. The speed seems reasonably good.

6) Wrote the StringStrip( ) function, which copies an input string to an output string character by character in a "for loop". If it encounters a character it doesn't have (i.e., non-Latin or punctuation), it simply copies that character to the output string without altering it. The one caveat is that some of the characters are digraphs (e.g., "Dz"), so if you're in a tight spot you'll have to measure the string you get back before using it.
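Steps 5 and 6 might look roughly like the sketch below. This is not Evan's actual class: the four table entries are illustrative stand-ins for the roughly 933 he describes, a generic Dictionary is swapped in for the Hashtable, and the name LatinFolder is hypothetical. Only StringStrip comes from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class LatinFolder {
    // Values are strings rather than chars because some entries are digraphs.
    static readonly Dictionary<char, string> Map;

    // Static constructor: the table is built once, on first use.
    static LatinFolder() {
        Map = new Dictionary<char, string>();
        Map['\u00E9'] = "e";   // LATIN SMALL LETTER E WITH ACUTE
        Map['\u00F1'] = "n";   // LATIN SMALL LETTER N WITH TILDE
        Map['\u01F2'] = "Dz";  // LATIN CAPITAL LETTER D WITH SMALL LETTER Z
        Map['\u00F0'] = "d";   // LATIN SMALL LETTER ETH: a hand-picked choice,
                               // Unicode defines no decomposition for it
    }

    public static string StringStrip(string input) {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input) {
            string mapped;
            if (Map.TryGetValue(c, out mapped)) {
                sb.Append(mapped);
            } else {
                sb.Append(c);  // not in the table: copy unchanged
            }
        }
        return sb.ToString();
    }
}
```

Because of the digraph entries the result can be longer than the input, which is the length caveat mentioned in step 6.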

The ink is still wet, but reading this blog spurred me to get it done. As I said, I'd be delighted to share this, especially if I get some tips back on my ham-handed coding. If you like, you can reach me at Evan@travelogues/DOT/net.

Regards,
Evan

# Francois Beauchemin on 7 Apr 2005 1:59 PM:

Hi,

I'm trying to strip diacritics with the FoldString API. My code seems OK, but the regex does not work with the Mn category, only with the Sk category. What am I doing wrong?

[Flags]
private enum MapFlags
{
    MAP_FOLDCZONE   = 0x00000010, // fold compatibility zone chars
    MAP_PRECOMPOSED = 0x00000020, // convert to precomposed chars
    MAP_COMPOSITE   = 0x00000040, // convert to composite chars
    MAP_FOLDDIGITS  = 0x00000080  // all digits to ASCII 0-9
}

[DllImport("kernel32.dll", SetLastError=true)]
static extern int FoldString(MapFlags dwMapFlags, string lpSrcStr, int cchSrc,
    [Out] StringBuilder lpDestStr, int cchDest);

public static string RemoveDiacritics(string stIn) {
    StringBuilder sb = new StringBuilder();
    int ret = FoldString(MapFlags.MAP_COMPOSITE, stIn, stIn.Length, sb, stIn.Length * 2);
    return Regex.Replace(sb.ToString(), @"\p{Sk}", "");
}

# Michael S. Kaplan on 7 Apr 2005 2:51 PM:

I do not know what characters you are referring to, but you can see the code -- it is only stripping UnicodeCategory.NonSpacingMark -- you would have to strip the other category too if you wanted it gone....

# Francois Beauchemin on 7 Apr 2005 3:47 PM:

My problem is that FoldString with MAP_COMPOSITE returns a string with UnicodeCategory.ModifierSymbol instead of NonSpacingMark.

For example the character û (0x00FB) is expanded to 0x0015 0x005E instead of 0x0015 0x0302

Anyway I think it's OK for my case (removing accented chars from French country names).

# Michael S. Kaplan on 7 Apr 2005 7:28 PM:

So you could modify the code to look for both UnicodeCategory.ModifierSymbol and UnicodeCategory.NonSpacingMark, rather than just UnicodeCategory.NonSpacingMark as it does now, right?

FoldString is of course not based on normalization, as I explain at http://blogs.msdn.com/michkap/archive/2005/01/31/363701.aspx . :-)
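That modification could be sketched like this, assuming the string has already been decomposed by whatever means is available (FoldString, Normalize, or something else). The names MarkFilter and StripMarks are hypothetical.

```csharp
using System;
using System.Globalization;
using System.Text;

static class MarkFilter {
    // Drops both non-spacing marks (Mn, e.g. U+0302) and modifier
    // symbols (Sk, e.g. U+005E) from an already-decomposed string.
    public static string StripMarks(string decomposed) {
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed) {
            UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(c);
            if (uc != UnicodeCategory.NonSpacingMark &&
                uc != UnicodeCategory.ModifierSymbol) {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

One thing to watch: a "^" or "`" that the user actually typed is also a ModifierSymbol, so it would be stripped along with the ones produced by the fold.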

# Quan Nguyen on 15 Apr 2005 2:17 AM:

Until Whidbey, another way to strip diacritics is after performing the NFD on the input string (decompose), use RegEx to delete all the Combining Diacritical Marks, such as:

Normalizer decomposer = new Normalizer(Normalizer.D, false);
string result = decomposer.normalize(inputString);
result = Regex.Replace(result, "\\p{IsCombiningDiacriticalMarks}+", "");

This is how it is done in VietPad.NET (http://vietpad.sf.net).

# Michael S. Kaplan on 15 Apr 2005 2:35 AM:

The only thing I know of called "Normalizer" out of Microsoft is the internal name for the Microsoft Access wizard that is officially called the Table Analyzer. It has nothing to do with Unicode normalization.

Since there is no class that will do normalization in .NET until Whidbey, I am not sure where this code would work....

# Quan Nguyen on 19 Apr 2005 11:32 PM:

I'm sorry, Normalizer is one of Unicode (ICU) Java classes that I ported to C#. It performs Unicode Normalization Forms like those that are going to be supported in Whidbey.

The point is you can strip the diacritics simply by deleting them using Regexp, rather than checking the UnicodeCategory of every character.

# Jean L.N. Hofste' on 17 Oct 2007 6:47 AM:

Stripping diacritics is tantamount to MURDER.

It is based on false economics and laziness. Wish you tried to understand mr. Piël being stripped of his diacritic (in Dutch).

# WALDO on 10 Dec 2007 4:11 PM:

Your sample code fails to strip a diacritic mark in the second block. The fifth character should convert to a 'd', but remains as is. Is there something I'm missing?

P.S. - "Stripping diacritics is tantamount to MURDER."

Stripping diacritics is necessary when developing a URL structure based on user input. For example http://foo.bar/mrPiël/ is an ugly URL, but http://foo.bar/mrPiel/ is much friendlier.

# Michael S. Kaplan on 10 Dec 2007 5:58 PM:

I have no idea what "second block" you are referring to, Waldo.

Though you may want to look at a few of the trackbacks? And the updated code as the post mentions?

# WALDO on 11 Dec 2007 9:47 AM:

Even the updated code fails to change that character.

# Michael S. Kaplan on 11 Dec 2007 9:58 AM:

WALDO....  *what* character? *What* second block? Please provide the repro as I still have no idea what you are talking about.

# WALDO on 18 Dec 2007 11:22 AM:

c:\temp\samples>remove âãäåçèéêë ìíîïðñòó ôõöùúûüý

aaaaceeee

iiiidnoo    <-- second block, fifth character

ooouuuuy

# Michael S. Kaplan on 18 Dec 2007 11:33 AM:

Sorry, WALDO -- that is by design.

LATIN SMALL LETTER ETH does not decompose to LATIN SMALL LETTER D.

It never has and homegrown "ASCIIFICATIONS" are something you are on your own with (the given solution is based on the Unicode Standard's own published composition/decomposition mappings).
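A quick way to check this kind of thing for any character, sketched with a hypothetical helper (EthCheck and DecomposesUnderFormD are made-up names): a character has a canonical decomposition exactly when normalizing it to Form D changes the string.

```csharp
using System;
using System.Text;

static class EthCheck {
    // True when Form D normalization rewrites the character, i.e. when
    // Unicode defines a canonical decomposition for it.
    public static bool DecomposesUnderFormD(char c) {
        string s = c.ToString();
        return s.Normalize(NormalizationForm.FormD) != s;
    }
}
```

On a framework that conforms to the Unicode mappings, this returns false for U+00F0 (LATIN SMALL LETTER ETH) and true for U+00E9 (LATIN SMALL LETTER E WITH ACUTE).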

# Michael S. Kaplan on 18 Dec 2007 12:42 PM:

FYI -- The initial sample was run against an older version of the framework that had an incorrect mapping -- the problem was fixed and we now conform to Unicode here in .NET....

# WALDO on 18 Dec 2007 2:07 PM:

If that is by design, then cool. It's just that the posting produced something different than what the framework did. I just wanted you to be aware of that.

I have no concerns about fixing that particular character. I wasn't even convinced that there was a translation for that character. I just wanted to be sure I was actually getting what I was expecting. My concern was whether I was doing something wrong, or should I change my expectation to differ from the sample provided. It turns out the latter.

# ojejej on 30 Jun 2008 8:15 AM:

There is a tool, wReplace, which removes diacritics:

http://wwidgets.com/us_wReplace.html

There is also a replacement table available inside, so you can test your solutions.

# Dilip on 22 Jan 2009 6:22 AM:

Thanks. This helped a lot.

My search would have been much easier if a description like European special character alphabets was included.

# Jean L.N. Hofsté on 11 Feb 2010 11:33 AM:

Stripping diacritics from Names and Surnames to use the result on the Internet or in correspondence is UNLAWFUL!

In Europe every European is entitled BY EUROPEAN LAW (and in Holland for instance by: Dutch Law from 1993) to have their Name spelled correctly.

EU law is being formed to force Hardware and Software suppliers to abide.


referenced by

2009/05/27 The whole truth about MB_PRECOMPOSED and MB_COMPOSITE

2007/09/04 I am not a nudist, but I do support stripping when it is appropriate, part 1

2007/08/17 Normalize Wide Shut

2007/05/14 Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

2007/03/04 The non-ASCII solution to the .NET Unicode Puzzle

2006/09/22 Those letters are stripping off their diacritics in public again, the sluts!

2005/08/01 Stripping out diacritics, redux
