Stripping is an interesting job (aka On the meaning of meaningless, aka All Mn characters are non-spacing, but some are more non-spacing than others)

by Michael S. Kaplan, published on 2007/05/14 13:51 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/05/14/2629747.aspx


(apologies to George Orwell, of course!) 

Val asks:

Michael,

I've been reading your "Stripping Diacritics" post, and it's been a great help. I've also been comparing it with another version I've seen. This other version is similar to yours, except that it breaks up the original string into characters, then normalizes the characters individually.

I threw a few languages at the two functions. I found yours handled Vietnamese while the other one did not.

However, I have a problem where I don't know the code page of the string before I feed it to the function. It may be a Far East language, it may be a European language, or it may be a Far East + European language. Furthermore, it may be a katakana or hiragana character set (I think I got those right).

Your function corrupted Japanese, Chinese, and especially Korean. The other function did not. However, I'm hesitant to use the other function because of the problems it has with Vietnamese (which is a Latin character set, so it worries me that it's hiding other issues with other languages).

Is there a way to modify your function to be friendlier to languages that don't have diacritics? Right now, I've modified it so that if it finds a non-Latin character (based on Unicode value ranges), it aborts the whole processing and returns the original string, but this obviously can't handle strings that have multiple character sets.

FYI: the problem I've seen with yours is that if I run JP through it, it strips out things that LOOK like accents but are not. For example, one of the JP characters looks like a reverse V with a small circle on the top right. After your function, the small circle on the top right is missing. With Korean, it seemed like it inserted blanks between each character after we ran it through the Normalize() function.

It's funny, I never know when I get questions like this one if people fully understand what they are asking for (no offense to Val!).

And I am not helped in this case by the fact that Val is the name of both

The latter being a stripper is interesting given that this new Val is asking me about stripping (albeit an entirely different kind of stripping!). The fact that Valerie found things about programming to be interesting only further clouds the situation.

I have literally no sense of the person asking the question and am somewhat certain that neither of my previous Val experiences will properly guide me in this respect. :-)

I'll start with the Japanese case, which really is not a bug, or a shortcoming in the code.

After all, to me the conversion from å to a is a normal "diacritic stripping" operation, while a Swedish user would probably throw something heavy at my head were it not for the fact that all the Swedes I have met are so bleeding polite.

I don't find that Swede's reaction to be any more or less reasonable than the Japanese user's suggestion (mirroring Val's) that the change from ピ to ヒ is "corrupting Japanese text."

In both cases, you can see why it happens:

U+00e5 (å) decomposes to U+0061 U+030a, and U+030a is Mn (Mark, Nonspacing), which the code is designed to strip.

U+30d4 (ピ) decomposes to U+30d2 U+309a, and U+309a is Mn (Mark, Nonspacing), which the code is designed to strip.

So in essence what Val is asking for here is a way to get at the sentiment expressed in the title: that all Mn characters are non-spacing, but some are more non-spacing than others. That really does not exist in any kind of automated sense.
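A quick way to see both decompositions side by side is to print each character of the Form D text with its Unicode category. The following is only a minimal sketch; the class name MarkInspector and the helper Describe are made up for illustration and are not part of the original code.

```csharp
using System;
using System.Globalization;
using System.Text;

static class MarkInspector {
    // Hypothetical helper: lists "U+xxxx:Category" for every UTF-16 code unit
    // of the canonically decomposed (Form D) version of the input.
    public static string Describe(string s) {
        string formD = s.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder();
        foreach (char c in formD) {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("x4"));
            sb.Append(':').Append(CharUnicodeInfo.GetUnicodeCategory(c));
        }
        return sb.ToString();
    }

    static void Main() {
        Console.WriteLine(Describe("\u00e5")); // U+0061:LowercaseLetter U+030a:NonSpacingMark
        Console.WriteLine(Describe("\u30d4")); // U+30d2:OtherLetter U+309a:NonSpacingMark
    }
}
```

Both marks report the same category, which is all the stripping code ever looks at.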

Now the Korean problem is a bit easier to describe and deal with. After all, U+d3fc (폼) decomposes to U+1111 U+1169 U+11b7, and although none of them are non-spacing, if you don't have a font that can handle the conjoining Jamo (e.g. Gulim Old Hangul, the font I have on this machine), it will look like three separate Jamo rather than one single modern Hangul syllable.

Of course, the Korean problem goes away with just a minor change to the code I first presented in Stripping diacritics...., namely taking the stripped, decomposed text and recomposing it by converting it to Unicode Normalization Form C. The new code would look something like this:

namespace Remove {
using System;
using System.Text;
using System.Globalization;
  class Remove {
    [STAThread]
    static void Main(string[] args) {
      // Strip each command-line argument and print the result.
      foreach(string st in args) {
        Console.WriteLine(RemoveDiacritics(st));
      }
    }

    static string RemoveDiacritics(string stIn) {
      // Decompose, so that each diacritic becomes its own combining character.
      string stFormD = stIn.Normalize(NormalizationForm.FormD);
      StringBuilder sb = new StringBuilder();

      // Keep everything except the Mn (Mark, Nonspacing) characters.
      for(int ich = 0; ich < stFormD.Length; ich++) {
        UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]);
        if(uc != UnicodeCategory.NonSpacingMark) {
          sb.Append(stFormD[ich]);
        }
      }

      // Recompose what is left, so conjoining Jamo become Hangul syllables again.
      return(sb.ToString().Normalize(NormalizationForm.FormC));
    }
  }
}
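To see what a Form D to Form C round trip does for the Korean example, here is a small sketch that prints the code points at each stage. The class name HangulRoundTrip and the helper CodePoints are hypothetical, chosen just for this illustration.

```csharp
using System;
using System.Linq;
using System.Text;

static class HangulRoundTrip {
    // Hypothetical helper: formats each UTF-16 code unit as U+xxxx.
    public static string CodePoints(string s) {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("x4")));
    }

    static void Main() {
        string syllable = "\ud3fc";
        // Form D splits the syllable into its three conjoining Jamo.
        string decomposed = syllable.Normalize(NormalizationForm.FormD);
        // Form C puts the Jamo back together into a single syllable.
        string recomposed = decomposed.Normalize(NormalizationForm.FormC);

        Console.WriteLine(CodePoints(decomposed)); // U+1111 U+1169 U+11b7
        Console.WriteLine(CodePoints(recomposed)); // U+d3fc
    }
}
```

None of the three Jamo is Mn, so the stripping loop leaves them alone and the recomposition gets the original syllable back.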

I don't know what in particular in the Chinese text was going wrong, and without an example it is hard to say; perhaps the new code will resolve the problem.

But the first case (the one with Japanese) is obviously the more interesting one in terms of functionality -- the request being for a way to strip diacritics that are considered to be meaningless does demand that a bit of rigor be applied to the meaning of the term meaningless, which could be linguistically derived from the meaning of the term corrupted (by assuming that anything that is called corrupted would have been meaningful had corruption not taken place!).
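If a caller decides for themselves that "meaningless" means "attached to a Latin letter," that policy can at least be written down explicitly. The sketch below is one such assumption and nothing more: the class LatinOnlyStripper, the method RemoveLatinDiacritics, and the hand-picked code point ranges in IsLatinLetter are all hypothetical, and .NET has no built-in script lookup to replace them. A Swedish user would still object to what it does to å.

```csharp
using System;
using System.Globalization;
using System.Text;

static class LatinOnlyStripper {
    // Assumption: "Latin" is approximated by letters below U+0250 plus the
    // Latin Extended Additional block (U+1e00..U+1eff). This is not a real
    // script property lookup.
    static bool IsLatinLetter(char c) {
        return char.IsLetter(c) && (c < '\u0250' || (c >= '\u1e00' && c <= '\u1eff'));
    }

    public static string RemoveLatinDiacritics(string stIn) {
        string stFormD = stIn.Normalize(NormalizationForm.FormD);
        var sb = new StringBuilder(stFormD.Length);
        bool baseIsLatin = false; // describes the most recent non-mark character

        foreach (char ch in stFormD) {
            bool isMark = CharUnicodeInfo.GetUnicodeCategory(ch)
                          == UnicodeCategory.NonSpacingMark;
            if (isMark) {
                if (baseIsLatin) continue; // drop marks that follow a Latin base
            } else {
                baseIsLatin = IsLatinLetter(ch);
            }
            sb.Append(ch); // keep base characters and marks on non-Latin bases
        }
        return sb.ToString().Normalize(NormalizationForm.FormC);
    }

    static void Main() {
        Console.WriteLine(RemoveLatinDiacritics("\u00e5") == "a");      // True
        Console.WriteLine(RemoveLatinDiacritics("\u30d4") == "\u30d4"); // True
        Console.WriteLine(RemoveLatinDiacritics("\u1ec7") == "e");      // True
    }
}
```

The rule is still just a stand-in for somebody's opinion about which marks matter; it moves the definition of meaningless into IsLatinLetter rather than answering it.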

 

This post brought to you by ピ (U+30d4, a.k.a. KATAKANA LETTER PI)


# Dean Harding on 14 May 2007 6:50 PM:

The ultimate solution would be to build up your own table of "non-non-spacing marks." Bonus points if you make it locale-specific (so å would be in the English one but not the Swedish). Of course, this may be a case of overengineering ;)

# Peter Karlsson on 8 Jun 2007 3:21 AM:

Of course, in the Swedish case, a “proper” RemoveDiacritics("wüéåäöæøà") should produce “vyéåäöäöa”. But I guess implementing that could be a bit more difficult…

# Michael S. Kaplan on 8 Jun 2007 10:37 AM:

Hey Peter -- very good point. I'll have to talk about this soon....

# Steven Sudit on 19 May 2008 11:34 AM:

For another approach to normalization, take a look at: http://www.codeproject.com/KB/cs/UnicodeNormalization.aspx

I'm not sure if it's faster or slower than the method mentioned in this blog, but it is smart enough to remove diacritics only when they're attached to Latin characters, which leaves such things as Han radicals alone.

# Jan Kučera on 28 Oct 2008 8:38 PM:

Any way for the .NET Compact Framework people out there? (as the normalization does not seem to be available for them...)

# mkukik on 31 Oct 2008 5:38 AM:

In case of stripping diacritics from Latin characters:

Combining the above method with converting Unicode to a Western code page strips even more diacritics... like Polish Łł

However it is still not 100%

string unicodeStringOrig = "SE:ÅåÄäÖö; PL:ĄąĆćĘęŁłŃńÓóŚśŹźŻż; SK:ľščťžýáíéúäôňďĽŠČŤŽÝÁÍÉÚÄÔŇĎ; HU:ëőüűŐÜŰ; ES:Ññ¿; CA:àèòçï";
string unicodeString = RemoveDiacritics(unicodeStringOrig);
Encoding nonunicode = Encoding.GetEncoding(850);
Encoding unicode = Encoding.Unicode;
byte[] unicodeBytes = unicode.GetBytes(unicodeString);
byte[] nonunicodeBytes = Encoding.Convert(unicode, nonunicode, unicodeBytes);
char[] nonunicodeChars = new char[nonunicode.GetCharCount(nonunicodeBytes, 0, nonunicodeBytes.Length)];
nonunicode.GetChars(nonunicodeBytes, 0, nonunicodeBytes.Length, nonunicodeChars, 0);
string nonunicodeString = new string(nonunicodeChars);

# Amino on 17 Dec 2009 1:54 AM:

Is it possible to work around it, meaning adding diacritics?

# James White on 23 Jul 2010 8:16 AM:

Or, with roughly the same readability, as a LINQ extension method:

    public static string RemoveDiacritics(this string stIn)
    {
        var sb = new StringBuilder();
        sb.Append(
            stIn.Normalize(NormalizationForm.FormD)
                .Where(c => CharUnicodeInfo.GetUnicodeCategory(c)
                            != UnicodeCategory.NonSpacingMark)
                .ToArray());
        return (sb.ToString().Normalize(NormalizationForm.FormC));
    }

# Gökhan Berberoglu on 16 Mar 2011 6:42 AM:

Thanks for the code.

However, this is not enough to save humanity from a problematic Turkish character: i without a dot.

The char somehow does not decompose. I found bliss in .Replace("ı", "i");

BTW, English and other Latin languages are wrong! You should have a dot in uppercase i (that would have solved our ToLowerCase("I") problems :)


referenced by

2015/09/14 A few sorta random things I recall about some of my past memorable, unorthodox relationships, part #1

2009/05/27 The whole truth about MB_PRECOMPOSED and MB_COMPOSITE

2008/10/29 Wonder why something is not in the Compact Framework? The answer is in the question!

2007/09/04 I am not a nudist, but I do support stripping when it is appropriate, part 1

2007/08/17 Normalize Wide Shut

2007/06/08 año del ano, a.k.a. This sentence has several non-skarklish Spanish flutzpahs....

2005/02/19 Stripping diacritics....
