Stripping diacritics....

In what context would such a transformation ever be useful? I can understand the need for lossy remapping, e.g. from a Unicode encoding to ASCII, but not blindly stripping diacritics.

Sorry if this is a stupid question; I'm just curious.

Not stupid at all -- in most contexts you would probably be correct. And as I have said previously (cf: http://blogs.msdn.com/michkap/archive/2005/02/05/367666.aspx) sometimes doing so would be destructive of language content.

This was kind of code to spec, based on a question. I had a moment to kill and I realized it would show off new features, so.... :-)

I've a webform that displays a company list with A B C ... Z paging instead of 1 2 3 ... N. After adding a new company name, using the admin interface, I would like to jump to the corresponding letter page. Suppose someone enters "één" as company name then I'd have to jump to letter page E. So, I need something like:

JumpToPage(RemoveDiacritics(Left(cmpname,1)));

The application is written for .NET 1.1 so I cannot use the new Whidbey feature (although it's nice to know of its existence)!

I'll have a go at the FoldString API and see how far I can get. This would actually be the first time I'd use p/invoke to call a Win32 DLL :-)

Thanks for answering my question Michael!

PS: If someone reading this blog already has a .NET 1.1 compliant RemoveDiacritics function it would be nice to have it posted here...

Well, you will want to be careful -- since for some people those "letters with diacritics" are actually letters in their own right....

Thanks for explaining Jochen. There are company names starting with non-letters too (like digits or @), which might be a problem. Basing the page index on first characters actually in use for company names might be an alternative.

I would use two strings:

string1="áéíóúñ"
string2="aeioun"

and replace the occurrences of the characters in the first string with the characters of the second string.
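A minimal sketch of that two-string idea, for illustration only. The class and method names (AccentMap, Replace) are made up here, and the table covers just the six lowercase characters listed above, so anything else (uppercase letters, other accents, decomposed sequences) passes through unchanged:

```csharp
using System.Text;

public static class AccentMap
{
    // The two parallel strings from the comment above:
    // Accented[i] is replaced by Plain[i].
    private const string Accented = "áéíóúñ";
    private const string Plain = "aeioun";

    // Hypothetical helper: maps each listed character to its plain
    // counterpart and copies every other character as-is.
    public static string Replace(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            int i = Accented.IndexOf(c);   // ordinal search, -1 if absent
            sb.Append(i >= 0 ? Plain[i] : c);
        }
        return sb.ToString();
    }
}
```

This is easy to follow, but the table has to be extended by hand for every character you care about, which is the main weakness of the approach.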

First, there are hugely good reasons for stripping diacritics, mainly for the purpose of searching in plain ASCII. This is lossy stuff, but when you're going down to seven bits, that's the idea. The best method is to keep the full string for display and manipulation, and to maintain a search column alongside the original.

I have a function (actually, a class) that does this in the current C#. It's about 1000 lines, and a bit inconvenient to put in a blog, but if someone can tell me where to post the thing (or sends me a mail), I'd be delighted to share the code.

Here's how I did it (I’m in verbose mode here -- apologies):

1) Went to the Unicode database, which is actually a series of text files. Got the code ranges for the Latin set [Scripts.txt]. Got the character values from UnicodeData.txt. There are about 933 characters that qualify as Latin, in all its extensions and modifications.

2) Field 6 of UnicodeData.txt (starting from 1) has a decomposition map, and can be recursed until no more decomposition is possible. Wrote a routine to do this and write the results to a file.

3) This took care of the vast majority of values, except some of the IPA characters and the really outlandish ones, such as Anglo-Saxon characters descended from runes. There were about 100 cases where the Unicode Consortium played it safe and didn't suggest any decomposition values, and I don't blame them at all. Armed with the descriptions in UnicodeData.txt ("open, upside-down, backwards, small capital O with a ring, 2 tattoos and a piercing"), as well as a PDF showing what the characters look like, I did my best. I'd like to repeat that some of these are pretty outlandish, in case you skipped when I said that the first 4 times. I'll be shocked if anyone even notices what the choices were.

4) Originally I held the Unicode database in Oracle, since putting data into program code goes so much against the grain. I also did the recursion at runtime. But, for speed and distribution, you can't beat code, and you also can't beat pre-finished data. I got the finished data into a file, performed step 3, did some word processing ... and dropped it into C#.

5) Created a class with a static constructor, which loads the data only once. The data is in the form of a Hashtable, so that the Unicode character itself is the key to the fully normalized ASCII character. The speed seems reasonably good.

6) Wrote the StringStrip( ) function, which copies an input string to an output string character by character in a "for loop". If it encounters a character it doesn't have (i.e., non-Latin or punctuation), it simply copies that character to the output string without altering it. The one caveat is that some of the characters are digraphs (e.g., "Dz"), so if you're in a tight spot you'll have to measure the string you get back before using it.
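Steps 5 and 6 might look roughly like the sketch below. This is not Evan's class, which is not shown in this thread. The class name LatinFolder and the five table entries are illustrative assumptions, and it uses the generic Dictionary (.NET 2.0 and later) where the description above mentions a Hashtable:

```csharp
using System.Collections.Generic;
using System.Text;

public static class LatinFolder
{
    // Lookup table: Unicode character -> ASCII replacement.
    private static readonly Dictionary<char, string> Map;

    // The static constructor runs once, before first use of the class.
    static LatinFolder()
    {
        Map = new Dictionary<char, string>();
        // Tiny illustrative sample; a real table would hold
        // hundreds of entries generated from UnicodeData.txt.
        Map['\u00E9'] = "e";   // é
        Map['\u00F1'] = "n";   // ñ
        Map['\u00FB'] = "u";   // û
        Map['\u01F2'] = "Dz";  // digraph: one character in, two out
        Map['\u00E6'] = "ae";  // æ has no decomposition; hand-picked value
    }

    // Copies the input character by character, substituting
    // any character found in the table and keeping the rest.
    public static string StringStrip(string input)
    {
        StringBuilder output = new StringBuilder(input.Length);
        for (int i = 0; i < input.Length; i++)
        {
            string replacement;
            if (Map.TryGetValue(input[i], out replacement))
                output.Append(replacement);
            else
                output.Append(input[i]);
        }
        return output.ToString();
    }
}
```

The digraph entry shows the caveat from step 6: the result can be longer than the input, so a caller with a fixed-width column needs to check the length of what comes back.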

The ink is still wet, but reading this blog spurred me to get it done. As I said, I'd be delighted to share this, especially if I get some tips back on my ham-handed coding. If you like, you can reach me at Evan@travelogues/DOT/net.

Regards,
Evan

Hi,

I'm trying to strip diacritics with the FoldString API. My code seems OK, but the regex only works with the Sk category, not with the Mn category. What am I doing wrong?

[Flags]
private enum MapFlags
{
    MAP_FOLDCZONE   = 0x00000010, // fold compatibility zone chars
    MAP_PRECOMPOSED = 0x00000020, // convert to precomposed chars
    MAP_COMPOSITE   = 0x00000040, // convert to composite chars
    MAP_FOLDDIGITS  = 0x00000080  // all digits to ASCII 0-9
}

[DllImport("kernel32.dll", SetLastError=true)]
static extern int FoldString(MapFlags dwMapFlags, string lpSrcStr, int cchSrc,
    [Out] StringBuilder lpDestStr, int cchDest);

public static string RemoveDiacritics(string stIn) {
    // Size the buffer to match the cchDest value passed to FoldString.
    StringBuilder sb = new StringBuilder(stIn.Length * 2);
    int ret = FoldString(MapFlags.MAP_COMPOSITE, stIn, stIn.Length, sb, stIn.Length * 2);
    return Regex.Replace(sb.ToString(), @"\p{Sk}", "");
}

I do not know what characters you are referring to, but you can see the code -- it is only stripping UnicodeCategory.NonSpacingMark -- you would have to strip the other category too if you wanted it gone....

My problem is that FoldString with MAP_COMPOSITE returns a string with UnicodeCategory.ModifierSymbol instead of NonSpacingMark.

For example, the character û (0x00FB) is expanded to 0x0075 0x005E instead of 0x0075 0x0302.

Anyway, I think it's OK for my case (removing accented characters from French country names).

So you could modify the code to look for both UnicodeCategory.ModifierSymbol and UnicodeCategory.NonSpacingMark, rather than just UnicodeCategory.NonSpacingMark as it does now, right?
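As a rough sketch of that suggestion (the class and method names here are made up, and the input is assumed to be already decomposed by whatever means), the filter could skip both categories:

```csharp
using System.Globalization;
using System.Text;

public static class MarkStripper
{
    // Hypothetical helper: drops every character whose category is
    // NonSpacingMark (Mn) or ModifierSymbol (Sk), keeps the rest.
    // Assumes the caller has already decomposed the string.
    public static string StripMarks(string decomposed)
    {
        StringBuilder sb = new StringBuilder(decomposed.Length);
        foreach (char c in decomposed)
        {
            UnicodeCategory cat = char.GetUnicodeCategory(c);
            if (cat != UnicodeCategory.NonSpacingMark &&
                cat != UnicodeCategory.ModifierSymbol)
            {
                sb.Append(c);
            }
        }
        return sb.ToString();
    }
}
```

One thing to watch: ModifierSymbol includes ordinary ASCII characters such as ^ (0x005E) and ` (0x0060), so this also removes them when they were typed deliberately in the original text.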

FoldString is of course not based on normalization, as I explain at http://blogs.msdn.com/michkap/archive/2005/01/31/363701.aspx . :-)

Until Whidbey, another way to strip diacritics is to normalize the input string to NFD (decompose it) and then use Regex to delete all the Combining Diacritical Marks, for example:

Normalizer decomposer = new Normalizer(Normalizer.D, false);
string result = decomposer.normalize(inputString);
result = Regex.Replace(result, "\\p{IsCombiningDiacriticalMarks}+", "");

This is how it is done in VietPad.NET (http://vietpad.sf.net).

The only thing I know of called "Normalizer" out of Microsoft is the internal name for the Microsoft Access wizard that is officially called the Table Analyzer. It has nothing to do with Unicode normalization.

Since there is no class that will do normalization in .NET until Whidbey, I am not sure where this code would work....

I'm sorry, Normalizer is one of the Unicode (ICU) Java classes that I ported to C#. It performs Unicode Normalization Forms like those that are going to be supported in Whidbey.

The point is you can strip the diacritics simply by deleting them using Regexp, rather than checking the UnicodeCategory of every character.
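For readers without the ported ICU class, here is a hedged equivalent of the same regex idea. It swaps in the built-in string.Normalize from .NET 2.0 and later (so it does not apply to .NET 1.1), and the class name RegexStripper is made up for this sketch:

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class RegexStripper
{
    // Decompose to NFD, then delete every character in the
    // Combining Diacritical Marks block (U+0300 to U+036F).
    public static string RemoveDiacritics(string input)
    {
        string decomposed = input.Normalize(NormalizationForm.FormD);
        return Regex.Replace(decomposed, @"\p{IsCombiningDiacriticalMarks}+", "");
    }
}
```

Note that the named block only covers U+0300 to U+036F, so combining marks that live in other blocks are left in place. Checking the Unicode category of each character catches more of them.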
