FoldString.NET? No, but Whidbey has Normalization (which is kinda more cooler)

by Michael S. Kaplan, published on 2005/01/31 05:56 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/31/363701.aspx


Last Friday, Jochen Kalmbach, in response to A little bit about the new CharUnicodeInfo class, asked the following:

By the way: is there some equivalent to FoldString, especially "MAP_PRECOMPOSED" and "MAP_COMPOSITE"? Neither StringInfo nor TextInfo provide such a function, or?

My answer was:

The .NET Framework has something even better than FoldString here -- I'll post on it tomorrow....

But I got busy this weekend and never got around to posting the answer to the question. Sorry about that! I'll do it now (I hope Jochen did not give up on me in the interim!).

The description of FoldString from the Platform SDK: The FoldString function maps one string to another, performing a specified transformation option.

There are many different supported transformations:

MAP_FOLDCZONE          Fold compatibility zone characters into standard Unicode equivalents. For information about compatibility zone characters, see the following Remarks section.

MAP_FOLDDIGITS         Map all digits to Unicode characters 0 through 9.

MAP_PRECOMPOSED        Map accented characters to precomposed characters, in which the accent and base character are combined into a single character value. This value cannot be combined with MAP_COMPOSITE.

MAP_COMPOSITE          Map accented characters to composite characters, in which the accent and base character are represented by two character values. This value cannot be combined with MAP_PRECOMPOSED.

MAP_EXPAND_LIGATURES   Expand all ligature characters so that they are represented by their two-character equivalent. For example, the ligature 'æ' expands to the two characters 'a' and 'e'. This value cannot be combined with MAP_PRECOMPOSED or MAP_COMPOSITE.

Digit folding functionality is covered by the methods I described in CharUnicodeInfo, especially GetDecimalDigitValue. Some of the other methods will do an even fuller job, supporting many of the non-decimal digit numbers, which FoldString never handled....
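To make the digit folding point concrete, here is a minimal sketch of a MAP_FOLDDIGITS-style helper built on CharUnicodeInfo.GetDecimalDigitValue. The FoldDigits name and the DigitFolding class are hypothetical (they are not part of the Framework), and the sketch only looks at one UTF-16 code unit at a time, so digits outside the BMP are left alone:

```csharp
using System;
using System.Globalization;
using System.Text;

static class DigitFolding
{
    // Hypothetical helper, not a Framework API: replaces every decimal digit
    // with its ASCII counterpart and leaves all other characters untouched.
    public static string FoldDigits(string input)
    {
        StringBuilder sb = new StringBuilder(input.Length);
        foreach (char c in input)
        {
            // GetDecimalDigitValue returns -1 when c is not a decimal digit.
            int value = CharUnicodeInfo.GetDecimalDigitValue(c);
            sb.Append(value >= 0 ? (char)('0' + value) : c);
        }
        return sb.ToString();
    }

    static void Main()
    {
        // ARABIC-INDIC DIGITS ONE, TWO, THREE followed by FULLWIDTH DIGIT FOUR
        Console.WriteLine(FoldDigits("\u0661\u0662\u0663\uFF14")); // 1234
    }
}
```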

There is no direct equivalent of the ligature expansion right now, though ligatures are handled well in comparisons whenever they need to be.

But the other three mapping types see new life in Whidbey, with tables that cover the Unicode 4.0 version of normalization, as described in UAX #15, UNICODE NORMALIZATION FORMS.

How does it work? Well, in the Whidbey release of the .NET Framework, two new methods were added to System.String:

bool IsNormalized(NormalizationForm normalizationForm)

string Normalize(NormalizationForm normalizationForm)

The functionality of the methods is obvious enough from the names -- the first checks if the string is in a specified normalization form, and the second puts it in a specified form.

The enumeration with the forms (NormalizationForm) has four members:

public enum NormalizationForm
{
    FormC    = 1,
    FormD    = 2,
    FormKC   = 5,
    FormKD   = 6
}

The normalization forms, which are described much more fully in the UAX#15 spec, have easy analogues to their FoldString counterparts:

FormC      MAP_PRECOMPOSED
FormD      MAP_COMPOSITE
FormKC     MAP_PRECOMPOSED | MAP_FOLDCZONE
FormKD     MAP_COMPOSITE | MAP_FOLDCZONE

In fact the only real difference is that FoldString only does part of the job, because the FoldString tables do not have all of the mappings that are in Unicode, a point I discussed previously. But these normalization methods do. So you can do all the mapping you need to in order to take equivalent forms of the same string and put them into one consistent form.
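As a rough illustration of putting equivalent strings into one consistent form, the sketch below normalizes both sides before an ordinal comparison. EquivalentOrdinal is a hypothetical helper name chosen for this example, not something the Framework provides:

```csharp
using System;
using System.Text;

static class NormalizedCompare
{
    // Hypothetical helper: normalizes both strings to the same form,
    // then compares them code unit by code unit.
    public static bool EquivalentOrdinal(string a, string b, NormalizationForm form)
    {
        return string.Equals(a.Normalize(form), b.Normalize(form), StringComparison.Ordinal);
    }

    static void Main()
    {
        string precomposed = "\u0125";   // LATIN SMALL LETTER H WITH CIRCUMFLEX
        string composite = "h\u0302";    // h + COMBINING CIRCUMFLEX ACCENT

        // Different code units, so a plain ordinal comparison fails.
        Console.WriteLine(string.Equals(precomposed, composite, StringComparison.Ordinal)); // False

        // Canonically equivalent, so either canonical form makes them match.
        Console.WriteLine(EquivalentOrdinal(precomposed, composite, NormalizationForm.FormC)); // True

        // MICRO SIGN vs GREEK SMALL LETTER MU only match under the compatibility (K) forms.
        Console.WriteLine(EquivalentOrdinal("\u00b5", "\u03bc", NormalizationForm.FormC));  // False
        Console.WriteLine(EquivalentOrdinal("\u00b5", "\u03bc", NormalizationForm.FormKC)); // True
    }
}
```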

Since the "default" form used in most situations is Form C, there are also overloads of the two methods with no NormalizationForm parameter that use Form C automatically. In many cases, those are the ones you may want to use. Making Form C the "default" normalization form is not an arbitrary decision -- almost all of the keyboards that ship in Windows input text in Form C already (though of course keyboards created by MSKLC, being user-created, can be in whatever form).
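A quick sketch of the parameterless versions, just to show that they behave like the Form C calls. The DefaultForm class name is only scaffolding for the example:

```csharp
using System;
using System.Text;

static class DefaultForm
{
    static void Main()
    {
        string s = "o\u0303"; // o + COMBINING TILDE, i.e. not in Form C

        Console.WriteLine(s.IsNormalized());                              // False
        string c = s.Normalize();                                         // same as Normalize(NormalizationForm.FormC)
        Console.WriteLine(c == s.Normalize(NormalizationForm.FormC));     // True
        Console.WriteLine(c == "\u00f5");                                 // True: LATIN SMALL LETTER O WITH TILDE
        Console.WriteLine(c.IsNormalized());                              // True
    }
}
```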

Another thing to keep in mind is that text may not be in any of these forms -- for example an arbitrary string like õĥµ¨ (U+00f5 U+0068 U+0302 U+00b5 U+00a8). This string combines a precomposed character, a composite character, and two characters with compatibility decompositions (the MICRO SIGN and the DIAERESIS). It is therefore not in any one form at all. Thus this string would see an IsNormalized return of false for all forms. But it can be normalized to return the appropriate result for each of them:

õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form C  ---> õĥµ¨ (00f5 0125 00b5 00a8)
õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form D  ---> õĥµ¨ (006f 0303 0068 0302 00b5 00a8)
õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form KC --> õĥμ ̈  (00f5 0125 03bc 0020 0308)
õĥµ¨ (00f5 0068 0302 00b5 00a8) --> Form KD --> õĥμ ̈  (006f 0303 0068 0302 03bc 0020 0308)
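If you want to reproduce the table yourself, something like the following should do it. ToHex is a hypothetical helper that dumps UTF-16 code units as four-digit hex; everything else is just the two String methods and the enumeration:

```csharp
using System;
using System.Text;

static class NormalizationTable
{
    // Hypothetical helper: UTF-16 code units as lowercase four-digit hex, space separated.
    public static string ToHex(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(((int)c).ToString("x4"));
        }
        return sb.ToString();
    }

    static void Main()
    {
        // The sample string from the post: 00f5 0068 0302 00b5 00a8
        string sample = "\u00f5\u0068\u0302\u00b5\u00a8";

        NormalizationForm[] forms =
        {
            NormalizationForm.FormC, NormalizationForm.FormD,
            NormalizationForm.FormKC, NormalizationForm.FormKD
        };

        foreach (NormalizationForm form in forms)
        {
            // IsNormalized is expected to be False for every form with this input.
            Console.WriteLine(form + ": IsNormalized=" + sample.IsNormalized(form)
                + " -> " + ToHex(sample.Normalize(form)));
        }
    }
}
```

The hex column should line up with the code points listed in the table above.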

Ideally equivalent strings would always compare as being equal even if the forms are different, but this is definitely not a 100% of the time result, as I pointed out a few months ago when I answered the question Normalization and Microsoft -- whats the story? Normalizing both strings first is therefore the one way to make sure that you will always get the right comparison, especially in cases that may never be fully supported in comparison, like "ﷺ" (U+fdfa, a.k.a. ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM), which has a compatibility decomposition to:

صلى الله عليه وسلم

(0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020 0648 0633 0644 0645)

(Since most fonts do not support U+fdfa, if you can see the string above then it points to at least one time that normalization Form KD helped out for a lot of people!)
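One detail worth seeing in code: the mapping for U+fdfa is a compatibility decomposition, so only the K forms expand it, while the canonical forms leave the ligature as a single character. A small sketch (the class name is just scaffolding):

```csharp
using System;
using System.Text;

static class LigatureDecomposition
{
    static void Main()
    {
        string ligature = "\ufdfa"; // ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM

        // Canonical forms do not touch it.
        Console.WriteLine(ligature.Normalize(NormalizationForm.FormC).Length);  // 1
        Console.WriteLine(ligature.Normalize(NormalizationForm.FormD).Length);  // 1

        // Compatibility forms expand it to the 18 code points listed above.
        string expanded = ligature.Normalize(NormalizationForm.FormKD);
        Console.WriteLine(expanded.Length);                                     // 18
        Console.WriteLine(expanded == ligature.Normalize(NormalizationForm.FormKC)); // True
    }
}
```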

You can also see the Beta documentation for the IsNormalized method, the Normalize method, and the NormalizationForm enumeration.

 

This post brought to you by "ﷻ" (U+fdfb, a.k.a. ARABIC LIGATURE JALLAJALALOUHOU)
A ligature that decomposes to "جل جلاله" or
062c 0644 0020 062c 0644 0627 0644 0647.

 


# Jochen Kalmbach on 31 Jan 2005 10:54 PM:

Thanx! Is this correct that the tables are implemented in the .NET-Framework, and the functions are not switching to unmanaged-mode to do the lookup?

# Michael Kaplan on 31 Jan 2005 10:57 PM:

The managed code is not moving to the Win32 API -- the better results alone speak to that. :-)

# sch on 1 Feb 2005 3:26 AM:

In the "brought to you by" section the link goes to the wrong character. But with a fancy name, I must admit ;) (Is it Arabic, Uighur or Kirghiz? All of them? ;)
Thanks for your postings!

# Michael Kaplan on 1 Feb 2005 4:07 AM:

Sch -- Good catch, I was linking to U+fbfb (ARABIC LIGATURE UIGHUR KIRGHIZ YEH WITH HAMZA ABOVE WITH ALEF MAKSURA INITIAL FORM) rather than U+fdfb.

I assume it must be used in both Uighur and Kirghiz....

# Vorn on 3 Feb 2005 6:59 PM:

Apparently MacOS X comes with U+fdfa in some font or other, but not U+fdfb. Ah well.

Vorn

