Pretending the vowels aren't there

by Michael S. Kaplan, published on 2006/03/20 09:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/20/555320.aspx

The other day, Suzanne was talking about Hebrew searches (it was good that she finally posted, or else people might have started talking about Suzanne searches?).

are at some level the same string and in some cases it is entirely reasonable to want a search to find them both.

and now (wait for it!) with vowels, but this time including the NORM_IGNORENONSPACE flag to ignore the secondary weights:

Ah, there is our answer -- if we have a search capability to optionally ignore the secondary weights, it will find both strings.

This does not exist in Google or Live.com, and although it appears to exist in Word, it does not work:

Hi Ilya....

Well, I think since it has existed in Windows since long before it did in Unicode, that we'll probably stick with our own ways of talking about it. :-)

It might be a useful potential addition to consider for TR30, though it is probably better in most cases to handle in collation so that it is not a destructive operation to the points that are present....

Mike, could you explain how the concept of "collation" fits in here? I always thought it was simply a fancy word for "sorting order".
I always imagined the technique to be: Fold both the searched string and the index string and then compare them, thus eliminating unnecessary details introduced by those Unicode geeks.

And yeah, I guess you folks *do* have the privilege of using your own terms :) This problem goes beyond the domain of Microsoft -- as you mentioned, this is also a problem with Google. While at it, I'd also wish Wikipedia's article names were insensitive to EN DASH vs. HYPHEN-MINUS vs. MAQAF differences, so we could use the proper character (e.g. the fancy HEBREW PUNCTUATION MAQAF) in the article name and yet make it feasible for normal people to stumble on it by entering its name (without using Character Map, funny key-combos, 20-meter-long Unicode keyboards -- heck, without at all being aware of the difference between Maqaf and Minus). It'd be great if we could tell every such project to simply implement the (not yet-)standard set of Unicode character foldings and be done with it.
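For readers who want to see the fold-both-strings-then-compare idea from this comment in concrete form, here is a minimal Python sketch. It is only an illustration under some assumptions: the function names (`fold_nonspacing`, `folded_equal`) are made up for this example, the sample word is a stand-in rather than the strings from the original post, and "folding" here simply means dropping every character in Unicode general category Mn after canonical decomposition, which is much cruder than anything TR30 proposed.

```python
import unicodedata


def fold_nonspacing(text: str) -> str:
    """Return a copy of text with nonspacing marks (category Mn) removed.

    The text is decomposed first (NFD) so that marks hidden inside
    precomposed characters are exposed and dropped as well.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def folded_equal(a: str, b: str) -> bool:
    """Compare two strings after folding both of them."""
    return fold_nonspacing(a) == fold_nonspacing(b)


# Illustrative sample only (not the strings from the post):
# the same Hebrew word with and without vowel points.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
bare = "\u05e9\u05dc\u05d5\u05dd"

print(pointed == bare)              # plain comparison sees two different strings
print(folded_equal(pointed, bare))  # folded comparison treats them as the same
```

Note that the folded copy no longer carries the points at all, which is the destructive side of this approach: anything that needs the original text has to keep it separately.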

Hi Ilya,

It is actually described in the post above -- these items are given a secondary weight in sorting -- so if you ignore secondary weights, then the two strings will be considered equal, without having to modify the underlying strings.
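To make the contrast with folding a bit more tangible, here is a toy Python sketch of the weight-level idea. It is emphatically not how Windows builds sort keys: the names `toy_sort_key` and `find_matches` are hypothetical, raw code points stand in for real primary weights, and "secondary weight" is approximated as "any nonspacing mark (category Mn)". The only point it tries to show is that the marks live in a separate level of the key, so a comparison can skip that level while the stored strings stay untouched.

```python
import unicodedata


def toy_sort_key(text: str, ignore_nonspace: bool = False):
    """Build a two-level key: (primary, secondary).

    primary   -- code points of the base characters (stand-in weights)
    secondary -- (index of base character, mark code point) pairs
    With ignore_nonspace=True the secondary level is left empty,
    loosely mimicking a comparison that ignores secondary weights.
    """
    primary, secondary = [], []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            # Attach the mark to the most recent base character.
            secondary.append((len(primary) - 1, ord(ch)))
        else:
            primary.append(ord(ch))
    if ignore_nonspace:
        return (tuple(primary), ())
    return (tuple(primary), tuple(secondary))


def find_matches(needle: str, candidates, ignore_nonspace: bool = False):
    """Return the candidates, unmodified, whose key equals the needle's key."""
    key = toy_sort_key(needle, ignore_nonspace)
    return [c for c in candidates if toy_sort_key(c, ignore_nonspace) == key]


# Illustrative sample only: the same Hebrew word with and without points.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
bare = "\u05e9\u05dc\u05d5\u05dd"
documents = [pointed, bare]

exact = find_matches(bare, documents)
loose = find_matches(bare, documents, ignore_nonspace=True)
print(len(exact), len(loose))
```

Because the matches come back as the original strings, the points are still there for display or for a later, stricter comparison; the caller simply chooses per search whether the secondary level counts.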