Pretending the vowels aren't there

by Michael S. Kaplan, published on 2006/03/20 09:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/03/20/555320.aspx


The other day, Suzanne was talking about Hebrew searches (it was good that she finally posted, or else people might have started talking about Suzanne searches?).

So let us look at the issue a bit. The problem is that the two strings:

ארץ־ישראל

and

אֶרֶץ־יִשְׂרָאֵל

are at some level the same string and in some cases it is entirely reasonable to want a search to find them both.

So, let's look at the sort keys for them, with vowels:

12 02 12 1a 12 17 07 6b 12 0b 12 1b 12 1a 12 02 12 0e 01 2a 2a 02 02 28 57 2c 29 01 01 01 00

and without them:

12 02 12 1a 12 17 07 6b 12 0b 12 1b 12 1a 12 02 12 0e 01 01 01 01 00

and now (wait for it!) with vowels, but with the NORM_IGNORENONSPACE flag included so that the secondary weights are ignored:

12 02 12 1a 12 17 07 6b 12 0b 12 1b 12 1a 12 02 12 0e 01 01 01 01 00

Ah, there is our answer -- if we have a search capability to optionally ignore the secondary weights, it will find both strings.
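
To make the idea concrete, here is a toy sketch in Python of a two-level sort key. It is not how Windows builds its keys: the weights are just code points, and the name toy_sort_key and its ignore_nonspace parameter are made up for illustration. It assumes that every nonspacing mark (general category Mn, which covers the Hebrew points and the trope marks) contributes only to the secondary level.

```python
import unicodedata

# The two strings from the post, spelled with escapes to avoid bidi surprises.
PLAIN = "\u05d0\u05e8\u05e5\u05be\u05d9\u05e9\u05e8\u05d0\u05dc"
POINTED = ("\u05d0\u05b6\u05e8\u05b6\u05e5\u05be"
           "\u05d9\u05b4\u05e9\u05b0\u05c2\u05e8\u05b8\u05d0\u05b5\u05dc")


def toy_sort_key(text, ignore_nonspace=False):
    """Return a (primary, secondary) pair of weight tuples. Toy model only."""
    primary = []    # one weight per base character
    secondary = []  # the nonspacing marks attached to each base character
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            # A mark never touches the primary level; it is recorded on the
            # secondary level unless the caller asked to ignore it.
            if not ignore_nonspace and secondary:
                secondary[-1] += (ord(ch),)
        else:
            primary.append(ord(ch))
            secondary.append(())
    return tuple(primary), tuple(secondary)


def same_text(a, b, ignore_nonspace=False):
    """Compare two strings by their toy sort keys, leaving both untouched."""
    return toy_sort_key(a, ignore_nonspace) == toy_sort_key(b, ignore_nonspace)
```

With these definitions, same_text(PLAIN, POINTED) is false, while same_text(PLAIN, POINTED, ignore_nonspace=True) is true, and neither string is modified along the way.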

Note that these same principles apply to the trope marks that are used to indicate how to chant in Hebrew -- they get ignored via the same flag.

(You can ignore symbols to get rid of that hyphen-looking thing in there, the Maqaf!)

This option does not exist in Google or Live.com, and although it appears to exist in Word, it does not work.

Ah well, something everyone can work on for the future! :-)

 

This post brought to you by "ל" (U+05dc, a.k.a. HEBREW LETTER LAMED)


# Ilya Konstantinov on 20 Mar 2006 11:12 AM:

Well, that's basically what Unicode calls character folding (http://www.unicode.org/reports/tr30/). It's peculiar, though, that there's no character folding for eliminating Hebrew diacritics (a.k.a. Nikkud).

# Michael S. Kaplan on 20 Mar 2006 11:16 AM:

Hi Ilya....

Well, I think since it has existed in Windows since long before it did in Unicode, that we'll probably stick with our own ways of talking about it. :-)

It might be a useful potential addition to consider for TR30, though it is probably better in most cases to handle it in collation so that it is not a destructive operation on the points that are present....

# Ilya Konstantinov on 20 Mar 2006 12:45 PM:

Mike, could you explain how the concept of "collation" fits in here? I always thought it was simply a fancy word for "sorting order".
I always imagined the technique to be: Fold both the searched string and the index string and then compare them, thus eliminating unnecessary details introduced by those Unicode geeks.

And yeah, I guess you folks *do* have the privilege of using your own terms :) This problem goes beyond the domain of Microsoft -- as you mentioned, this is also a problem with Google. While at it, I'd also wish Wikipedia's article names were insensitive to EN DASH vs. HYPHEN-MINUS vs. MAQAF differences, so we could use the proper character (e.g. the fancy HEBREW PUNCTUATION MAQAF) in the article name and yet make it feasible for normal people to stumble on it by entering its name (without using Character Map, funny key-combos, 20-meter-long Unicode keyboards -- heck, without being aware at all of the difference between Maqaf and Minus). It'd be great if we could tell every such project to simply implement the (not-yet-)standard set of Unicode character foldings and be done with it.
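
For comparison, the folding technique described in this comment might be sketched in Python as follows. The fold function and the DASH_LIKE table are hypothetical and are not the foldings defined in TR30; the sketch simply assumes that the folding wanted is "drop every nonspacing mark (general category Mn) and map a few dash-like characters to HYPHEN-MINUS".

```python
import unicodedata

# Hypothetical table: dash-like characters folded to HYPHEN-MINUS.
DASH_LIKE = {
    "\u05be": "-",  # HEBREW PUNCTUATION MAQAF
    "\u2013": "-",  # EN DASH
    "\u2212": "-",  # MINUS SIGN
}

PLAIN = "\u05d0\u05e8\u05e5\u05be\u05d9\u05e9\u05e8\u05d0\u05dc"
POINTED = ("\u05d0\u05b6\u05e8\u05b6\u05e5\u05be"
           "\u05d9\u05b4\u05e9\u05b0\u05c2\u05e8\u05b8\u05d0\u05b5\u05dc")


def fold(text):
    """Return a folded copy of text: marks removed, dash-likes unified."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if unicodedata.category(ch) == "Mn":
            continue  # drop nikkud, trope marks and other nonspacing marks
        out.append(DASH_LIKE.get(ch, ch))
    return "".join(out)


def folded_equal(query, candidate):
    """Fold both the searched string and the indexed string, then compare."""
    return fold(query) == fold(candidate)
```

Here folded_equal(PLAIN, POINTED) is true, and a query typed with a plain keyboard hyphen also matches. Note that fold returns a new string with the points gone, so anything that stores only the folded form has lost them.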

# Michael S. Kaplan on 20 Mar 2006 1:05 PM:

Hi Ilya,

It is actually described in the post above -- these items are given a secondary weight in sorting -- so if you ignore secondary weights, then the two strings will be considered equal, without having to modify the underlying strings.
