Searching for supplementary characters

by Michael S. Kaplan, published on 2005/10/24 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/10/24/483965.aspx

What is happening here is obvious if you consider that for most purposes 'Unicode' in Microsoft applications is UTF-16LE. This means that supplementary characters such as U+10000 and U+10001 are actually stored as the surrogate pairs U+d800 U+dc00 and U+d800 U+dc01. Using them in a range or with one-character wildcards will always fail, since these two-code-unit entities do not meet the notion of a 'single character'.
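To see the surrogate pairs for yourself, here is a minimal Python sketch. The helper name utf16_code_units is mine, purely for illustration; it just encodes a string as UTF-16LE and lists the resulting 16-bit code units.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    # Every UTF-16 code unit is two bytes, stored little-endian here.
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


for ch in ("\U00010000", "\U00010001"):
    units = utf16_code_units(ch)
    print(f"U+{ord(ch):04X} ->", " ".join(f"U+{u:04X}" for u in units))
# U+10000 -> U+D800 U+DC00
# U+10001 -> U+D800 U+DC01
```

One code point in, two code units out, which is why a one-character wildcard that counts code units never sees a match.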

Now this does not mean that it isn't a bug; if not a bug, then it is at least a design limitation, since other parts of the system work so hard to act as if a surrogate pair is a single character. The fix would involve extending the underlying range checking to allow multiple code units when surrogate pairs are used, and treating a 'single character replace' as applying to either one UTF-16 code unit or one supplementary character. This is not impossible, but it is certainly a different class of solution, and the performance hit may or may not be worth the trouble.
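A rough sketch of what such a fix could look like, in Python. This is not how Word works internally; the function names are hypothetical and the input is assumed to be a list of UTF-16 code units. The idea is simply to group each surrogate pair into one item before doing any range check or wildcard match.

```python
def split_characters(units):
    """Group UTF-16 code units so each surrogate pair becomes one item."""
    chars = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            chars.append((unit, units[i + 1]))
            i += 2
        else:
            # BMP character, or an unpaired surrogate left as-is.
            chars.append((unit,))
            i += 1
    return chars


def to_code_point(char):
    """Convert a one- or two-unit item to its Unicode code point."""
    if len(char) == 2:
        high, low = char
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
    return char[0]


def in_range(char, first, last):
    """Range check on code points rather than on code units."""
    return first <= to_code_point(char) <= last


# "A", U+10001 as a surrogate pair, then "B"
units = [0x0041, 0xD800, 0xDC01, 0x0042]
chars = split_characters(units)
print(len(chars))  # 3
matches = [c for c in chars if in_range(c, 0x10000, 0x10001)]
print([f"U+{to_code_point(c):04X}" for c in matches])  # ['U+10001']
```

The extra pass over the text is where the performance question comes from: every match position now has to be checked for a pair boundary.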

One thing that is certain: Word is not supporting combining characters either. Searching for U+00e5 (our good friend 'a Ring') does not find U+0061 U+030a ('a plus combining ring'), which would be just as reasonable an extension here. It actually helps put the problem in context: clearly Word is not using the user's notion of a character; it is using the term to mean individual UTF-16 code units.
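For comparison, here is a small Python sketch of the difference between a raw comparison and a normalized one, using the standard unicodedata module. The find_normalized helper is hypothetical, and normalizing to NFC is only one possible approach; it is not a claim about what Word or FindNLSString does.

```python
import unicodedata

precomposed = "\u00e5"   # a with ring above, one code point
decomposed = "a\u030a"   # a followed by COMBINING RING ABOVE

print(precomposed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True


def find_normalized(haystack, needle):
    """Search after normalizing both strings to NFC.

    The returned index refers to the normalized haystack,
    not to the original string.
    """
    nh = unicodedata.normalize("NFC", haystack)
    nn = unicodedata.normalize("NFC", needle)
    return nh.find(nn)


print(find_normalized("ba\u030ar", "\u00e5"))  # 1
```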

Though maybe that sort of thing will change if FindNLSString starts to see wider usage in applications after it has been out there for a while. It allows the user's notion of a character to have more control here. Although some may balk at how the behavior will change with user settings rather than being consistent across all situations, I think the behavior makes sense and is certainly explainable.

In the meantime, search in Word is not using the user notion of a 'character', and that is that. If enough people start finding this blog entry, it may reach the point of being a 'known limitation' but a KB article might serve that purpose more effectively. :-)

This post brought to you by "𐀁" (U+10001, a.k.a. LINEAR B SYLLABLE B038 E)

The fileformat.info pages for U+10001 are broken, by the way. They contain meaningless byte sequences such as EDA080 and EDB081, which are UTF-8-style encodings of the surrogate code points that are meant to be used only by UTF-16. The correct UTF-8 encoding is F0908081.
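The byte sequences are easy to check in Python. This is a small sketch with helper names of my own; the 'surrogatepass' error handler is used only to force Python to encode a lone surrogate, which a conforming UTF-8 encoder would otherwise refuse to do.

```python
def utf8_hex(text):
    """Well-formed UTF-8 encoding of text, as upper-case hex."""
    return text.encode("utf-8").hex().upper()


def surrogate_hex(surrogate):
    """Encode a lone surrogate as if it were an ordinary code point."""
    return surrogate.encode("utf-8", "surrogatepass").hex().upper()


print(utf8_hex("\U00010001"))   # F0908081
print(surrogate_hex("\ud800"))  # EDA080
print(surrogate_hex("\udc01"))  # EDB081
```

So the six bytes on the page are what you get by encoding each half of the surrogate pair separately instead of encoding the code point itself.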

On a system with a Linear B font available you'll see that the "browser test page" shows the HTML escaped versions correctly but the inline UTF-8 encoding is wrong in both places that it is used.

I've sent feedback telling the owner about this problem; meanwhile, you might want to consider linking elsewhere for characters outside the BMP.

Office is obviously very difficult to fix. The last time I looked, it didn't really seem to know what a character was; everything appeared to be either hard-coded or reliant on font encodings. The result is that the meaning and appearance of an Office document is specific to the context (locale, installed fonts etc.) of the system that created it. What a headache.

Also, I wonder if collation order is available for these characters anyway? The original questioner seems to assume, perhaps rather naively, that the range U+10000 to U+10001 means exactly those two characters. But the users of Linear B, were they around to tell us, might be of the opinion that "of course" anyone who specified such a range intended to include the as-yet unexplained symbol B018, aka U+10050, just as a French user would expect to find é when searching for a-z. I presume Word gets the acute accent example right?

If Windows doesn't include collation for the CJK extension, then it's sort-of meaningless to specify that you want a "range" of such characters; you're implicitly peering beneath the surface of the system to look at codepoints.

All supplementary characters do have a code point ordering that is actually good enough for many purposes, and have had this since Windows XP (before that they had no weight). Range checking certainly would be a useful feature if it were available....

The fileformat.info page contains some valid info, including valid UTF-16 data, which is the encoding of primary interest on Windows anyway. So I do not have plans to change the mapping, especially since it is still the best all-around way to find images for lots of these characters!

In reply to Nick's second comment, it does work that way on my English (XP Pro, Off 2K3) system. But exactly HOW isn't documented in the Office help.

A search for [a-z] finds both accented and unaccented lower-case letters, as well as ç.

Searching for [e-f] will find e, é, ê, è and f.

A search for [e-é] will find e or é but not ê or è. A search for [e-ê] will find e, é, ê and also è. Similarly, a search for [c] alone won't find ç, but a search for [c-d] will.

So it appears to a naive (sorry, naïve) end-user like me that accented characters are "between" the regular form of the letter and the next letter—that é, ê and è are after e and before f, though it's hard to figure out that order.
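One way to picture that "between" behavior is a two-level sort key: base letter first, accents second. The Python sketch below is only an illustration of that idea, with hypothetical names; it is not Word's algorithm, and the order it gives among the accented forms (combining-mark code point order, so è before é) does not match what I observed above for [e-é].

```python
import unicodedata


def sort_key(ch):
    """Two-level key: base letter first, then any combining marks."""
    decomposed = unicodedata.normalize("NFD", ch)
    return (decomposed[0], decomposed[1:])


def in_range(ch, first, last):
    """True if ch falls between first and last under sort_key."""
    return sort_key(first) <= sort_key(ch) <= sort_key(last)


print(in_range("\u00e9", "e", "f"))  # True: e-acute sits between e and f
print(in_range("\u00e7", "c", "d"))  # True: c-cedilla sits between c and d
print(in_range("\u00e7", "c", "c"))  # False: [c] alone does not reach it
print(in_range("\u00ea", "a", "z"))  # True: e-circumflex is inside a-z
```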

Of course it doesn't work that way. Word won't let you search for the range [ê-z], complaining that "The Find What text contains a range that is not valid." I assume that is because ê comes after z in Unicode order.
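That guess is easy to check against the code point values. The sketch below assumes, as I do above, that the validity test is a plain code point comparison of the two endpoints; the function name is made up and nothing here is confirmed Word behavior.

```python
def is_valid_code_point_range(first, last):
    """A range is valid here only if first does not sort after last."""
    return ord(first) <= ord(last)


e_circumflex = "\u00ea"

print(f"U+{ord('z'):04X}", f"U+{ord(e_circumflex):04X}")  # U+007A U+00EA
print(is_valid_code_point_range("a", "z"))           # True
print(is_valid_code_point_range(e_circumflex, "z"))  # False
print(is_valid_code_point_range("e", e_circumflex))  # True
```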

So when Word looks at such a range, it seems to be guessing what kind of collation to use, and there doesn't seem to be a way for the end-user to control it, or even to know what is going on.

Which raises, for me, a lot of questions about how consistent a search operation is going to be. This behavior is consistent with what a French user would expect, but might not be what users from another locale would want (and we know well from this blog how often such expectations collide).

In Word there's the added complication that all text is tagged with a Language; does this mean that a range will have a different meaning for different parts of a document, according to the language it's in (I doubt it)?

Here's why this is troubling to me:

• It's hard to know beforehand what's going to happen when you search for a range, because it isn't well-documented (bad)

• I can't be sure that the same search will work the same way on other people's systems (very bad, especially if I'm distributing VBA code)

And I haven't even started to play around with the search order for CJK ranges!