Encodings in Strings are Evil Things (Part 7.1)

by Michael S. Kaplan, published on 2005/01/11 01:27 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/10/350450.aspx


I just read Ryan's post Encodings in Strings are Evil Things (Part 7), and all I can do is hope that anyone who reads it will at the very least have also read Part 5 of the series, so they know that sometimes these code points do not represent whole characters, as far as the user is concerned.

This is hard enough to handle in the Latin script when someone may be really offended at treating "Å" as an "A with a funny circle on it", or "Ł" as an "L that someone stabbed", but it gets much harder to look at "ใท์" or "ஷ்ரீ" and make such judgements, especially when there are individual letters like "い" that look like maybe they ought to be two letters from a visual standpoint.

My biggest worry is of course the message that it sends to have such a template class while my group works so hard to explain that UTF-32 does not solve the problems of what the user thinks of as a character, which can require two or three or four or sometimes even more code points.... and that not even the StringInfo class in the .NET Framework can handle the user perception of a character (since sometimes even what Unicode thinks of as separate letters are thought by users to go together to make one "character").

But maybe I'll get lucky and Ryan will point out somewhere in a comment that strings like "U+0041 U+030a" or "U+0e43 U+0e17 U+0e4c" or "U+0bb7 U+0bcd U+0bb0 U+0bc0" or "U+0f60 U+0f56 U+0f60 U+0f74 U+0f60 U+0f72 U+0f60 U+0f7c" might be thought of by the user as a single "character", even if some would count them here as being 2, 3, 4, or 8 code points long in Swedish, Thai, Tamil, or Tibetan. :-)
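To make those counts concrete, here is a small Python sketch (Python is used only for illustration; it is not what Ryan's series is written in). It counts the code points in the four sequences listed above and also shows that NFC normalization folds the Swedish pair into the single code point U+00C5. The names `SAMPLES` and `code_point_count` are made up for this example.

```python
import unicodedata

# The four sequences from the paragraph above, keyed by language.
SAMPLES = {
    "Swedish": "\u0041\u030a",
    "Thai": "\u0e43\u0e17\u0e4c",
    "Tamil": "\u0bb7\u0bcd\u0bb0\u0bc0",
    "Tibetan": "\u0f60\u0f56\u0f60\u0f74\u0f60\u0f72\u0f60\u0f7c",
}


def code_point_count(text):
    # A Python 3 str is a sequence of code points, so len() counts them.
    return len(text)


for language, text in SAMPLES.items():
    # NFC composes a sequence only where a precomposed code point exists.
    composed = unicodedata.normalize("NFC", text)
    print(language, code_point_count(text), code_point_count(composed))
```

Neither number answers the question the user is asking, which is how many "characters" they see.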


# Ryan Myers [MSFT] on 11 Jan 2005 2:11 PM:

Hey -- glad to see you're reading :) Raymond Chen has pointed to you in the past as the local authority for Unicode, so I've been meaning to track you down and ask for criticism.

Yeah, the fundamental problem of code points versus grapheme clusters is constantly at the back of my mind. (Especially when it comes to combining characters that will graphically span the previous AND following base characters.) I've been guilty in the past of assuming that readers bother to read the entire series; perhaps I should start putting qualifiers in :P I've always intended rmstring to be a class for someone who's aware of Unicode's mechanics.

One of the UAXes has a rough algorithm for estimating what sets of code points constitute a grapheme cluster, based on known ranges of points. I'm debating implementing that as an alternate set of iterators, but haven't decided yet.

# Michael Kaplan on 11 Jan 2005 2:53 PM:

Well, this is essentially what the StringInfo class does, but it does not always meet linguistic expectations (like for that Tamil Sri I put up there).

So it may or may not be worth it....
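As a rough illustration of why a range-based rule falls short, here is a deliberately naive Python sketch. It is an assumption-laden stand-in, not the UAX algorithm and not what StringInfo actually does: it simply attaches every combining mark (general category M*) to the base that precedes it. The function name `naive_clusters` is hypothetical.

```python
import unicodedata


def naive_clusters(text):
    """Group each combining mark with the preceding base character.

    This is a simplification for illustration only; real grapheme
    segmentation has many more rules.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch  # mark: extend the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


a_ring = "\u0041\u030a"
tamil_sri = "\u0bb7\u0bcd\u0bb0\u0bc0"

print(len(naive_clusters(a_ring)))
print(len(naive_clusters(tamil_sri)))
```

The A-with-ring pair comes out as one cluster, but the Tamil sequence comes out as two, because the rule sees two base letters, each carrying one mark. A reader who thinks of that sequence as one "character" would not agree with the split.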

# Ryan Myers [MSFT] on 11 Jan 2005 3:35 PM:

Hrm. One of my planned hedges was to provide a set of predicates for std::sort that would implement various collation algorithms (starting with a functor that just called your Win32 functions with a specific locale) -- perhaps a likewise reasonable approach is to create an iterator adaptor that can advance based on locale-specific interpretations of grapheme clusters.

It's nasty, but it should work, and if push comes to shove I can always use the UAX algorithm as a default template argument.

# Michael Kaplan on 11 Jan 2005 3:57 PM:

Hmmm.... interesting. Of course figuring out where the breaks would be from an API like LCMapString is an interesting challenge that would involve unpacking the sort keys, and there is no 100% reliable way to do that (certainly no way that will be completely lossless and know that an "ae" was really an æ, for example).

I think it always comes back to interview questions, and I could see asking someone how they would design such an API and what would be possible (and what would not). I am sure this would make people think!

Of course getting people to understand sort keys well enough to grapple with the question might require a four-hour interview so I guess it would not work....

However, even if that were possible it would not always match the user expectation since (for example) in Spanish "ch" is looked at as a unique sorting element but most Spanish-speaking people would not expect this sort element to be considered a grapheme that the cursor would skip around.
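A toy Python sketch of that mismatch follows. It is not how the Win32 collation functions work; the alphabet table and the names `ALPHABET`, `RANK`, `sort_elements`, and `sort_key` are all assumptions invented for the example. It treats "ch" and "ll" as single sorting elements, the way a traditional Spanish sort does.

```python
# Toy traditional Spanish alphabet in which "ch" and "ll" are single units.
ALPHABET = ("a b c ch d e f g h i j k l ll m n \u00f1 "
            "o p q r s t u v w x y z").split()
RANK = {element: index for index, element in enumerate(ALPHABET)}


def sort_elements(word):
    """Split a word into sorting elements, keeping "ch" and "ll" together."""
    lower = word.lower()
    elements = []
    i = 0
    while i < len(lower):
        pair = lower[i:i + 2]
        if pair in ("ch", "ll"):
            elements.append(pair)
            i += 2
        else:
            elements.append(lower[i])
            i += 1
    return elements


def sort_key(word):
    # Elements outside the toy alphabet sort after everything else.
    return [RANK.get(element, len(RANK)) for element in sort_elements(word)]


words = ["chico", "cosa", "dama", "cuna"]
print(sorted(words))                # plain code point order
print(sorted(words, key=sort_key))  # "ch" sorts after every other "c" word
print(sort_elements("chico"))       # 4 sorting elements for 5 letters
```

With the toy key, "chico" moves behind "cosa" and "cuna", yet nobody editing the word would expect the cursor to jump over "ch" in one step. The sorting unit and the editing unit are simply different things.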
