by Michael S. Kaplan, published on 2005/12/30 00:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/12/30/508157.aspx
James Brown asked in the microsoft.public.win32.programmer.international and microsoft.public.win32.programmer.gdi newsgroups:
Suppose I have the following two Arabic codepoints:
U+0648 "arabic letter waw" U+0650 "arabic letter kasra"
These render as a single glyph with Uniscribe
When pasted into Notepad, the cursor (and selection highlight) can traverse into the middle of the cluster.
When pasted into Wordpad, the cursor _cannot_ move into the middle of these characters.
Which is the correct (or desirable) behaviour?
Maybe someone can even explain, what significance does it have for the cursor to move into the *middle* of a grapheme cluster - how does the user know which character he/she has selected??
thanks, James
Excellent question, James!
The desirable behavior is what you are describing as the WordPad behavior, to a point. Although if I paste a string of 12 of these pairs of characters (وِوِوِوِوِوِوِوِوِوِوِوِ) into WordPad, it will treat them as a single unit, which is not what I would call desirable. :-(
The Notepad behavior you describe is also not preferred; in all cases other than the BACKSPACE character (for the reasons I describe here), you would want to have movement jump the text element boundaries, which would be those two characters you mentioned....
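To make "jump the text element boundaries" concrete, here is a minimal Python sketch. It assumes a simplified definition of a text element (one base character plus any combining marks that follow it), which is narrower than the full grapheme cluster rules in UAX #29. The function names are hypothetical and this is not how Notepad, WordPad or Uniscribe implement it. Movement is in logical order, and BACKSPACE removes a single code point rather than the whole element.

```python
import unicodedata


def caret_stops(text):
    """Offsets (in code points) where the caret may rest.

    Assumption: a text element is a base character followed by any run
    of combining marks (general category Mn, Mc or Me).
    """
    stops = [0]
    for i in range(1, len(text)):
        # A combining mark never starts a new text element.
        if not unicodedata.category(text[i]).startswith("M"):
            stops.append(i)
    if text:
        stops.append(len(text))
    return stops


def move_forward(text, caret):
    """Next caret stop in logical order, clamped at the end of the text."""
    return next((s for s in caret_stops(text) if s > caret), len(text))


def move_backward(text, caret):
    """Previous caret stop in logical order, clamped at the start."""
    return next((s for s in reversed(caret_stops(text)) if s < caret), 0)


def backspace(text, caret):
    """Delete one code point before the caret, not the whole text element."""
    if caret == 0:
        return text, 0
    return text[:caret - 1] + text[caret:], caret - 1
```

With two WAW + KASRA pairs (`"\u0648\u0650\u0648\u0650"`), the only caret stops under this sketch are 0, 2 and 4, so the caret never lands between a WAW and its KASRA, while `backspace` at offset 2 removes just the KASRA.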
The bad news is that I can reproduce the behavior you describe in Windows Server 2003 SP1:
The worse news is that I can reproduce the WordPad behavior I describe above in Windows Server 2003 SP1 and XP SP2.
But the good news is that in XP SP2, Notepad behaves correctly and the cursor does not appear in the middle of the character....
In IE6, I currently get the character splitting behavior. You can test out your own browser and version with the textbox below -- put the cursor in and move back and forth to see what happens:
At least products are getting better though (the Vista version of Uniscribe has all of the XP SP2 updates and more!).
This post brought to you by "و" (U+0648, a.k.a. ARABIC LETTER WAW)
#James Brown on Friday, December 30, 2005 1:01 PM:
A further observation. With Notepad, I can use the mouse to move into the center of the glyph (i.e. between the two codepoints). But when I use the keyboard the caret always moves over the glyph-clusters.
Strange. The problem is I don't understand the Arabic language so I have no idea if positioning the caret (with mouse) in the middle of the cluster is at all meaningful..
James
#Michael S. Kaplan on Friday, December 30, 2005 2:50 PM:
Well, meaningful is a relative term, of course. :-)
Although I would not expect a click right in the middle of a cluster to be respected *that* much (since after all I cannot click in the middle of a U and have that respected).
Or more to the point, since I cannot click in the middle of "Ů" (U+0055 U+030a), I would not expect it to work for other text elements. The cursor movement is a lot more intuitive than the direct insertion via mouse click....
#Michael S. Kaplan on Friday, December 30, 2005 8:40 PM:
For what it is worth, if any testers from typography are around this may be worth putting a bug in, unless I am misunderstanding how it ought to work.... :-)
This is interesting. Using IE 6.0, I can only click between and select the individual waws. However, using the arrow keys it moves one codepoint for each keypress. This means that if I press Left once it moves to the middle of the letter waw, then Left again to put the caret between letters. If I hold down shift, it will only select whole graphemes, but I have to press Left twice to get the caret to move.
#James Brown on Saturday, December 31, 2005 7:50 AM:
I think what they are saying is that the Notepad behaviour I described is desirable?!!
#Nick Lamb on Saturday, December 31, 2005 3:49 PM:
I don't think so James, although their choice of example in mentioning Notepad is probably bad.
A ligature is not the same thing as a character. Look at the squiggle "fi" (fi) in Latin script. It's a typographic convention rather than a basic unit of the language. There's no way for me to type it on this keyboard, but my typesetting software uses it automatically in printed documents when the alternative would be an ugly "near miss" of two alphabetic characters.
So, the example in this post by Michael is correct, the U+0650 KASRA isn't a separate character, so your cursor should ignore it and you shouldn't be able to select it. But the U+0627 ALEF and U+0644 LAM in the IBM example are separate characters, so you can move the cursor between them even though they're drawn as a single squiggle.
They're both correct, and there's software out there which gets this right, GTK+ provides default entry and text input controls which do this correctly for example. Presumably if Notepad functions properly in XP SP2 then this means the Windows Common Controls now also get it right. Is that right Michael?
#Michael S. Kaplan on Sunday, January 01, 2006 11:21 PM:
"Presumably if Notepad functions properly in XP SP2 then this means the Windows Common Controls now also get it right. Is that right Michael?"
Possibly, Nick -- although I would be afraid to try and predict what the Shell common controls will do on a given day. :-)
I've been doing some more experimentation with Uniscribe and I've found that the two Arabic characters I originally mentioned are in fact rendered as *two* glyphs. Uniscribe does however classify them as a single cluster but they are drawn as two - it just looks as if it is a single character. Perhaps this is what is causing Notepad (and the ScriptString API) to allow the cursor to be placed in the middle?
#Michael S. Kaplan on Monday, January 02, 2006 8:22 AM:
Hi James,
If connection points between a Latin letter and a diacritic are poorly defined then they will sometimes not appear to be connected at all -- yet we would never think of that as two characters. The behavior in Notepad must have some kind of explicit history....
What's the story about XP x64? I would imagine that it should be treated like Win2003 SP1 in this matter, but it would be far more interesting if it actually behaves like normal XP.
#Michael S. Kaplan on Wednesday, January 04, 2006 10:22 AM:
That is an excellent question, CN. I believe it would act more like Server 2003 SP1 since it was built out of that code tree, rather than picking up the features of XP SP2....
Today I was able to briefly experiment with WordPad on a PC running XP. The behaviour was most remarkable. The sense of the cursor keys seemed to be reversed in the RTL text?
#Michael S. Kaplan on Thursday, January 05, 2006 5:16 PM:
Indeed, Nick! And believe it or not, that is what people using RTL languages expect on computers.
"believe it or not, that is what people using RTL languages expect on computers."
Your competitors don't seem to agree. The Windows logical-caret is by far easier to implement than the visual caret apparently recommended by Apple (in Mac OS), Sun (in Java) and by the independent i18n teams for projects like GTK+. It's not impossible that they're all relying on an incorrect intuition, but it does seem more likely that the early Windows BiDi code was just lazy...
#Michael S. Kaplan on Friday, January 06, 2006 11:22 AM:
Actually, it mostly means that the others you refer to are not following the Unicode Bidi algorithm. :-)
How so? A brief scan of UAX #9 doesn't find insertion points, carets/ cursors or any other relevant terminology mentioned...
#Michael S. Kaplan on Friday, January 06, 2006 12:48 PM:
Ask your average Arabic or Israeli person who uses computers about their preference in regard to logical vs. visual order.
Visual order is the primitive that came before technology supported their languages, and all of the other matters fall out of respecting logical order.
I don't think that answers my question Michael. Where does UAX #9 specify the Windows logical-caret rules? If that's not what you meant, please clarify.
It seems unlikely that I can find an ample supply of suitable computer users who don't use one of the systems already discussed. You may recall writing an article about this sort of problem recently...
#Michael S. Kaplan on Friday, January 06, 2006 1:53 PM:
Hi Nick, People who use Hebrew or Arabic think of it with logical ordering. It models the reality of their language -- that their text is right to left. I can't speak for the people using systems that were designed prior to the existence of logical ordering other than to say that THEY are the early folks too IMHO lazy to encode correctly who think that spelling a letter backwards is going to somehow be easier. Now the work to handle selection and caret movement is a natural extension to logical ordering; the work to do it visually is a natural extension to that IMHO lazy effort to spell things backwards, the one that does not match the way the languages actually work....
"It models the reality of their language -- that their text is right to left."
You're being a bit obtuse Michael, the phenomenon we're talking about here is specifically that the left arrow moves right, and vice versa. What I wanted to know was whether there was any basis for this except the historical fact that Microsoft's earliest implementation did this, and the answer seems to be "No".
Sure enough, my /mouse cursor/ moves across Arabic text normally, it's only the left and right arrow keys which are swapped. This is actually connected to the original post, internally the software is just tracking cursor position as an offset into a character array, but you can't handle the WAW + KASRA combination this way, and nor can you make the arrow keys work consistently. It also of course produces problems outside the BMP, but we've covered those previously.
#Michael S. Kaplan on Friday, January 06, 2006 6:15 PM:
My opinion is one that is based on talking to people who use computers for both Arabic and Hebrew as their native language. And they do NOT find it confusing, and in their opinion the visual support *is*.
I am not being obtuse, but this is obviously a bit too much to try to cover in comments. I will do a new post describing what I am trying to say over the weekend.
#James Brown on Thursday, January 12, 2006 6:39 PM:
Well I'm still confused about this whole thing :-) I'm now looking at Word 2003 and notice that it has the same behaviour as Notepad. I am using the following Unicode code-points from the Arabic script:
064a 064f 0633 0627 0648 0650 064a
I have included a link to 2 GIFs on my site which illustrate this string rendered in Word2003:
For the image above, I set the "Show Diacritics" option in Word's Complex-Scripts option-page tab. The diacritics (or whatever they are!!) are shown in orange.
This image shows the mouse-selection has moved half-way into one of the glyph clusters which contains the diacritic.
With Word, the cursor-keys move cluster-by-cluster. However the mouse allows the caret to be placed in the middle of clusters.
The low-level Uniscribe functions (ScriptShape, ScriptXtoCP) allow the caret to be placed mid-cluster. I still can't figure out if this is right or not. I'm trying to replicate this behaviour in my unicode text-editor - it's easy to get the caret placed mid-cluster (because that's what Uniscribe always does, I can't seem to tell it not to). The difficulty is rendering the glyph-cluster to appear as two "characters" even though it is really one (using two code-points though)...I have to draw the cluster twice, over the top of each other, using clipping to get the desired effect...it's nasty. But someone at Microsoft obviously thinks it's right because Word, and any app which uses ScriptString API (that includes Notepad) exhibits this behaviour. Help!
Michael, would you like to comment on this further?
ok I'm getting closer to understanding how the caret is getting placed in the middle of a cluster. This is a quote from the MSDN docs on Uniscribe in the section 'Notes on ScriptXtoCP and ScriptCPtoX':
"Cluster information in the logical cluster array is used to share the width of a cluster of glyphs equally among the logical characters they represent....For Arabic and Hebrew, caret positions are interpolated with clusters."
So what this means is, for certain scripts (Arabic+Hebrew I guess!), the caret position is obtained by dividing the cluster-width by the number of code-points which make up that cluster, and the caret is then "snapped" to one of these finer-grain boundaries. I guess this is also how text-renderers know how to draw a selection-highlight part-way through a cluster.
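The interpolation described above can be sketched in a few lines of Python. This only models the arithmetic in the quoted documentation (cluster width shared equally among its code points); it does not call Uniscribe, the function and parameter names are made up for illustration, and it ignores reading direction, so for right-to-left runs the x axis would need to be mirrored.

```python
import math


def interpolated_caret_x(cluster_x, cluster_width, chars_in_cluster, index):
    """x position of the caret placed before code point `index` of a cluster.

    The cluster's width is shared equally among its code points, so the
    caret positions are evenly spaced between the cluster's two edges.
    """
    return cluster_x + cluster_width * index / chars_in_cluster


def snap_to_boundary(x, cluster_x, cluster_width, chars_in_cluster):
    """Index (0..chars_in_cluster) of the interpolated boundary nearest to x."""
    step = cluster_width / chars_in_cluster
    # Round half up, then clamp to the cluster's own boundaries.
    index = math.floor((x - cluster_x) / step + 0.5)
    return max(0, min(chars_in_cluster, index))
```

For a 12-pixel-wide cluster made of two code points, the interior caret position falls at x = 6, and a mouse click at x = 4 snaps to that mid-cluster boundary, which matches the behaviour seen when clicking inside the cluster.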
well I understand the mechanism, but still don't understand why it is necessary to split a discrete cluster into segments like this. Presumably it only makes sense for Arabic+Hebrew.