Every character has a story #19: U+200c and U+200d (ZERO WIDTH [NON] JOINER)

by Michael S. Kaplan, published on 2006/02/15 11:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/02/15/532394.aspx


In the world of Unicode, it is a small irony that what usually causes the most emails to be exchanged and the most documents to be written are the characters that have no actual visible representation.

Whether it is U+feff (a.k.a. ZERO WIDTH NO-BREAK SPACE, a.k.a. the BOM), U+200e/U+200f (a.k.a. LEFT-TO-RIGHT MARK/RIGHT-TO-LEFT MARK), or many of the others, it is these characters that often have the most dangerous and troublesome influence on surrounding text.

I am going to talk about two more of them now....

They are U+200c (ZERO WIDTH NON-JOINER) and U+200d (ZERO WIDTH JOINER).

Now these characters each have a simple purpose -- the former is to suggest that the characters preceding it and following it should not try to join or ligate, and the latter is to suggest that the characters preceding it and following it should try to join or ligate.

If you have neither, then it is basically up to the font and rendering engine to decide what to do, based on their impression of the desired behavior.

And if the font and/or shaping engine wish to ignore the suggestion, they are free to do so at will -- which they often will do if they do not have any specific behavior within their understanding of the two characters.
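Since neither character has a glyph of its own, it can help to look at them at the code point level rather than on screen. The following is a minimal sketch in Python using only the standard `unicodedata` module; the `describe` helper is a made-up name for illustration, not part of any library:

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER

def describe(ch):
    """Return the code point, Unicode name, and general category of one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in (ZWNJ, ZWJ):
    print(describe(ch))
# ('U+200C', 'ZERO WIDTH NON-JOINER', 'Cf')
# ('U+200D', 'ZERO WIDTH JOINER', 'Cf')
```

Both fall in general category `Cf` (format), which is the category for characters that affect the text around them without being displayed themselves.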

ZWJ and ZWNJ are only supposed to be used to suggest visual distinctions, not ones that would change the meaning or interpretation of the characters.

They are thus supposed to be ignored in things like the Unicode Collation Algorithm and outright stripped in things like StringPrep.
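To make the effect of stripping concrete, here is a minimal Python sketch. It is not an implementation of StringPrep or of the Unicode Collation Algorithm; the `strip_joiners` helper is a hypothetical name used only for illustration, and the sample code points are the Sinhala sequence quoted in the comments below, with and without U+200d:

```python
JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ

def strip_joiners(text):
    """Drop every ZWNJ/ZWJ from text, leaving all other characters alone."""
    return "".join(ch for ch in text if ch not in JOINERS)

with_zwj = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"     # sequence containing U+200D
without_zwj = "\u0dc1\u0dca\u0dbb\u0dd3"        # same sequence, no joiner

print(with_zwj == without_zwj)                  # False
print(strip_joiners(with_zwj) == without_zwj)   # True
```

Once the joiner is removed, the two strings are indistinguishable to any later comparison, whatever a reader would have seen on screen.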

The problem is that sometimes they do convey semantic meaning or content.

Like in the native Sinhalese word for Sri Lanka, which is the first of these two strings and requires a ZERO WIDTH JOINER:

Now tell me if you think that the first one (which is the name and needs a U+200d) looks different enough from the second to be considered a significant difference to native readers.

Definitely an example where a linguistic distinction controlling how a language is rendered can quickly become a political one!

Or a reported analogous situation with U+200c and Myanmar. Or perhaps the several reported cases where Farsi appears to show similar issues.

I suppose the conclusion of all this is simple enough: one person's ignorable suggestions are another person's crucial directions. :-)

Obviously this is an issue that needs to be figured out, especially with UTR #36 and UTR #39 being relied on so heavily to provide guidance to consumers of Unicode who want to avoid spoofing and other issues.
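As a rough illustration of the kind of check a spoofing-aware consumer might start from, the Python sketch below reports every format (`Cf`) character in a string. It is an assumption-laden example, not the procedure from either UTR; `find_format_chars` and the sample filenames are hypothetical:

```python
import unicodedata

def find_format_chars(text):
    """Return (index, 'U+XXXX', name) for every Cf (format) character in text."""
    hits = []
    for i, ch in enumerate(text):
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

plain = "report.txt"
sneaky = "rep\u200dort.txt"   # likely renders the same as plain

print(find_format_chars(plain))    # []
print(find_format_chars(sneaky))   # [(3, 'U+200D', 'ZERO WIDTH JOINER')]
```

Note that this only flags the characters. Whether to reject, strip, or keep them is exactly the hard part, since some text legitimately needs them.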

Preferably in a way that does not offend anyone, linguistically or politically.

But look out for those invisible characters. As we learned in the movie Small Soldiers, they are like the wind. Just because you can't see something doesn't mean it's not there....

 

This post brought to you by U+200c and U+200d (a.k.a. ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER)


# Yosuke HASEGAWA on 16 Feb 2006 12:56 AM:

By using ZERO WIDTH (NON) JOINER or ZWNBSP (BOM) in filenames or in registry keys and values, you can create several files whose names appear to be the same.

This may cause visual problems in the security domain, so I hope Windows will disallow the use of Unicode control characters in filenames.
Of course, that includes Bidi control characters such as "RLO" too.

# Michael S. Kaplan on 16 Feb 2006 9:17 AM:

Well, take a look at http://blogs.msdn.com/michkap/archive/2006/02/16/533226.aspx which talks about this issue a bit -- definitely more complicated than disallowing a few characters! :-)

# UdaraG on 21 Apr 2006 4:24 AM:

Hi Michael;

Firstly, I am a Sri Lankan who uses Sinhala as my mother tongue, and was quite fascinated to see your use of the same in the example above!

Secondly, I am at the early stages of developing a Sinhala IME for the PocketPC, and am having problems in sending the ZWJ along with leading/trailing characters so that it could be correctly interpreted and rendered on, say Pocket Word.

1. I imported the Sinhala character range from a Unicode-compatible font (called "Malithi Web") into the standard MS Tahoma font, which I have copied over to the PPC.
2. With a rudimentary IME developed in eVC 4.0, I did an IMCallback.SendString() for the Unicode string {0x0DC1, 0x0DCA, 0x200D, 0x0DBB, 0x0DD3, NULL}.

Though this should give me (as I understood) the first form in your example above, it actually gives me the second non-ligated form with a vertical bar (which I presume to be the ZWJ) drawn in between the two characters!

What am I doing wrong?
Should I use IMCallback.SendCharEvents() instead?
Can you please point me to some code samples?

Thanks and regards,
Udara

# Michael S. Kaplan on 21 Apr 2006 9:05 AM:

Hi Udara,

I see a few problems with what you have mentioned here:

"I imported the Sinhala character range from a Unicode-compatible font (called "Malithi Web") into the standard MS Tahoma font"

Of course the licensing concern is a very real one -- I do not know whether it is legal to modify Tahoma in such a way?

Beyond that, there is the problem of making sure all of the OpenType tables are moved over properly. It is not enough to be 'Unicode-compatible', as Arial in Win95 is that; what is important is the right OT table info.

And beyond that, see http://blogs.msdn.com/michkap/archive/2005/05/19/420145.aspx which points out that Uniscribe is not present in every WinCE install, even if it is of the latest version.

Of course beyond all that would be the question of whether the updates for Sinhala that were added so recently are in the CE version of Uniscribe. They may not have the XPSP2/ELK/Vista updates just yet....

# UdaraG on 8 May 2006 7:42 AM:

Hi Michael;
Thanks for the prompt reply, and extremely sorry for the much delayed follow up!

Yes, I do understand the licensing concerns very well - myself being a software developer... - the only reason I modified the Tahoma font was since I had a small problem getting the font linking to work. :-)

Now everything is perfect, except for the ligatures - of which the Sinhala language has a number. I have tried so many devices and WinCE OS versions, including the very latest, but to no avail!

My conclusion is what you guessed right at the start - "updates for Sinhala that were added so recently are NOT in the CE version of Uniscribe".

1. Is there a way to ascertain this from anybody at MS, so that I need not worry about this any further?

2. From the ref-link you provided above, it looks like I have no option short of moulding out an OS image (with WinCE Platform Builder) with Sinhala language support.
What may be the next steps in building support for a custom language (Sinhala) with WinCE PB?
Assuming I work full-time on it, maybe even burning some midnight oil :-), what does the effort/time estimate look like? I do have extensive C/C++ experience with me, and have already downloaded the evaluation copy of WinCE PB (my employer is hoping to buy it soon).
However, this is my first stint with WinCE PB, and am expecting a considerable learning curve.

Thanks and regards,
Udara

# Michael S. Kaplan on 8 May 2006 10:20 AM:

Well, the shaping engine support will not yet be there in the CE Uniscribe, so there really is no way to make it all happen just yet.

I do not know the timeline for updates on the mobile platform, but if I find out, I'll post about it....

# UdaraG on 9 May 2006 4:11 AM:

Thanks very much, Michael!

referenced by

2015/07/08 Fixing up broken and semi broken blog posts, as needed?

2007/10/14 The subtle difference between ශ්රී ලංකාව and ශ්‍රීලංකාව

2006/09/25 Why don't all the half forms sort right?

2006/02/17 What do you get when you combine a base character with a buttload of diacritics?

2006/02/16 Ignoring a problem does not make it go away....
