The subtle difference between ශ්රී ලංකාව and ශ්‍රීලංකාව

by Michael S. Kaplan, published on 2007/10/14 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/14/5448243.aspx


(The two words might look alike if you don't have the latest and greatest set to render on your machine!) 

Self-described semi-regular reader Georges was curious about whether I knew what was going on with the name of Sri Lanka as described in Bill Poser's Language Log post of not quite a year ago entitled Map of South Asia.

This is the detailed version of the map from Wikimedia:

This is the very issue I mentioned and gave examples of in Every character has a story #19: U+200c and U+200d (ZERO WIDTH [NON] JOINER).

The difference between ශ්රී ලංකාව and ශ්‍රීලංකාව, by which I mean (swapping them below to confuse you a bit if you can't see them!) the difference between

ශ්‍රී

and

ශ්රී

(you will only see the difference if your machine can also distinguish) is decided by a ZWJ in the appropriate spot.

The difference between

U+0dc1 U+0dca U+200d U+0dbb U+0dd3

and

U+0dc1 U+0dca U+0dbb U+0dd3

which is particularly annoying given what happens to ZWJ in IDN and other cases....
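For anyone whose machine renders the two forms identically, a minimal Python sketch can make the difference visible at the code point level. The helper names `code_points` and `strip_mapped_to_nothing` are made up for illustration. The last check assumes the IDN remark refers to the IDNA2003 Nameprep profile, where RFC 3454 table B.1 maps ZWJ to nothing; it performs only that one mapping step, not a full Nameprep or ToASCII run.

```python
import stringprep
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# The two sequences exactly as listed in the post.
with_zwj = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"
without_zwj = "\u0dc1\u0dca\u0dbb\u0dd3"


def code_points(text):
    """Return the text as a list of 'U+XXXX' strings (hypothetical helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


def strip_mapped_to_nothing(text):
    """Apply only the RFC 3454 table B.1 step ("commonly mapped to nothing").

    This is an illustrative fragment of Nameprep, not the whole profile.
    """
    return "".join(ch for ch in text if not stringprep.in_table_b1(ch))


print(code_points(with_zwj))
print(code_points(without_zwj))

# The strings differ, and the ZWJ is the whole difference.
print(with_zwj == without_zwj)                   # False
print(with_zwj.replace(ZWJ, "") == without_zwj)  # True

# Canonical normalization keeps the ZWJ, so the two stay distinct.
nfc_a = unicodedata.normalize("NFC", with_zwj)
nfc_b = unicodedata.normalize("NFC", without_zwj)
print(nfc_a == nfc_b)                            # False

# After the B.1 mapping step the distinction is gone.
print(strip_mapped_to_nothing(with_zwj) == without_zwj)  # True
```

In other words, ordinary string comparison and normalization preserve the distinction, while any pipeline that applies the "mapped to nothing" table silently collapses the two spellings into one.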

The trouble with the printed map? Hard to say -- it could pre-date the convention entirely or it could have been put together on a machine or by a program that didn't support the right rendering.

It is really the trouble in this mid-point between when things first get into Unicode and when things settle down enough for implementations to pick up the support....

For most of the people who I have talked to in country (programmers and such usually!), the distinction was much as Bill put it:

...which makes sense but is not the way the word is actually written...

But there are others who find this difference to be quite troubling and are clearly much more bothered by it, especially as one moves out of the realm of the programmers. It appears to be a real semantic difference to many people.

 

This post brought to you by U+200c and U+200d (a.k.a. ZERO WIDTH NON-JOINER and ZERO WIDTH JOINER)


Chris Becke on 16 Oct 2007 3:36 AM:

The first row of little boxes has a separation between the fourth and fifth box?

Maybe it is time to upgrade to Vista.

Michael S. Kaplan on 16 Oct 2007 7:35 AM:

LOL!

Well, there is an actual space (U+0020) in there too, but the rest of the characters are comp letely iden tical.... :-)


referenced by

2012/03/06 What do සිංහල and O'zbekcha have in common?
