Not everyone likes Unicode

by Michael S. Kaplan, published on 2005/07/26 15:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/26/443394.aspx


It is true -- not everyone likes Unicode. This includes a guy by the handle tyomitch who was trying to post a long comment to a post here and hit some kind of length limitation in Community Server. I did not want to appear unwilling to post negative comments about Unicode so I'll put it here. :-)

Can't post this comment to http://blogs.msdn.com/michkap/archive/2005/05/21/420666.aspx, but maybe you care anyway...

When I first came across OCS, I was stumped by the very limited support for it in Unicode. Then I discussed it with a person whose job is typesetting books in OCS, and he told me that in his opinion, no one at all in the whole Unicode group cares about _usable_ OCS support, since OCS users aren't a business target group who can 'sponsor' the necessary procedures for registering characters. Sadly, it looks like he's right: what the Unicode group has got to this point is some 'pro forma' support, just so they can't say "we're not supporting OCS characters".

Actually, among the people who make the real OCS typesetting software (that is, fonts, on-screen keyboards, Word macros etc.) no one uses Unicode (at least I haven't found anyone). They all use some custom 8-bit charsets, which don't even agree on what is a character and what isn't, complicating translation from one encoding to another even further. That is, there are dozens of totally incompatible OCS charsets in use, and there is Unicode, which no one even considers usable. Does this satisfy that Unicode group?

Now, to be done with the preamble :-) The person mentioned in the second sentence agreed that UK has five variant shapes: that is, small/capital ligatures, and small/title/capital digraph. Here are the reasons which I can come up with _against_ considering the ligature suitable for U+0478/U+0479:

1) the digraph isn't two letters stacked together, because the first part is the OCS letter ON (identical to U+041E/U+043E) and the second part isn't a valid OCS letter at all. The second part is just a glyph that has no meaning outside the context of this digraph. Separating the digraph into two characters is as crazy as separating U+2116 "Numero Sign" into a combination of U+004E and some character for underlined O, solely on the basis that U+2116 looks like two letters together.

So, if digraph is to be included in Unicode, it can only be represented with a single character.

2) as based on previous Unicode revisions, those fonts that claim to support Unicode OCS actually have the digraph located at U+0478/U+0479. Those fonts include even the stock "Microsoft Sans Serif" of WinXP. Changing the character assignment would break all the existing documents which happen to use this letter (assuming there are any). Also, the Unicode fonts that have 'old-style' glyphs have the ligature at U+0423/U+0443, and the titlecase digraph somewhere in the unused slots of U+0400 block.

3) as already stated in the comments of your post, those shapes are position-based variant forms and not separate letters. Maybe they don't even deserve any more than three characters (for the three cases)? Unlike the 'final sigma' case, now we don't have a legacy charset to keep compatibility with, and compatibility with previous Unicode revisions has already been broken. Maybe it's worth going to the very end and _removing_ the characters U+0478/U+0479?

What I personally would be completely happy with is leaving the two digraph cases at U+0478/U+0479 where they are, allocating a character for the titlecase, and making fonts responsible for displaying U+0423/U+0443 as either the old-style ligature or the modern Y-shaped letter. Is this too simple? Why should there be a (based solely on glyph shape) identity between Cyrillic Letter U and (not ever used alone) second part of the digraph of Cyrillic Letter UK?

Not that I'm expecting a detailed reply, but now I've expressed my opinion... Can this message be at least appended to the comments at http://blogs.msdn.com/michkap/archive/2005/05/21/420666.aspx?

The post is correct about the need to not change fonts around since it would change the way documents have been written. And I am in cases like this pointing to the real dark underbelly of Unicode. But it is very likely (bordering on certainly) true that if the Old Church Slavonic experts are not using Unicode then they are piling up future problems for themselves. If there are missing characters they should be added, and there is certainly no desire on the part of Unicode to not support a plain text requirement....

I would be interested in knowing what communications were rebuffed or glossed over -- and who was doing the apparent glossing. It is certainly not any kind of official or unofficial Unicode policy to do such a thing....
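
For anyone who wants to poke at the characters being argued about, here is a minimal sketch using Python's standard `unicodedata` module. It is only an illustration, not anything the Unicode folks or the commenter provided: the helper name `describe` and the constant names are made up for this example, and the results reflect whatever version of the Unicode Character Database ships with your Python. It looks up the names of U+0478/U+0479, checks that they are a simple case pair with no decomposition, and compares that with U+2116, which does carry a compatibility decomposition.

```python
import unicodedata

# Hypothetical names chosen for this sketch; the code points come from the post.
UK_CAPITAL = "\u0478"   # CYRILLIC CAPITAL LETTER UK
UK_SMALL = "\u0479"     # CYRILLIC SMALL LETTER UK
NUMERO = "\u2116"       # NUMERO SIGN, the comparison used in the comment


def describe(ch):
    """Return (code point, character name, NFKC-normalized form) for one character."""
    return (
        f"U+{ord(ch):04X}",
        unicodedata.name(ch),
        unicodedata.normalize("NFKC", ch),
    )


for ch in (UK_CAPITAL, UK_SMALL, NUMERO):
    print(describe(ch))

# The two UK code points map to each other under simple case conversion.
print(UK_CAPITAL.lower() == UK_SMALL)
print(UK_SMALL.upper() == UK_CAPITAL)

# An empty string means the character has no decomposition mapping at all;
# NUMERO SIGN, by contrast, has a <compat> mapping.
print(repr(unicodedata.decomposition(UK_CAPITAL)))
print(repr(unicodedata.decomposition(NUMERO)))
```

The point of the comparison: in the character database, UK stays a single character under every normalization form, while the Numero Sign is only folded to two letters by the compatibility forms (NFKC/NFKD), not the canonical ones.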

 

This post brought to you by "Ѹ" (U+0478, a.k.a. CYRILLIC CAPITAL LETTER UK)


# Michael Dunn_ on 26 Jul 2005 11:12 PM:

FYI: The links to articles in your RSS are pointing to https://blogs.msdn.com:443 which makes them unreachable from the aggregator.

# Michael S. Kaplan on 27 Jul 2005 2:32 AM:

Hey Mike -- I think that is a temporary problem on the server -- obviously the links are not like that, thankfully. :-)

# Michael Dunn_ on 27 Jul 2005 6:26 PM:

All better now. :) Yours was the only msdn blog where I was seeing the https URLs, so that's why I posted here.

# Michael S. Kaplan on 27 Jul 2005 8:29 PM:

Of course now my subscription links are broken. :-(
