Every character has a story #10: U+0478/U+0479 (CYRILLIC LETTER UK)

by Michael S. Kaplan, published on 2005/05/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/21/420666.aspx


It all started so innocently, with John Jenkins of Apple asking a simple question on the Unicode List....

Is it kosher for U+0478 CYRILLIC CAPITAL LETTER UK (Ѹ) and U+0479 CYRILLIC SMALL LETTER UK (ѹ) to have basically the same shapes as U+0222 LATIN CAPITAL LETTER OU (Ȣ) and U+0223 LATIN SMALL LETTER OU (ȣ), respectively, instead of the digraph form shown in The Book™?

Michael Everson then helped explain what was going on with this letter in Unicode:

Yes, it is. In discussions with the Slavicist experts (Ralph Cleminson and David Birnbaum), we determined that there was something funny going on in Cyrillic. Both the string o + u (looking like <oy>) and the uk (looking like 8) are used for the same sound in Church Slavic (at least). Of course, just as the LATIN OU derived from a ligature of Greek OMICRON and UPSILON, so does CYRILLIC UK derive from a ligature of O and U (which themselves are Greek OMICRON and UPSILON).

Since Slavicists find that they would like to have a character to represent the 8-shaped [u], and that they can already represent the oy-shaped [u] with a string of characters, and lo! there is a handy-dandy character called UK right there in the standard, we determined that it was extremely sensible to draw UK as an 8, and let Slavicists encode o + u when they want the digraph.

This will eventually be part of the formal recommendations to the UTC regarding changing the glyphs in the Book, but there hasn't been time for that hitherto.

Accordingly, when designing a font containing UK if it has the 8-shape now rather than the oy-shape, it serves Slavicists better than otherwise. And since glyphs are informative....

Joe Becker made a suggestion about John's original question:

Yes, a nice example where an abstract character UK has two written forms: a digraph and a ligature!  In the first draft of the first Cyrillic chart (Unicode 1.0, p. 204) I tried showing both renderings via the split-screen vertical bar, but that just made everything illegible, so I gave up showing the ligature variant.

But Michael Everson disagreed with Joe Becker's recollection:

Well, no, it's a sound, [u] that has two representations, one with o + y and the other with an original ligature of that pair, 8. (And that original ligature was original in GREEK, not newly-formed in Cyrillic.)

Birnbaum's thinking has changed since 1991. The correct way to represent the digraph is with two characters O and U; the correct way to represent the ligature is with the UKs.

The digraph has no character status any more than "ch" in Spanish or "ou" in Greek does.

At this point, Ken Whistler weighed in with some of the practical considerations that matter here, beyond the linguistic ones that Michael had just brought up:

Historically, I suspect you are correct, but you cannot ignore the encoding facts on the ground, as it were. This has been shown as a digraph in both standards now for over a decade, and appears in widespread fonts that way as a result, too. I think we have little choice now but to acknowledge that both the digraphic glyph and the stacked ligature glyph are acceptable renderings of U+0478/U+0479 at this point.

That does not prevent anyone from accepting your position on textual representation, and writing out the sequence <o, u> for the digraph in OCS text, as one would do for Greek.

Fortunately (or unfortunately, depending on your point of view) there has never been a compatibility decomposition for U+0478/U+0479, so there is no formal question of normalization involved in this matter. Hence, the equivalence between U+0479 and <U+043E, U+0443> is more in the nature of alternate spellings, and people who want to search Old Cyrillic text will just need to be aware of such equivalences, just as they would have to be for other alternate spellings or variations in orthography.
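For anyone who wants to see that point in practice, here is a minimal Python sketch using the standard library's unicodedata module. It checks that U+0478/U+0479 carry no decomposition mapping, so normalization leaves them alone, and then treats UK and the <o, u> sequence as alternate spellings for search purposes. The helper name fold_uk and the choice to fold toward the two-letter spelling are my own assumptions for illustration, not anything the Unicode Standard prescribes, and the sample word is only illustrative.

```python
import unicodedata

UK_CAPITAL = "\u0478"  # CYRILLIC CAPITAL LETTER UK
UK_SMALL = "\u0479"    # CYRILLIC SMALL LETTER UK
O_U = "\u043e\u0443"   # CYRILLIC SMALL LETTER O + CYRILLIC SMALL LETTER U


def has_decomposition(ch: str) -> bool:
    """True if the character has any decomposition mapping in the UCD."""
    return unicodedata.decomposition(ch) != ""


def fold_uk(text: str) -> str:
    """Hypothetical search key: case-fold, then spell every UK as o + u.

    This is an application-level spelling equivalence, not a Unicode
    normalization form.
    """
    return text.casefold().replace(UK_SMALL, O_U)


if __name__ == "__main__":
    for ch in (UK_CAPITAL, UK_SMALL):
        print(f"U+{ord(ch):04X}", "has decomposition:", has_decomposition(ch))

    # Illustrative pair: the same word spelled with UK and with o + u.
    with_uk = "\u0434\u0479\u0445\u044a"
    with_digraph = "\u0434\u043e\u0443\u0445\u044a"
    print("equal under NFKC:",
          unicodedata.normalize("NFKC", with_uk)
          == unicodedata.normalize("NFKC", with_digraph))
    print("equal under fold_uk:", fold_uk(with_uk) == fold_uk(with_digraph))
```

Folding toward the two-character spelling (rather than toward UK) is the safer direction for a search key, since an o followed by a u in running text is not always the digraph; a real search tool would likely need language-specific rules on top of this.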

Michael Everson did not entirely agree with this assessment:

Unicode 4.1 should (after this gets written up and sent to UTC) use the 8-glyph and note that the digraph glyph is not preferred for this character.

>That does not prevent anyone...

Anyone like, say, Slavicists?

They are more or less in complementary distribution, at least in normalized tsarist OCS. I have a book here printed in 1861, which uses Ou- or ou- initially, but -8- medially and finally. The capital 8-UK also has a (rather) unique angular capital distinct from the round small letter, at least in the Slavonic font used.

Asmus Freytag agreed with Michael Everson, and had some further input:

The other fortunate thing is that the character U+0478 is classified as 'Uppercase', not 'Titlecase', even though the glyph for the Oy digraph looks like it would be titlecase. In fact, I'd be curious to know whether an all-caps string would really use an Oy glyph instead of an OY digraph.

If my suspicion (that the current glyph for U+0478 is actually a titlecase glyph) proves out, changing the glyph to the 8 form would be even more motivated.
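The classification Asmus mentions is easy to check for yourself. The sketch below, again assuming only Python's standard unicodedata module, prints the general category and case mapping for the two UK characters, with U+01C5 (a genuine titlecase letter) thrown in for contrast. The describe helper is a name I made up for the example.

```python
import unicodedata


def describe(ch: str) -> tuple:
    """Return (code point, character name, general category) for one character."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))


if __name__ == "__main__":
    # Lu = uppercase letter, Ll = lowercase letter, Lt = titlecase letter.
    for ch in ("\u0478", "\u0479", "\u01c5"):
        print(*describe(ch))

    # The capital and small UK map to each other under case conversion.
    print("\u0478".lower() == "\u0479", "\u0479".upper() == "\u0478")
```

None of this says anything about what the glyph looks like, of course; the properties describe the character, and the font decides between the digraph and the 8 shape.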

And then Michael Everson did have the last word. :-)

I imagine it would not, because otherwise it would look pretty awful; that is, I think that it would be OUSTNE not OySTNE. My sourcebook (a saint's life) does have all-caps strings in it (which is why I have seen the capital medial/final 8-UK) but happens not to show the UK in initial position.

The glyph for the capital UK should look like a vertical fusion of IZHITSA and O -- or at least it does in my book.

Now earlier on, Ken Whistler did have a thought about the matter.

BTW, for anyone who is keeping track, this thread is another decent candidate for the "Every character has a story" archives.

Which is why this post exists now. Because every character does indeed have a story!

 

This post brought to you by "Ѹ" and "ѹ" (U+0478 and U+0479, a.k.a. CYRILLIC CAPITAL LETTER UK and CYRILLIC SMALL LETTER UK)


# Larry Osterman [MSFT] on 21 May 2005 2:14 AM:

A theme I've noticed in these "Every character has a story" posts:

None of the writers seem to be native speakers of the language (based solely on the unscientific basis of looking at their apparently western names). But all of them seem to be able to expound authoritatively about the experience that the native speakers of the language expect.

How is it that the participants in the Unicode Consortium's forums can be so confident that they know what a native speaker of a language expects to find?

I know that Mark Crispin and I have had some interesting discussions about this exact issue - Mark used to feel rather strongly about the values in some of the codepoints for the Chinese and Japanese character sets that were collapsed because he was told that native speakers were quite incensed about the collapse. And then he actually went and asked the native speakers (he speaks Japanese fluently (and I believe Mandarin as well)) and found some rather different opinions.

# Michael S. Kaplan on 21 May 2005 2:44 AM:

That is actually a very interesting question -- I think I will cover that in a blog posting soon. :-)

# CornedBee on 21 May 2005 3:34 AM:

The Attic Greek I learned about in school has an Epsilon (looks like a small inverted 3) and an Ypsilon (Looks like a Y), but no Upsilon. What am I missing?

Also, my Mozilla, running on a Linux platform, displays quite different glyphs for the UK and the OU. Notably, the UK looks like "Oy" and the OU looks like "8".

# Michael S. Kaplan on 21 May 2005 4:05 AM:

Look at http://www.fileformat.info/info/unicode/char/03a5/index.htm for an UPSILON and http://www.fileformat.info/info/unicode/char/0395/index.htm for an EPSILON.

Maybe it is just transliteration differences. :-)

For the different appearances, it is entirely dependent on the fonts you have and what the browser picks....

