by Michael S. Kaplan, published on 2005/05/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/21/420666.aspx
It all started so innocently, with John Jenkins of Apple asking a simple question on the Unicode List....
Is it kosher for U+0478 CYRILLIC CAPITAL LETTER UK (Ѹ) and U+0479 CYRILLIC SMALL LETTER UK (ѹ) to have basically the same shapes as U+0222 LATIN CAPITAL LETTER OU (Ȣ) and U+0223 LATIN SMALL LETTER OU (ȣ), respectively, instead of the digraph form shown in The Book™?
Michael Everson then helped explain what was going on with this letter in Unicode:
Yes, it is. In discussions with the Slavicist experts (Ralph Cleminson and David Birnbaum), we determined that there was something funny going on in Cyrillic. Both the string o + u (looking like <oy>) and the uk (looking like 8) are used for the same sound in Church Slavic (at least). Of course, just as the LATIN OU derived from a ligature of Greek OMICRON and UPSILON, so does CYRILLIC UK derive from a ligature of O and U (which themselves are Greek OMICRON and UPSILON).
Since Slavicists find that they would like to have a character to represent the 8-shaped [u], and that they can already represent the oy-shaped [u] with a string of characters, and lo! there is a handy-dandy character called UK right there in the standard, we determined that it was extremely sensible to draw UK as an 8, and let Slavicists encode o + u when they want the digraph.
This will eventually be part of the formal recommendations to the UTC regarding changing the glyphs in the Book, but there hasn't been time for that hitherto.
Accordingly, when designing a font containing UK if it has the 8-shape now rather than the oy-shape, it serves Slavicists better than otherwise. And since glyphs are informative....
Joe Becker made a suggestion about John's original question:
Yes, a nice example where an abstract character UK has two written forms: a digraph and a ligature! In the first draft of the first Cyrillic chart (Unicode 1.0, p. 204) I tried showing both renderings via the split-screen vertical bar, but that just made everything illegible, so I gave up showing the ligature variant.
But Michael Everson disagreed with Joe Becker's recollection:
Well, no, it's a sound, [u] that has two representations, one with o + y and the other with an original ligature of that pair, 8. (And that original ligature was original in GREEK, not newly-formed in Cyrillic.)
Birnbaum's thinking has changed since 1991. The correct way to represent the digraph is with two characters O and U; the correct way to represent the ligature is with the UKs.
The digraph has no character status any more than "ch" in Spanish or "ou" in Greek does.
At this point, Ken Whistler decided to weigh in with some of the less theoretical aspects that matter here, beyond the linguistic considerations that Michael had just brought up:
Historically, I suspect you are correct, but you cannot ignore the encoding facts on the ground, as it were. This has been shown as a digraph in both standards now for over a decade, and appears in widespread fonts that way as a result, too. I think we have little choice now but to acknowledge that both the digraphic glyph and the stacked ligature glyph are acceptable renderings of U+0478/U+0479 at this point.
That does not prevent anyone from accepting your position on textual representation, and writing out the sequence <o, u> for the digraph in OCS text, as one would do for Greek.
Fortunately (or unfortunately, depending on your point of view) there has never been a compatibility decomposition for U+0478/U+0479, so there is no formal question of normalization involved in this matter. Hence, the equivalence between U+0479 and <U+043E, U+0443> is more in the nature of alternate spellings, and people who want to search Old Cyrillic text will just need to be aware of such equivalences, just as they would have to be for other alternate spellings or variations in orthography.
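As an aside, Ken's point about the missing decomposition is easy to check, and the "alternate spellings" idea can be sketched as a small search key. This is only a rough illustration using Python's standard `unicodedata` module; the `search_key` helper is a hypothetical name, and folding UK to <o, u> is an assumption about what a searcher might want, not anything the standard defines.

```python
import unicodedata

UK_CAPITAL = "\u0478"  # CYRILLIC CAPITAL LETTER UK
UK_SMALL = "\u0479"    # CYRILLIC SMALL LETTER UK
O_U_DIGRAPH = "\u043e\u0443"  # CYRILLIC SMALL LETTER O + CYRILLIC SMALL LETTER U

# Neither letter has a decomposition mapping, so no normalization
# form will turn UK into <o, u> on your behalf.
for ch in (UK_CAPITAL, UK_SMALL):
    print(unicodedata.name(ch), repr(unicodedata.decomposition(ch)))
    assert unicodedata.normalize("NFKD", ch) == ch


def search_key(text: str) -> str:
    """Hypothetical helper: treat UK and the <o, u> digraph as one spelling.

    Case is folded first (which maps U+0478 to U+0479), then the small UK
    is rewritten as the two-letter sequence.
    """
    folded = unicodedata.normalize("NFKC", text).casefold()
    return folded.replace(UK_SMALL, O_U_DIGRAPH)


def same_spelling(a: str, b: str) -> bool:
    """Compare two strings while ignoring the UK vs. digraph difference."""
    return search_key(a) == search_key(b)
```

With a key like this, `same_spelling("\u0479\u043c", "\u043e\u0443\u043c")` is true even though the two strings are not equivalent under any normalization form. The cost is that every real <o, u> sequence is also treated as a possible UK, which is the usual trade-off when folding alternate spellings.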
Michael Everson did not entirely agree with this assessment:
Unicode 4.1 should (after this gets written up and sent to UTC) use the 8-glyph and note that the digraph glyph is not preferred for this character.
>That does not prevent anyone...
Anyone like, say, Slavicists?
They are more or less in complementary distribution, at least in normalized tsarist OCS. I have a book here printed in 1861, which uses Ou- or ou- initially, but -8- medially and finally. The capital 8-UK also has a (rather) unique angular capital distinct from the round small letter, at least in the Slavonic font used.
Asmus Freytag agreed with Michael Everson, and had some further input:
The other fortunate thing is that the character U+0478 is classified as 'Uppercase', not 'Titlecase', even though the glyph for the Oy digraph looks like it would be titlecase. In fact, I'd be curious to know whether an all-caps string would really use an Oy glyph instead of an OY digraph.
If my suspicion (that the current glyph for U+0478 is actually a titlecase glyph) proves out, changing the glyph to the 8 form would be even more motivated.
And then Michael Everson did have the last word. :-)
I imagine it would not, because otherwise it would look pretty awful; that is, I think that it would be OUSTNE not OySTNE. My sourcebook (a saint's life) does have all-caps strings in it (which is why I have seen the capital medial/final 8-UK) but happens not to show the UK in initial position.
The glyph for the capital UK should look like a vertical fusion of IZHITSA and O -- or at least it does in my book.
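Asmus Freytag's observation that U+0478 is classified as uppercase rather than titlecase can be confirmed from the character properties. Here is a small sketch, again assuming Python's standard `unicodedata` module; the Latin U+01C5 is included only as a contrasting example of a true titlecase letter and is not part of the original thread.

```python
import unicodedata

capital_uk = "\u0478"  # CYRILLIC CAPITAL LETTER UK
small_uk = "\u0479"    # CYRILLIC SMALL LETTER UK
latin_dz = "\u01c5"    # LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON

# General category: "Lu" is uppercase, "Ll" lowercase, "Lt" titlecase.
categories = {
    "capital_uk": unicodedata.category(capital_uk),
    "small_uk": unicodedata.category(small_uk),
    "latin_dz": unicodedata.category(latin_dz),
}
print(categories)

# The two UK letters are a plain upper/lower case pair.
print(capital_uk.lower() == small_uk, small_uk.upper() == capital_uk)
```

So whatever the chart glyph looks like, the properties treat the capital UK as an ordinary uppercase letter, which is consistent with the argument that the Oy-shaped glyph is the odd one out.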
Now earlier on, Ken Whistler did have a thought about the matter:
BTW, for anyone who is keeping track, this thread is another decent candidate for the "Every character has a story" archives.
Which is why this post exists now. Because every character does indeed have a story!
This post brought to you by "Ѹ" and "ѹ" (U+0478 and U+0479, a.k.a. CYRILLIC CAPITAL LETTER UK and CYRILLIC SMALL LETTER UK)