You think *your* characters have stories? Let me tell you a character story....

by Michael S. Kaplan, published on 2011/10/20 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/10/20/10228020.aspx

There are three so-called "Yiddish digraphs" in Unicode:
U+05F0   wawayim
U+05F1   waw yod
U+05F2   yodayim

What is specifically Yiddish about these digraphs?
They can be used in the same way in Hebrew.
But this isn't done. Why not?

Why should Yiddish be written with special digraphs
but Hebrew with sequences of two letters?

The Unicode Standard says:
| ... to distinguish the digraph double vav from an occurrence
| of a consonantal vav followed by a vocalic vav.

By that reasoning you would need an English digraph "sh"
to distinguish "sh" in "***" from "s-h" in ***hole. ;-)

Lots of people jumped in and the consensus was along the lines of "I'm not sure, but I think it's legacy".

Thankfully the guy who should be writing the "Every Character Has a Story" book jumped in to add some surety:

On 10/19/2011 12:08 PM, Mark E. Shoulson wrote:
> I think the issue here is (probably) a matter of legacy encodings,
> though someone else would need to confirm that.

O.k., as self-appointed historian of the standard, I guess I need to be
the one to answer that. ;-)

The Yiddish digraphs were added to the basic set of Hebrew letters for
Unicode 1.0 on behalf of the Research Libraries Group, for compatibility
with their existing usage on the Research Libraries Information Network
(RLIN).

Digging very deep in the old mailbox, I located email from Joan Aliprand
of the Research Libraries Group, dating from July 11, 1991 confirming
this, and noting that "I pushed very hard for inclusion of the Yiddish
digraphs tsvey vovn and tsvey yudn."

It is my recollection that the 3rd digraph was added during the discussion
of the addition of those two.

At any rate, there is your legacy encoding source for these. Whether or not
the digraphs are used in *current* Yiddish data (or would even be
recommended for such use) is not relevant to reasons for the original
inclusion.

Could it have something to do with the fact that the letters are narrow [and therefore more likely to be useful as a digraph on fixed-pitch typewriters or computer screens, particularly if this is a common letter pair]? Compare "ij" [yes, this is a "letter" in Dutch], but contrast "ch" [which is/was a "letter" in Spanish but never got encoded as a single code point].

Once the visual distinction is there (it's not clear if there would be a visual distinction in handwriting or in proportional fonts - I don't know anything about the language or the script so I can't say), it's easy to invent a semantic distinction: Something like 'This combination of letters means a double consonant some places and not others, and it looks silly to have it grouped together when it's not the double consonant, so only use it there.'

And once there's a perceived semantic distinction, that leads to it getting encoded as a 'real' character [defined here as not having a compatibility decomposition, which it would if it were just a question of legacy character sets; compare "fi"] in Unicode.

Incidentally, all three digraphs are present in the MARC-8 (RLIN) character set: lcweb2.loc.gov/.../32.html

Code point	Character	Name
U+05F0	װ	HEBREW LIGATURE YIDDISH DOUBLE VAV
U+05F1	ױ	HEBREW LIGATURE YIDDISH VAV YOD
U+05F2	ײ	HEBREW LIGATURE YIDDISH DOUBLE YOD
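
The remark above about compatibility decompositions can be checked against the Unicode Character Database. The following is a minimal sketch in Python using the standard unicodedata module; the helper name describe and the choice of U+FB01 (the "fi" ligature) as the comparison character are illustrative assumptions, not part of the original discussion. It suggests that the three Yiddish digraphs carry no decomposition mapping, while "fi" does, so NFKC normalization leaves the digraphs alone rather than folding them into two-letter sequences.

```python
import unicodedata

# The three Yiddish digraphs from the table above, plus the Latin "fi"
# ligature as a comparison character (an illustrative choice).
YIDDISH_DIGRAPHS = ["\u05F0", "\u05F1", "\u05F2"]
FI_LIGATURE = "\uFB01"


def describe(ch):
    """Collect a few Unicode Character Database properties for one character."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        # Empty string when the character has no decomposition mapping.
        "decomposition": unicodedata.decomposition(ch),
        # NFKC applies compatibility decompositions, then recomposes.
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


for ch in YIDDISH_DIGRAPHS + [FI_LIGATURE]:
    info = describe(ch)
    print(info["code_point"], info["name"], repr(info["decomposition"]))
```

One practical consequence: a search for the two-letter sequence vav vav (U+05D5 U+05D5) will not match text that uses U+05F0, even after NFKC normalization, so any such folding has to be done explicitly by the application.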