Every character has a story #38: U+1e98, aka a ring atop a 'w' isn't ideal for a proposal (marriage or otherwise!)…

by Michael S. Kaplan, published on 2012/04/17 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/04/17/10294366.aspx

So the other day, David Starner asked over on The Unicode List:

At Wiktionary, we're looking at ẘ (U+1E98) and we can't figure out where it came from. It's from Unicode 1.1, which makes it hard to look up discussion on adding it, and the characters around it don't seem to give clues to its origin.

The character in question?



You know, the one that could have also been

Also known as U+0077 U+030a.
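
For anyone who wants to check that equivalence, here is a minimal sketch using Python's standard `unicodedata` module. The variable names are purely illustrative; the only assumption is that the installed Unicode data includes U+1E98, which has been in the standard since Unicode 1.1.

```python
import unicodedata

precomposed = "\u1e98"   # the single code point U+1E98
sequence = "w\u030a"     # U+0077 followed by U+030A COMBINING RING ABOVE

# NFD splits the precomposed letter into base letter + combining mark.
decomposed = unicodedata.normalize("NFD", precomposed)
# NFC goes the other way and recombines the sequence.
composed = unicodedata.normalize("NFC", sequence)

print([f"U+{ord(c):04X}" for c in decomposed])
print([f"U+{ord(c):04X}" for c in composed])
print(unicodedata.name(precomposed))
```

The two spellings are canonically equivalent, so which one ends up in your data depends only on which normalization form was applied.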


Now David sent his mail at 7:15pm on a Sunday, so we were guaranteed to get some fun responses!

Rick McGowan gave the "Republican anti-illegal immigration" response:

Good catch. It's obviously another stowaway...
Just throw it in the brig until we can get around to deporting it.

And Shriramana Sharma decided to have a little fun with it, too:

ẘoẘ, hoẘ many more stoẘaẘays do ẘe haẘe?

I remember this Asterix story (Asterix and the Vikings?) where there were lots of rings above rolling around! ;-)

Thankfully, Asmus Freytag was around to provide a serious answer.

Though it was still Sunday night. What was he thinking? :-)

The 1E00 and 1F00 blocks were populated, in Unicode 1.1 by rejects from Unicode 1.0 that were re-admitted as part of the merger with ISO/IEC 10646. If you have anyone with access to the early (paper only) meeting documents of WG2, you might, just might, find a source for them.

Most of these characters were "rejected" because they were unnecessary - they are easily encoded as combining sequences and there were no legacy character sets that needed them precomposed for 1:1 roundtrip compatibility. WG2 and Unicode (before the merger) had different standards on what compatibility characters were required.

(There were some gaps in these blocks after the initial population of characters were added in Unicode 1.1. These were later filled with more solid candidates, so the "age" of each character is an important clue here).

Stowaway is an apt term - because the characters did not add anything new (they could already be encoded as combining sequences) and because normalization would remove them from the data stream, nobody tried very hard to fine-tune the set and as a result risk the failure of the merger. Ideal conditions for "stowaways" to enter hiding in the crowd.

The next morning, Andreas Prilop provided some more info:

U+1E96 has the note "Semitic transliteration". Indeed U+1E96 to U+1E9A are used for transliterating Arabic according to ISO 233.

"w with ring" is "waw with sukun".
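
The five code points in that range can be listed with a short sketch like the one below, again using only Python's standard `unicodedata` module. The `transliteration_range` name is just an illustration, and the exact output depends on the Unicode data shipped with your Python.

```python
import unicodedata

# U+1E96 through U+1E9A, the range mentioned above.
transliteration_range = [chr(cp) for cp in range(0x1E96, 0x1E9B)]

for ch in transliteration_range:
    # decomposition() returns the mapping from UnicodeData.txt as a string
    # of hex code points, with a <tag> prefix for compatibility mappings.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "|", unicodedata.decomposition(ch))
```

Every one of them carries a decomposition mapping, which is the "could already be encoded as combining sequences" point in concrete form.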

And Arno Schmitt built on that response:

this must be the answer.

U+1E97 is used for "Semitic transliteration" as well, and U+1E99 is in Arabic very similar to U+1E98, but *any* consonant occurs with sukun, so why did they not encode "b with ring", "d with ring", "d with dot below and ring above" and so on?

And seriously, I'd like to have "s with macron below" -- although I know there is no chance of getting it encoded. It is used for transcribing Arab dialects.

But Asmus Freytag had no choice but to dash the hopes of that officially.

Not Arno's hopes since he clearly knew, but the hopes of anyone else reading:

All of these combinations exist as combining sequences. There's no benefit in encoding them, on the contrary, adding them now would destabilize normalization and therefore they can't be added. Most of the ones in the 1E00 block wouldn't have been added except for the particular history of how Unicode 1.1 was arrived at.
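
The difference between a letter that made it in and one that did not shows up directly in normalization. The sketch below uses Python's standard `unicodedata` module with a hypothetical helper, `nfc_code_points`, and compares "h with line below" (encoded as U+1E96) against Arno's "s with macron below", assuming that U+0331 COMBINING MACRON BELOW is the mark he had in mind.

```python
import unicodedata

def nfc_code_points(text):
    """Return the code points of text after NFC normalization (illustrative helper)."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", text)]

# h + U+0331 has a precomposed form, so NFC collapses it to one code point.
h_line_below = nfc_code_points("h\u0331")
# s + U+0331 has no precomposed form, so NFC leaves the sequence alone.
s_macron_below = nfc_code_points("s\u0331")

print(h_line_below)
print(s_macron_below)
```

Text that is already normalized today contains the two-code-point spelling of the second letter, and the normalization stability guarantees mean that has to remain its normalized form.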

Karl Williamson might have given people some hope, responding to Arno's "s with macron below" dreams:

But, couldn't a named sequence be created for it?

And finally, mercifully, the cool uncle of Unicode responded to the early words of Asmus:

On 4/15/2012 10:04 PM, Asmus Freytag wrote:
> The 1E00 and 1F00 blocks were populated, in Unicode 1.1 by rejects
> from Unicode 1.0 that were re-admitted as part of the merger with
> ISO/IEC 10646. If you have anyone with access to the early (paper
> only) meeting documents of WG2, you might, just might, find a source
> for them.

Well, guess what -- I have access to someone with the relevant meeting documents. ;-)

The first key document is:

WG2 N754, Review of repertoire, by Masami Hasegawa, dated September 1991. (Mark Davis and I assisted Hasegawa-san in pulling together the lists in this document.)

That document lists *all* of the Latin composite letter collections that Hasegawa-san, then the editor of 10646, had to wrestle with, in order to come up with an acceptable draft for the 2nd DIS, after the failure of the first DIS vote and the determination by WG2 that a merger of repertoires was necessary to construct a DIS that could pass. (A lot of other architectural changes were necessary as well, but right now I'm focussing on the Latin repertoire issue.)

Section 1.1.2 of WG2 N754 reads:


1.1.2 Latin Composites, Collection #2A

Extra Latin composites, descending from DP1 of 10646. These are derived from a
variety of sources, and are intended to cover a number of languages and
transcriptional systems (e.g. various Indo-European and Semitic transcriptions).


There then follows a long list of composite characters that were in DP1 of 10646. WG2 N754 then goes on to identify which of those particular characters were supported by explicit national requirements in the ballot record. The remainder were winnowed down, using a list of exceptions, explicitly spelled out on page 8 of WG2 N754. What was left constituted the bulk of the composite Latin characters that were eventually included in the 2nd DIS in the range of 1E00..1E95, and which you see there still in the standard.

O.k., so far so good. But you may well ask, what about 1E96..1E9A, which includes the ẘ character? How did *those* get in?

Well, the pertinent document for that is WG2 N759, "Liaison Statements to JTC1/SC2/WG2 considering the Arabic part of ISO DIS 10646M", from the ECMA (European Computer Manufacturers Association) Arabic Task Group, dated October 1991. The relevant portion of that document is Appendix A (=ECMA ATG N213), "Tranliteration [sic] characters for Arabic characters and Hieroglyphs", authored by Alaa Ghoneim, who at the time was representing Egypt during the WG2 meetings. Alaa Ghoneim cites as sources ISO 233 Parts 1 and 2 (for Arabic) and the Egyptian Grammar by Gardiner for Latin transliteration of hieroglyphs.

Part II of that Appendix says: "The following [9] characters do not exist in 10646 and hence need to be added in plane 0", followed by 9 composite transliteration characters from one of those two sources -- not individually identified.

WG2 N759 was discussed at the Paris meeting of WG2 (October 7-11, 1991). The minutes from that meeting (WG2 N767) note:

"N759 ECMA Arabic TG Input and N746 Input from Egypt
1) 9 missing characters for transliteration
    ==> review all transliteration characters"

Hasegawa-san took that under advisement and determined that 3 of the transliteration characters in that list of 9 were in fact already in the draft of the DIS.

The remaining six are those which you now see in the range U+1E96..U+1E9A, including the ẘ in question.

No national body objected to the inclusion of those particular 6 in the voting on 10646 DIS 1.2, so they ended up published in the eventual 10646-1:1993 (and in Unicode 1.1).

And that, folks, is the origin of ẘ.

And every character still has a story.

Though clearly it takes more than a ring to give a relationship where it can be respected.

