More on the Unicode 5.0 beta

by Michael S. Kaplan, published on 2005/12/18 15:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/12/18/505145.aspx


Last week I mentioned that the Unicode 5.0 beta had started (thanks to Mike G.1 for the link!).

I thought it might be nice to look at the coolest summary site of upcoming characters that is available, the Unicode Pipeline Table.

Of course, the Pipeline Table has one important disadvantage when compared to a diff of files like UnicodeData.txt in the Unicode Character Database: it does not directly say which additions are planned for Unicode 5.0 and which ones are not.
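For comparison, here is a minimal sketch of what such a diff could look like. It assumes you have the text of two versions of UnicodeData.txt in hand; the two sample excerpts and the function names are made up for illustration, and the sketch ignores the First/Last range entries that the real file uses for large blocks.

```python
# Illustrative excerpts in UnicodeData.txt format (semicolon-separated fields,
# code point first, character name second). Not complete files.
OLD_SAMPLE = """\
097D;DEVANAGARI LETTER GLOTTAL STOP;Lo;0;L;;;;;N;;;;;
"""

NEW_SAMPLE = """\
097B;DEVANAGARI LETTER GGA;Lo;0;L;;;;;N;;;;;
097C;DEVANAGARI LETTER JJA;Lo;0;L;;;;;N;;;;;
097D;DEVANAGARI LETTER GLOTTAL STOP;Lo;0;L;;;;;N;;;;;
097E;DEVANAGARI LETTER DDDA;Lo;0;L;;;;;N;;;;;
097F;DEVANAGARI LETTER BBA;Lo;0;L;;;;;N;;;;;
"""


def parse_unicode_data(text):
    """Map each code point (int) to its name, one entry per non-empty line."""
    names = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        names[int(fields[0], 16)] = fields[1]
    return names


def added_characters(old_text, new_text):
    """Return (code point, name) pairs present in new_text but not in old_text."""
    old = parse_unicode_data(old_text)
    new = parse_unicode_data(new_text)
    return [(cp, new[cp]) for cp in sorted(new) if cp not in old]


for cp, name in added_characters(OLD_SAMPLE, NEW_SAMPLE):
    print(f"{cp:04X} {name}")
```

A diff like this answers "what is new in this version" directly, which is exactly the question the Pipeline Table leaves open.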

Luckily, the Pipeline Table lists the status of characters in both the UTC (which only talks about the date the characters were accepted) and ISO (which talks about the current stage of the proposal). The stages are as described here:

  1. Initial Proposal: A proposal document has been submitted to WG2 and is included in the formal document record. The proposal may not yet have been taken up on the agenda of a WG2 meeting, or may have been included pro forma on an agenda, where it is simply raised FYI to the attention of national bodies, inviting them to review and provide feedback on the proposal.
  2. Provisional acceptance by WG2: A proposal has been technically reviewed by WG2, with feedback from one or more national bodies and liaison organizations such as the Unicode Consortium, and a consensus has been reached within WG2 that a character, group of characters, or a repertoire for a script should be encoded. However, a final resolution specifying code positions and character names may not yet have been taken or approval may be postponed pending further feedback from national bodies regarding one or more technical issues in the proposal.
  3. Final acceptance by WG2 - in Bucket: A formal resolution has been taken by WG2, specifying code positions and character names for addition to the standard, but without necessarily determining which amendment they should be included in for formal balloting. This status is referred to as "being in the bucket," a holding category waiting for an appropriate amendment. This stage is transitional and typically is only used when WG2 has a meeting that does not authorize a formal amendment to 10646.
  4. Hold for Ballot in WG2: A formal resolution has been taken by WG2, specifying exactly which of any characters approved and "in the bucket" are to be balloted in an amendment to 10646. Note that in current practice, most character approvals move directly from Stage 1 to Stage 4, for efficiency, unless there is some technical issue with them or unless WG2 decides that it needs to wait before starting to progress a new amendment.
  5. SC2 Ballot: This stage comprises one or more formal ballots by the member bodies of the parent committee, SC2. During each ballot, member bodies and liaison organizations (such as the Unicode Consortium) review the collection of characters and scripts in the ballot document and provide technical and editorial feedback. After each ballot is completed, WG2 meets and resolves the comments. This stage can take a year to two years, depending on the schedule of ballots and of WG2 and SC2 meetings. Technical changes to the approved characters may still occur as part of this process, including the addition of characters that were not originally on the ballot.
    Stage 5 begins as soon as SC2 has approved the WG2 resolution to ballot some collection of characters and the Secretariat has issued the formal amendment ballot. In the first ballot phase the draft is known as a PDAM (Proposed Draft Amendment). After resolution of PDAM ballot comments by WG2, an FPDAM (Final Proposed Draft Amendment) is issued for ballot, followed finally by the resolution of FPDAM ballot comments by WG2.
  6. JTC1 Ballot: The parent committee, SC2, has approved a resolution to submit the DAM (Draft Amendment) for approval by the national bodies at the JTC1 level. During this stage, which is generally pro forma, there can be no technical changes to the ballot text. This is a two month ballot, and the issuance of the ballot is the point at which Unicode implementers can feel secure in implementing the corresponding, synchronized repertoire in the Unicode Standard. When the DAM ballot is approved, JTC1 considers the amendment fully approved, awaiting publication.
  7. ITTF Publication: An approved amendment to 10646 is submitted to ITTF for formal publication. An amendment to a standard (or the standard itself) is not actually considered an International Standard until ITTF has completed publication. Depending on the complexity of the standard and any editing issues which may turn up, this may take several months to more than a year from the completion of the DAM ballot itself.

As you can see, most of the characters in the Pipeline Table are in either Stage 4 or Stage 6.

In almost all cases, it is only the characters in Stage 6 which will be part of Unicode 5.0. There are a few exceptions this time, though:

097B..097C (2 characters): DEVANAGARI LETTER GGA, DEVANAGARI LETTER JJA
    UTC: 2005-May-13 Accepted, 2005-Nov-04 Accelerated into Unicode 5.0
    ISO: 2005-Sep-15 Stage 4

097E..097F (2 characters): DEVANAGARI LETTER DDDA, DEVANAGARI LETTER BBA
    UTC: 2005-May-13 Accepted, 2005-Nov-04 Accelerated into Unicode 5.0
    ISO: 2005-Sep-15 Stage 4

These four letters, added for the Sindhi language when written with the Devanagari script, have definitely been approved for inclusion in both standards, but there was a worry on the UTC side that member companies (such as Microsoft) that wished to provide implementations to support the Sindhi language would be forced to wait way too long if the characters were only going to be added to a post-5.0 release. So, after consideration at the November 2005 UTC meeting, it was decided (since the characters themselves were not controversial and the need was important) to include them in Unicode 5.0.
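If you want to check whether a given environment already knows about these four code points, something like the following sketch would do. It uses Python's standard unicodedata module, which is my choice for illustration and not anything the UTC or the proposal prescribes; the describe helper is a made-up name, and the result depends on which Unicode version your Python build ships with.

```python
import unicodedata

# The four Sindhi implosive letters discussed above.
SINDHI_IMPLOSIVES = [0x097B, 0x097C, 0x097E, 0x097F]


def describe(code_point):
    """Return 'U+XXXX NAME', or a placeholder if the database has no name."""
    ch = chr(code_point)
    name = unicodedata.name(ch, "<unnamed in this Unicode database>")
    return f"U+{code_point:04X} {name}"


print("Unicode database version:", unicodedata.unidata_version)
for cp in SINDHI_IMPLOSIVES:
    print(describe(cp))
```

On any database at 5.0 or later the names should match the ones listed in the Pipeline Table; on an older one you would get the placeholder instead.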

To see the proposal to include these characters, you can look to Michael Everson's site and his N2934: Proposal to add four characters for Sindhi to the BMP of the UCS. In that document, Michael explains a bit about the history of these characters, which was known long before the decision to encode them:

These four Devanagari letters were not encoded in previous versions of the standard because it was thought that they bore a diacritical mark which could be unified with U+0952 DEVANAGARI STRESS SIGN ANUDATTA. That character, however, is not identical with the mark which distinguishes Sindhi implosive consonants; the unification was false. The two graphs behave quite differently in the Devanagari writing system: the underbar in Sindhi is often (and I suggest, best) fused with the stem of the letter, and the vowel signs U and UU are drawn beneath it (see figures 3, 4, and 5). The ANUDATTA, on the other hand, is a stress accent applied to the entire syllable, and accordingly is placed below, not above, U and UU (see figure 6). No “combining implosive” diacritic is proposed here for the four Sindhi letters, for simplicity in encoding.

Anyway, the proof Michael provided was considered convincing in both the UTC and in ISO, and the characters are now set to be included in Unicode 5.0. And the people who want to represent their Sindhi text with the Devanagari script have only to wait for the implementers to add them to fonts (or of course to not bother waiting and to add the letters themselves!).

 

This post brought to you by "ॻ", "ॼ", "ॾ", and "ॿ" (097b, 097c, 097e, and 097f -- which will be the Sindhi implosives)


