Every character has a story #12: U+2071 (SUPERSCRIPT LATIN SMALL LETTER I)

by Michael S. Kaplan, published on 2005/07/21 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/21/441206.aspx


This entire post below was authored by Ken Whistler and posted to the Unicode List at 7:37pm on July 20, 2005. I sit and watch Ken with awe and wonder at times like this. :-)

(Asmus Freytag inspired this post when he stated: We could create a new series UCN (Unicode Character Notes) that are numbered by code point and each address a single character. Two hours and ten minutes later, Ken had written the following text.)

Hmmmm...

UCN #2071

By Ken Whistler, character historian

U+2071 SUPERSCRIPT LATIN SMALL LETTER I

This character, while it might at first seem mundane and ordinary, has a colorful and amazing history of its own.

It is one of only two *letters* to be encoded among the original block of superscripts and subscripts -- sharing this honor with U+207F SUPERSCRIPT LATIN SMALL LETTER N -- but its route into the superscripts and subscripts block was entirely distinct from the little superscript n. (See UCN #207F.) Unlike superscript n, which gained its location by virtue of its association with the venerable Code Page 437 of IBM PC fame, superscript i had no such code page association, and was much later to the scene, having a Unicode derived age of 3.2, instead of 1.0.

Furthermore, unlike many another new character, U+2071 is nearly unique in that it went into a code point that had its own complex history, *before* U+2071 was actually encoded.

"Whatever could this mean?" you might say, and one could well expect confusion over such a concept as this, but here is how the matter stands. Careful examination of the superscript and subscripts block and its history will demonstrate that superscript zero is encoded as U+2070 SUPERSCRIPT DIGIT ZERO (see UCN #2070) and that superscript one is encoded as U+2074 SUPERSCRIPT DIGIT FOUR  (see UCN #2074) -- the association of the digit value with the last digit of the code point is not random or by happenstance, by the way. However, the expected extrapolation from this pattern would be that U+2071 would be SUPERSCRIPT DIGIT ONE. It is not, of course, because that particular character had the rare luck to be included in ISO 8859-1, whereby it gained first-chart character status as U+00B9 SUPERSCRIPT DIGIT ONE. (See UCN #00B9 for the full story on that "one".) As a result of this unique status, the code point \U2071 was the very first instance of an occasional device seen elsewhere throughout the standard: the systematic gap blind cross reference. As early as the publication of the Founding Book (The Unicode Standard, Version 1.0), the reserved code point at 0x2071 is shown with the now famous original convention: x (superscript digit one --> 00B9)

This pattern gapping and blind cross-referencing was the occasion of considerable discussion, and has resulted in much confusion down through the years about the standard. And \U2071 was the very *first* code point to use this convention, so can be seen as the archetypal instance of this phenomenon. Amazingly, the first code point using this convention referred to a character which itself denoted one!
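The digit pattern and its gaps can be inspected directly. What follows is only a minimal sketch using Python's standard unicodedata module; the helper name describe_superscript_row is made up for illustration. The output depends on the Unicode version bundled with the interpreter, and current data files name these characters without the word DIGIT (for example, SUPERSCRIPT ZERO).

```python
import unicodedata


def describe_superscript_row(first=0x2070, last=0x2079):
    """Return (code point, name or None, digit value or None) for each code point."""
    rows = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        name = unicodedata.name(ch, None)    # None for a reserved code point
        digit = unicodedata.digit(ch, None)  # None when there is no digit value
        rows.append((cp, name, digit))
    return rows


for cp, name, digit in describe_superscript_row():
    label = name if name is not None else "<reserved>"
    print(f"U+{cp:04X}  {label}  digit={digit}")

# The three digits missing from the row live in the Latin-1 range instead.
for cp in (0x00B9, 0x00B2, 0x00B3):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}  digit={unicodedata.digit(chr(cp))}")
```

Wherever a digit value is reported in the U+2070 row, it should equal the last hex digit of the code point, which is the pattern the gaps were meant to preserve.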

But of course that is not the end of the story of U+2071. Unlike the code point \U2072, which to this day continues to maintain its pattern gap blind cross reference, although in the modern formulation: --> 00B2 ² superscript two (see UCN #2072 for details), U+2071 now actually has an encoded character which supersedes the earlier blind cross reference. While not unique in that status, U+2071 is among a very small, but highly august class of code points to be able to make this claim. (See UCN #0600 for a similar story with its own, uniquely Middle-Eastern flavor.)

Nor does the tale end there, of course. There was a deep and impassioned argument about the proper emplacement of SUPERSCRIPT LATIN SMALL LETTER I, once it became clear that the importunings of the mathematical community could no longer be ignored and that all mathematical symbols, no matter how obscure, should be given their due in the standard. Now, superscript i was hardly obscure, of course -- it is commonly seen in mathematical treatises, but the usual assumption had been that superscript forms, whether of numbers, digits, or other symbols, should simply be represented as styled variants of existing characters. Superscript i, however, escaped that generalization by appearing in SGML entity lists, whose crossmapping imperative pushed it over into the realm of repertoire required for character encoding.

Once that consensus had been reached, however, the committees were still at sixes and sevens, as it were, about the placement of superscript i. One faction fiercely argued for colonizing a hitherto untouched column and encoding it as U+2090. (See UCN #2090, which is a relatively short Unicode Character Note, but which remarks on this brief encounter with fame for that code point, rendering it a much more lively read than the downright dull UCN #2091.)

Another faction argued that the committees should observe the sanctity of prescriptions of pattern gapping and follow the precedent of "Adding Things at the End of the List", without creating *new* unexplained gaps, and so argued for U+208F. (See UCN #208F.) That faction also argued that this placement would serendipitously place superscript i in immediate chart proximity to its venerable antecedent, U+207F SUPERSCRIPT LATIN SMALL LETTER N. However, their argument was fatally weakened by the inability to convince anyone of the felicity of adding a *super*script letter to the end of a list of *sub*script digits and punctuation.

The third faction argued for what amounted to no less than a shocking act of character integration -- breaking the colorblind cross reference barrier by inserting a superscript letter into what had formerly been a segregated area, reserved for digits only, *despite* the fact that the only unaccounted for digit that could move into the neighborhood was already living on the good side of town, as it were, at U+00B9. After a long argument, this faction prevailed. And effectively, U+2071 SUPERSCRIPT LATIN SMALL LETTER I became the Jackie Robinson of the Unicode Standard, forever shattering the segregationist exclusionary practices that had prevented such characters from moving into code points that had established blind cross-references.

In a further curious coincidence, U+2071 SUPERSCRIPT LATIN SMALL LETTER I bears more than a passing glyphic resemblance, at first glance at least, to U+00B9 SUPERSCRIPT DIGIT ONE. This means that newcomers to the standard often do a double-take when they view the superscripts and subscripts chart, as the character that they *expect* to see after U+2070 is a superscript one, and unless they look carefully, they might be fooled into thinking that U+2071 actually *is* a superscript one. In this respect, U+2071 SUPERSCRIPT LATIN SMALL LETTER I has an additional unique status in the entire standard, of serving as a visual ghost of a vanished blind cross-reference to a character that appears almost the same as itself. No other character has this status, even among the small group of other characters that have crossed the cross reference barrier to appear in those formerly reserved code points. Some editors have argued that this ghostly and implicit graphic cross-reference should be finally acknowledged fully and demystified a bit by adding what would now be an explicit cross-reference to U+00B9. But how that might turn out, of course, is a matter only of current speculation and a topic for a future version of this Unicode Character Note.

Another thing worth mentioning about U+2071 SUPERSCRIPT LATIN SMALL LETTER I has a bearing on the fabulous collection of stories related to the Phonetic Extensions block, U+1D00..U+1D7F. In particular, U+2071 SUPERSCRIPT LATIN SMALL LETTER I is used not only in mathematical contexts, but also appears as a modifier letter in the Uralic Phonetic Alphabet. It would have been proposed as a character among that collection, except that the mathematicians got their proposal in and processed first. This accounts for why there is a U+1D4D MODIFIER LETTER SMALL G (see UCN #1D4D) and a U+1D4F MODIFIER LETTER SMALL K (see UCN #1D4F), both associated with the UPA collection of Unicode derived age 4.0, as well as the much more venerable U+02B0 MODIFIER LETTER SMALL H (see UCN #02B0) and U+02B2 MODIFIER LETTER SMALL J (see UCN #02B2), both associated with IPA and other phonetic collections of Unicode derived age 1.0, but astoundingly, there is no MODIFIER LETTER SMALL I in the standard! Thus U+2071 SUPERSCRIPT LATIN SMALL LETTER I shares with U+207F SUPERSCRIPT LATIN SMALL LETTER N the status of being the only Latin modifier letters in the standard named for their decomposition tag, rather than their modifier letter status. Strange but true!
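The naming mismatch described above can be checked mechanically. The following is a rough sketch with Python's standard unicodedata module; the helper names superscript_letter_summary and name_exists are invented for this example, and the results reflect whatever Unicode version the interpreter ships with.

```python
import unicodedata

# The six letters mentioned above: h, j, g, k, i, n in raised form.
CODE_POINTS = (0x02B0, 0x02B2, 0x1D4D, 0x1D4F, 0x2071, 0x207F)


def superscript_letter_summary(code_points=CODE_POINTS):
    """Return (code point, name, decomposition tag, base letter) for each entry."""
    summary = []
    for cp in code_points:
        ch = chr(cp)
        name = unicodedata.name(ch)
        # A compatibility decomposition looks like "<super> 0069".
        tag, _, base_hex = unicodedata.decomposition(ch).partition(" ")
        summary.append((cp, name, tag, chr(int(base_hex, 16))))
    return summary


def name_exists(name):
    """True if some character carries exactly this name."""
    try:
        unicodedata.lookup(name)
    except KeyError:
        return False
    return True


for cp, name, tag, base in superscript_letter_summary():
    print(f"U+{cp:04X}  {name}  {tag} {base}")

print(name_exists("MODIFIER LETTER SMALL I"))
```

All six share the same decomposition tag, yet only four of the names begin with MODIFIER LETTER, so the name alone does not reveal which characters belong together.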

So not only is U+2071 the Jackie Robinson of Unicode characters -- it also stands as one of the prime exemplars of the principle that you cannot derive all character properties from inspection of character names, nor assume that all characters in related groups of characters will have names constructed by identical patterns.

U+2071 SUPERSCRIPT LATIN SMALL LETTER I should also figure prominently in any list of hard-to-find characters, precisely because its history confounds so many expectations regarding where, exactly, one should search for it if casually perusing the charts or attempting to access it for input.

Vital Statistics:

2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;0;L;<super> 0069;;;;N;;;;;

For further details, consult the Unicode Character Database.
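The record above uses the semicolon-separated layout of the UnicodeData.txt file. As a minimal sketch, and not any official parser, the fields can be pulled apart as below; the field labels in FIELDS and the function name parse_unicodedata_record are chosen here for readability and are assumptions of this example. The record string is copied from this note as written, so it shows the general category as listed here rather than whatever later versions of the data may say.

```python
# Labels for the 15 semicolon-separated fields (illustrative names).
FIELDS = (
    "code_point", "name", "general_category", "combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase", "simple_lowercase", "simple_titlecase",
)


def parse_unicodedata_record(line):
    """Split one UnicodeData-style record into a dict keyed by FIELDS."""
    values = line.strip().split(";")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


RECORD = "2071;SUPERSCRIPT LATIN SMALL LETTER I;Ll;0;L;<super> 0069;;;;N;;;;;"

record = parse_unicodedata_record(RECORD)
tag, _, mapping = record["decomposition"].partition(" ")
base = chr(int(mapping, 16))           # the letter the superscript decomposes to
decimal = int(record["code_point"], 16)  # the same code point in decimal

print(record["name"], record["general_category"], tag, base, decimal)
```

The empty decimal, digit, and numeric fields confirm that, whatever it may look like in the chart, this character carries no numeric value.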

SUPERSCRIPT LATIN SMALL LETTER I is also identified as IBM GCGID LI011000, where it is named in the documentation as "i Small superscript".


# silverpie on 21 Jul 2005 8:28 AM:

One problem: official Unicode notes are always numbered in decimal, so unless that rule is also being overruled, it would have to be UCN #8305.

;)

# Michael S. Kaplan on 21 Jul 2005 5:18 PM:

Well, if they ever wanted to do Unicode character notes, it would make sense to have them by codepoint, and the convention for those is hex....

Though I will keep "Every character has a story" posts here in the original format. :-)

# silverpie on 22 Jul 2005 11:11 AM:

Don't ya just love it when rules collide like that? ::)

# alan mcf on 24 Jul 2005 4:30 PM:

For us international readers, Ou est "Jackie Robinson"?

Alan

# A. Skrobov on 26 Jul 2005 7:20 AM:

"and that superscript one is encoded as U+2074 SUPERSCRIPT DIGIT FOUR"
Uh-oh... superscript one is encoded as SUPERSCRIPT FOUR?

# Michael S. Kaplan on 26 Jul 2005 9:59 AM:

Well, yes -- but this is one of those word choice ambiguities -- the difference between "that SUPERSCRIPT ONE" and "that superscript one" is that the latter refers to a specific named character while the former refers to one that exists outside the context of the phrase (usually mentioned earlier?)....

# Michael Cysouw on 4 Oct 2011 2:55 AM:

Please add this story to

     http://decodeunicode.org/

That website was founded a few years ago to collect such information (it is run by typographers).

# Michael S. Kaplan on 4 Oct 2011 8:03 AM:

I leave that chore for those [typographer] historians.... hopefully they cite their sources!

