"Does my buttload look too big for that stream?" (from the Tales of the "That's what she said!" files)

by Michael S. Kaplan, published on 2010/04/28 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/04/28/10002896.aspx

The whole thing really takes me back.

It takes me back over four years to one of my favorite blogs (What do you get when you combine a base character with a buttload of diacritics?).

It was just a few months back, in "Most combining characters in a Unicode glyph/character/whatever", that Shawn was talking about a kind of related issue:

Recently on the Unicode list someone asked basically what the biggest number of combining characters could happen in a sequence. It's as many as someone wants to use, though the normalization UTS15 adds a limit, and the font rendering problem gets weird.

I had some people ask me about that blog; I guess you could say it was those people who took me back, specifically. :-)

Now the font rendering issue Shawn mentioned is something I already talked about, and that "buttload" blog even showed how fonts exist that can kinda handle even extreme cases, such as the aforementioned "buttload" scenario -- even as others cannot.

Now there is no UTS (Unicode Technical Standard) 15; Shawn was actually referring to UAX (Unicode Standard Annex) 15 (Unicode Normalization Forms), specifically section 21 (Stream-Safe Text Format), which states:

There are certain protocols that would benefit from using normalization, but that have implementation constraints. For example, a protocol may require buffered serialization, in which only a portion of a string may be available at a given time. Consider the extreme case of a string containing a digit 2 followed by 10,000 umlauts followed by one dot-below, then a digit 3. As part of normalization, the dot-below at the end must be reordered to immediately after the digit 2, which means that 10,003 characters need to be considered before the result can be output.

Such extremely long sequences of combining marks are not illegal, even though for all practical purposes they are not meaningful. However, the possibility of encountering such sequences forces a conformant, serializing implementation to provide large buffer capacity or to provide a special exception mechanism just for such degenerate cases. The Stream-Safe Text Format specification addresses this situation.

D7. Stream-Safe Text Format: A Unicode string is said to be in Stream-Safe Text Format if it would not contain any sequences of non-starters longer than 30 characters in length when normalized to NFKD.

    * Such a string can be normalized in buffered serialization with a buffer size of 32 characters, which would require no more than 128 bytes in any Unicode Encoding Form.
    * Incorrect buffer handling can introduce subtle errors in the results. Any buffered implementation should be carefully checked against the normalization test data.
    * The value of 30 is chosen to be significantly beyond what is required for any linguistic or technical usage. While it would have been feasible to chose a smaller number, this value provides a very wide margin, yet is well within the buffer size limits of practical implementations.
    * NFKD was chosen for the definition because it produces the potentially longest sequences of non-starters from the same text.
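
To make the reordering in that extreme case concrete, here is a minimal Python sketch (not part of the annex or of Shawn's post) using the standard unicodedata module. It assumes U+0308 COMBINING DIAERESIS for the "umlaut" and U+0323 COMBINING DOT BELOW for the "dot-below"; the helper name build_extreme is made up for illustration.

```python
import unicodedata

UMLAUT = "\u0308"     # COMBINING DIAERESIS, canonical combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220


def build_extreme(n_umlauts: int) -> str:
    """Hypothetical helper: digit 2, n umlauts, one dot-below, digit 3."""
    return "2" + UMLAUT * n_umlauts + DOT_BELOW + "3"


text = build_extreme(10_000)
normalized = unicodedata.normalize("NFD", text)

# Before normalization the dot-below is the second-to-last character.
print(text.index(DOT_BELOW))
# After canonical reordering it sits right behind the digit 2, because
# class 220 sorts ahead of class 230.
print(normalized.index(DOT_BELOW))
# The length is unchanged; only the order of the marks differs.
print(len(text), len(normalized))
```

A normalizer cannot emit anything after the digit 2 until it has seen the next starter (the digit 3), which is exactly why an unbounded run of marks means an unbounded buffer.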

Okay, so for this one scenario (when the stream-safe text format is needed), the unbounded case is limited to 30 of those combining characters.
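
For anyone who wants to test their own buttload, here is a rough Python sketch of a checker based on the D7 wording quoted above: normalize to NFKD, then look for the longest run of non-starters (characters whose canonical combining class is not zero). The function names and the MAX_NON_STARTERS constant are my own for illustration, and this only checks the condition; it is not the full stream-safe conversion process that the annex also describes.

```python
import unicodedata

MAX_NON_STARTERS = 30  # the limit from definition D7 quoted above


def longest_non_starter_run(text: str) -> int:
    """Length of the longest run of non-starters in NFKD(text)."""
    nfkd = unicodedata.normalize("NFKD", text)
    longest = current = 0
    for ch in nfkd:
        if unicodedata.combining(ch) != 0:  # non-zero class: a non-starter
            current += 1
            longest = max(longest, current)
        else:                               # a starter resets the run
            current = 0
    return longest


def is_stream_safe(text: str) -> bool:
    """True if no run of non-starters exceeds the D7 limit."""
    return longest_non_starter_run(text) <= MAX_NON_STARTERS


acute = "\u0301"  # COMBINING ACUTE ACCENT
print(is_stream_safe("a" + acute * 30))  # exactly at the limit
print(is_stream_safe("a" + acute * 31))  # one mark too many
```

The check runs on the NFKD form rather than the raw text because, as the annex notes, NFKD produces the potentially longest sequences of non-starters from the same text.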

The word "buttload" in my original blog was meant to imply a very large number without giving specific bounds, though some bounds are obviously implied.

Just the other day Gweneth was in a meeting I was in and she was amused by my "indefinite adjective" use of buttload in such cases.

This 30 character limit is obviously shorter than the example I used in "What do you get when you combine a base character with a buttload of diacritics?", which would mean that "my" buttload is not stream-safe; it is simply too big for that stream. :-)

I will try not to take it too personally.

My butt reportedly doesn't look too big in my pants, so I think I'm okay with this one blog anomaly, which does not cross over into my social life.

Shawn also mentioned the "user character" that was represented by the largest well-known grapheme cluster in Unicode, which is:

U+0f67 U+0f90 U+0fb5 U+0fa8 U+0fb3 U+0fba U+0fbc U+0fbb U+0f82

also known as:

TIBETAN LETTER HA +
TIBETAN SUBJOINED LETTER KA +
TIBETAN SUBJOINED LETTER SSA +
TIBETAN SUBJOINED LETTER MA +
TIBETAN SUBJOINED LETTER LA +
TIBETAN SUBJOINED LETTER FIXED-FORM WA +
TIBETAN SUBJOINED LETTER FIXED-FORM RA +
TIBETAN SUBJOINED LETTER FIXED-FORM YA +
TIBETAN SIGN NYI ZLA NAA DA
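
If you would rather not look up nine names by hand, a short Python sketch like the following reproduces that list from the code points, using the standard unicodedata module. The CLUSTER constant is just my name for the sequence above; the general category column is an extra that is not in the list above.

```python
import unicodedata

# The nine code points listed above, in order.
CLUSTER = "\u0f67\u0f90\u0fb5\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f82"

# Character names as recorded in the Unicode Character Database.
names = [unicodedata.name(ch) for ch in CLUSTER]

for ch, name in zip(CLUSTER, names):
    # Show code point, general category (Lo = letter, Mn = nonspacing mark)
    # and the character name.
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {name}")
```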

also known as:

HAKṢHMALAWARAYAṀ

also known as:

ཧྐྵྨླྺྼྻྂ

which is kind of a useless ink smudge for me; perhaps you can see it better.

Maybe we can turn up the font size a scosh:

ཧྐྵྨླྺྼྻྂ

Better? Looks great here!

It really is a beautiful script. And that bit of text is certified stream safe!

Now the interesting bit about this one (which you'll only see if you have a font like Microsoft Himalaya) can be noted if you put it alongside some text:

ABCDEཧྐྵྨླྺྼྻྂedcbaཧྐྵྨླྺྼྻྂabcdeཧྐྵྨླྺྼྻྂEDCBA

(Look here if you can't see it in your browser but want some idea of what the hell I'm talking about!)

Clearly many of these subjoined beasties are underneath the main bit of the stack, well below the baseline -- enough that this particular well-known grapheme cluster isn't even using the space that a full uppercase letter could use.

It seems to me like there oughtta be things that could be done to make Tibetan more usable on Windows, but the full scope of what would be required momentarily escapes me. It would probably take an effort akin to the one I described for Khmer in "Want to hear about a cool new typographic convention? Khmer, and I'll tell you about it...", which would really require forces outside of us to see the work done. Forces inside Tibet, for instance....

Note that the current implementation of collation in Windows does not allow a compression (i.e. a UCA contraction) of more than eight UTF-16 code units, which means that the collation of HAKṢHMALAWARAYAṀ, at nine code units, is probably not going to be exactly right.
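
The arithmetic behind that is easy to check. Here is a tiny Python sketch that counts UTF-16 code units for the cluster; the constant WINDOWS_CONTRACTION_LIMIT is a name I made up to hold the figure of eight mentioned above, not anything exposed by a Windows API.

```python
# The nine-code-point Tibetan cluster discussed above.
CLUSTER = "\u0f67\u0f90\u0fb5\u0fa8\u0fb3\u0fba\u0fbc\u0fbb\u0f82"

# Hypothetical constant: the eight-code-unit limit described in the text.
WINDOWS_CONTRACTION_LIMIT = 8


def utf16_code_units(text: str) -> int:
    """Number of 16-bit code units needed to encode text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


units = utf16_code_units(CLUSTER)
# Every character here is in the Basic Multilingual Plane, so each one
# takes exactly one code unit and no surrogate pairs are involved.
print(units, units > WINDOWS_CONTRACTION_LIMIT)
```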

So that one character's butt is a bit large for the jeans one might try to fit it in, on Windows!
