Every character has a story #24: U+0308 (COMBINING DIAERESIS)

by Michael S. Kaplan, published on 2006/09/04 00:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2006/09/04/738263.aspx

I am reminded of a scene from the 1991 film The Doctor starring William Hurt, modified here to be a bit more linguistic than medical:

Linguist: Nancy, are my repeated vowels pronounced differently?
Nancy: No, doctor.
Linguist: That's funny, I always trema when you're near.

There are essentially two[1] different traditions for the meaning of two dots on top of a vowel:

Umlaut - Described in Wikipedia as a "...modification of a vowel which causes it to be pronounced more similarly to a vowel or semivowel in a following syllable."

Trema or Diaeresis - Described in Wikipedia as the "...division of two adjacent vowels as two syllables rather than as a diphthong."

Now in Unicode and ISO 10646, these two very different diacritical purposes are unified under a single character -- the diaeresis. Which is kind of ironic given that the meaning of 'diaeresis' tends to suggest a division rather than any sort of unification....

Ignoring that bit of irony in the naming decision, a unification does make sense since they really do look pretty much the same, and a disunification would be a huge target for spoofing (something we really do not need any more of, frankly!). Though to tell the truth, in quality typography the umlaut dots are usually a bit closer to the letter than the trema dots.

Back in 2003, the Deutsches Institut für Normung (DIN) sent a proposal to WG2 that stated:

Currently, a substantial amount of existing German data distinguishes between Umlaut and Trema. Both diacritics have a similar, but not necessarily identical representation, both have quite different properties e. g. with regards to sorting (cf. DIN 5007).

In particular, German library data is currently stored according to ISO 5426 "Extension of the Latin alphabet coded character set for bibliographic information interchange" which distinguishes between the two diacritics Umlaut (4/9) and Trema (4/8). However, in ISO/IEC JTC1/SC2 N3125 "Finalized Mapping between Characters of ISO 5426 and ISO/IEC 10646-1 (UCS)" both are mapped to the same UCS character, U0308. There is thus no standardized way to ensure roundtrip compatibility between the two standards.

For Germany and in particular for its national library (Deutsche Bibliothek) it is imperative for the integrity of German data that it be possible to maintain the distinction between Umlaut and Trema also in the UCS in a standardized way. Lack of ability to do so affects millions of bibliographic data records in the Deutsche Bibliothek alone (to be exact, 14 956 289 records as of October 2002) and about 110 million bibliographic data records in German and Austrian regional library networks.

In other words, they had a need to distinguish these two diacritics, which are actually not unified in a different ISO standard. Their initial proposal from document N2593:

We therefore request

a) the encoding of two new characters, LATIN VARIATION SELECTOR UMLAUT in position U0241 and LATIN VARIATION SELECTOR TREMA in position U0240 (the positions are suggestions only).

b) the insertion of the following text into informative Annex F "Alternate format characters" as F.2.6 "Latin selectors"

"LATIN VARIATION SELECTOR UMLAUT (U0241): Uniquely identifies the preceding character as using /being the Umlaut diacritic (cf. ISO 5426, code position 4/9)

LATIN VARIATION SELECTOR TREMA (U0240): Uniquely identifies the preceding character as using / being the Trema diacritic (cf. ISO 5426, code position 4/8)

In the absence of any variation selector, neither the character COMBINING DIAERESIS U0308 nor any of the Latin letters with diaeresis can be interpreted as representing uniquely the Umlaut or uniquely the Trema.

The LATIN VARIATION SELECTOR UMLAUT or the LATIN VARIATION SELECTOR TREMA should only be used directly following the Latin characters shown below:



Neither the LATIN VARIATION SELECTOR UMLAUT nor the LATIN VARIATION SELECTOR TREMA carry a defined meaning when they follow any other character."

c) Change in ISO/IEC JTC1 SC2 N3125 (= ISO/TC46/SC4 WG1), section 3 "Mapping of Characters" the table to:

4/8 Trema, Diaeresis 0308 0240
4/9 Umlaut 0308 0241Z

Unfortunately, Variation Selectors can only be used on base characters, not on combining characters. So while the scenario is valid, the solution DIN suggested is not. The UTC discussed possible solutions at length before producing the following recommendation instead:

While recognizing the drawbacks to all of the alternatives to encoding a new COMBINING UMLAUT character outlined in WG2 N2766, we believe that there is a workable alternative solution which has, to date, been overlooked. The solution consists, essentially, of using U+034F COMBINING GRAPHEME JOINER (CGJ), in its intended semantics in 10646/Unicode, to make the relevant sorting, searching, and data mapping distinctions required for umlaut versus tréma. In particular, the distinction we propose is:

<a U+0308> a umlaut
<a CGJ U+0308> a tréma

The sequences <a U+0308> and <a CGJ U+0308> are not canonically equivalent. This means that the distinction will not be normalized away on conversion in and out of bibliographic systems. This eases the interoperability problem. Both sequences will display as ä, as they should. Furthermore, the semantics of CGJ are such that it should impact only searching and sorting, for systems which have been tailored to distinguish it, while being ignored in other respects in interpretation.

The reason for treating the existing sequence <a U+0308> as representing the umlaut in German bibliographic systems, despite the name of U+0308 COMBINING DIAERESIS, is that this is the unmarked case, representing the vast majority of extant data. The marked form <a CGJ U+0308> should be utilized for the marked case in the data, namely the tréma, which is far, far less frequent in German bibliographic data. This minimizes the conversion and data rectification issues, and also guarantees that representations including CGJ will be uncommon in data converted out of the German bibliographic records.

The existence of separate representations for umlaut and for tréma, which are not canonically equivalent (and thus not neutralized by normalization processes in the data) enables German implementations which need to distinguish the two for searching and sorting, to systematically maintain weighting distinctions to do the right thing. <a U+0308> = <ä> can be treated as equivalent to <a, e> for sorting purposes, while the tréma <a CGJ U+0308> can be weighted as a secondary variant of <a> thus resulting in the desired behavior for such systems. Existing collations which do not distinguish tréma and umlaut in German data will continue to work exactly as they currently do, since in default collation tables CGJ is ignored in weighting.

We believe that this proposed solution has the correct mix of technical attributes to enable the German library networks to make the required distinction, to correctly convert existing ISO 5426 bibliographic records, and to implement the desired sorting and searching behavior for German data represented directly in 10646/Unicode.

At the same time, this solution does not introduce incompatibilities or non-interoperability issues for other existing implementations of 10646/Unicode which handle German data.
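
The claim in the recommendation that the two sequences are not canonically equivalent is easy to check. The following is a minimal sketch using Python's standard unicodedata module; the helper name survives_normalization is made up for illustration and is not part of any standard or library.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

umlaut = "a" + DIAERESIS       # unmarked sequence, read as umlaut
trema = "a" + CGJ + DIAERESIS  # marked sequence, read as tréma


def survives_normalization(first, second):
    """Return True if the two strings stay distinct in every normalization form."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return all(
        unicodedata.normalize(form, first) != unicodedata.normalize(form, second)
        for form in forms
    )


# The unmarked sequence composes to the precomposed letter U+00E4 ...
print(unicodedata.normalize("NFC", umlaut) == "\u00e4")
# ... while the CGJ sits between base and mark and blocks that composition.
print(unicodedata.normalize("NFC", trema) == trema)
print(survives_normalization(umlaut, trema))
```

Whether a given font or renderer draws both sequences identically is a separate question that this sketch does not test.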

It is again ironic that in the (rare) situation where an attempt to distinguish them is required, the default case is suggested as being the umlaut while the exceptional case is the diaeresis. :-)
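
The sorting behavior the recommendation describes (umlaut weighted like the base letter plus e, tréma weighted as a secondary variant of the base letter, CGJ ignored by default) can be sketched roughly as below. This is a toy two-level sort key under simplifying assumptions, not the Unicode Collation Algorithm and not DIN 5007; the function names tailored_key and default_key are invented for the example.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def tailored_key(text):
    """Toy key: umlaut sorts as base + 'e', tréma as a secondary variant of base."""
    decomposed = unicodedata.normalize("NFD", text).lower()
    primary, secondary = [], []
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        if ch == CGJ:
            i += 1  # a stray CGJ carries no weight
        elif decomposed.startswith(CGJ + DIAERESIS, i + 1):
            # <base CGJ U+0308>: tréma, same primary weight as the base letter
            primary.append(ch)
            secondary.append(1)
            i += 3
        elif decomposed.startswith(DIAERESIS, i + 1):
            # <base U+0308>: umlaut, expanded to base + e
            primary.extend([ch, "e"])
            secondary.extend([0, 0])
            i += 2
        else:
            primary.append(ch)
            secondary.append(0)
            i += 1
    return "".join(primary), tuple(secondary)


def default_key(text):
    """Toy untailored key: CGJ is dropped, so umlaut and tréma are not told apart."""
    return unicodedata.normalize("NFC", text.replace(CGJ, "")).lower()


words = ["af", "a" + CGJ + DIAERESIS, "\u00e4", "ad", "a"]
print(sorted(words, key=tailored_key))
```

With the tailored key, the tréma form sorts directly after plain a and the umlaut form lands between ad and af; with the default key the two forms compare equal, which matches the statement that existing collations keep working as before.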

The use of a combining diacritic is still (to this day) controversial in Unicode among people unfamiliar with the standard who are native speakers of languages like Swedish or Finnish and who are asked to think of these standalone letters as equivalent to a different letter plus a diacritic. The many people who would prefer all of the Indic languages to separately encode every instance of base letter plus virama have an analogous complaint.


1 - At this point I will take judicial notice of the phenomenon known as the Heavy Metal Umlaut, described ad nauseam here. It is in fact the Heavy Metal Umlaut that inspired Cathy's desire for a bumper sticker that would say Stop Indiscriminate Umlauting!, although I find that approach to be a tad reactionary. The importance to our culture of Spin̈al Tap and Blue Öyster Cult is undeniable, as is the need to avoid fear of the reaper and to turn the volume up to 11.....


This post brought to you by  ̈ (U+0308, a.k.a. COMBINING DIAERESIS)

# Henrik Holmegaard, technical writer, mag.scient.soc. on Wednesday, October 04, 2006 5:29 PM:

[EDITED to add some line breaks to the continuous paragraph submitted --michkap]

Dear Michael,

The issue lies in ANSI Z39.47:1985 which unified omljud (Swedish) / Umlaut (German) with diæresis or diaeresis (UK English) or dieresis (US English) and trema, cf document N2613. North Europe is home to three language families, Germanic, Slavonic, and Uralic. The Germanic languages have a richer palette of vowels than does English. Denmark, Iceland, and Norway lost Æ, æ in ISO-IEC 10646:1993 in which these were voted diphthong ligatures, that is, presentation forms.

A defect report was lodged and with a helping hand from Michael Everson a vote was returned in 1995 which restored our monophthong letters. ISO 8859-1 omits Œ, œ which are deemed diphthong ligatures, and yet French is an official language of the International Organization for Standardization. Consider this, and not for shelf-based titles but for disk-based titles, whether Apple PDD 1992, Adobe PDF 1993, or Microsoft XPS 2007, how many speakers of a Germanic language in North Europe would accept the Unicode non-unique character coding model for Latin and the OpenType composition model for mark attachment in a revival of Dr Martin Luther's Bible, or the Bible in Swedish which followed the same model? First, a US national standard for transliteration of writing systems in the Latin script other than English unifies what should not have been competently unified, then the US National Body upholds ANSI Z39.47 over ISO 5426 despite commitment to the contrary, and finally the US National Body rejects a request for a simulation by a combining character sequence of two and instead proposes a combining character sequence of three! Full phrase access for an XPS and OpenType revival of Luther?

In the late 1960s a storm raged in the North Sea and the UK papers carried on the covers the call-out, "The Continent isolated from England". The off-shore communities in the UK and the US tend to distinguish between the English language and European languages, meaning languages which depend on diacritics and ligatured digraphs. But the offshore communities forget that English is a minority in the European Union which is officially multilingual, and in which English is only one of twenty official writing systems. A European Union survey showed that aggregate language skills are highest in Scandinavia, The Netherlands and Luxembourg and lowest in the UK and Ireland, trailing behind Turkey which has the poorest aggregate language skills of the possible accession states.

The TrueType concept of a difference between a coded character and a composed glyph and the ColorSync concept of a difference between a coded color and a separated colorant both assume that semantic character entities and semantic color entities are correctly encoded, because no transformation file, whether by rotation or by allocation, can reproduce on the output side what cannot be defined on the input side. Providing the coding is correct, and providing the transformation is device independent in and device independent out such that the transformation inverts from coded characters to composed glyphs to coded characters, and from coded colors to separated colorants to coded characters, then modern data processing models are collaborative by definition, because they preserve the intentions authors committed intact for the remote audience.

This aspect calls for humility on the part of those who have the unilateral power to deny others the right to commit intentions completely and correctly, whether to coded characters or to coded colors is in that sense immaterial.

Kind regards,

Henrik Holmegaard teknisk skribent, mag.scient.soc.
Tølløsevej 69, 2700
Brønshøj København, Danmark

(PS a Papal letter permitted construction of the church up the hill in 1183. In the letter penned in the Curia at Rome three thousand kilometers distant by foot and by horse, the place name is Latinized as Brunshuga, an early form of ascii-ification perhaps -:)).

# Michael S. Kaplan on Thursday, October 05, 2006 12:18 AM:

I can't be the only person who sees the irony in the above (especially if you take original text that was one continuous paragraph with no line breaks!)....

# Henrik Holmegaard, technical writer, mag.scient.soc. on Thursday, October 05, 2006 2:29 AM:

Dear Michael,

In the coding of the e-mail, paragraph breaks were used.

 In the 1980s I worked for a chartered translation company that localized for software publishers. The company employed IBM System 80 (DisplayWriter) and IBM PS/2 (Word and WordPerfect) platforms. In interchanging text, one gets to know the difference between a hard and a soft break -:).

Sorry to hear the interface between the sending application and the receiving application went wrong, in whatever way.

And thank you for posting the above. Adobe Systems did not post that the Adobe Type 1 composition model for sixteen years has corrupted information interchange in the European Union.

Kind regards,

Henrik Holmegaard teknisk skribent,

# Henrik Holmegaard, technical writer, mag.scient.soc. on Friday, October 13, 2006 12:53 PM:

Dear Michael,

After reading the documents of Working Group 2, I still find myself confused with conceptualizing coding of Carl Linnæus, and any other titles which originated in the North European Enlightenment and were also translated into French as the more common of the working languages of the eighteenth century. Some of these titles found their way into the library of Thomas Jefferson, and hence into the Foundation Collection of the Library of Congress. <paragraph break>

Document N2753, ISO/IEC JTC1/SC2/WG2 Meeting 45, Minutes dated 26 December 2004, Section 9.6 discusses the request by DIN (Deutsches Institut für Normung) for definition of a code point for Combining Umlaut in order to permit public disunification, the rejection of the request for public disunification, and the proposal for a public disunification more complicated than by a code point for Combining Umlaut. <paragraph break>

Since a code point for Combining Umlaut would not normalize to a pre-composed code point, coding Umlaut would imply that author and audience accepted two coded characters for one grapheme. Since German has 100 million speakers, or the community of French speakers and English speakers combined, and since the six graphemes in question comprise between 3% and 5% of communication and composition, disunifying Umlaut and dieresis/diaeresis/diæresis would appear to be an unpalatable alternative.

Thus the meeting resolved that, "With reference to documents N2766 and N2819 on German Umlaut, WG2 recommends the use of Combining Grapheme Joiner (CGJ, 034F) + Combining Diaeresis (0308) to represent the Tréma character where such distinction is needed. WG2 instructs its editor to include an appropriate informative note with CGJ description in Annex F, in Amendment 1 to ISO/IEC 10646: 2003." <paragraph break>

In the minutes of the meeting, Die Deutsche Bibliothek added a request (n) "In the Unicode Glossary -- Umlaut and Diaeresis are equated -- would like to have that fixed." Asmus Freytag as Liaison for the Unicode Consortium answers (o), "Please propose specific changes to Unicode and we can entertain changes." <paragraph break>

The resolution effectively restates ANSI/NISO Z39.47-1993 which in Table 6 on page 7 states, "7-bit col/row 6/8. 8-bit col/row 14/8, name: umlaut (diaeresis), example of use: öppna." The latter is Swedish for "open" and the diacritic is an omljud (Ger. Umlaut). So if I understand the minutes of the meeting of Working Group 2, diaeresis now signifies umlaut, and Combining Diaeresis normalizes to a pre-composed character which is nice for North Europe. <paragraph break>

On the other hand, when German and French are coded in the same document and a distinction is to be preserved between Umlaut and diaeresis, the French graphemes specified in Appendix A, Table A1, ANSI/NISO Z39.47-1993, should be coded with a sequence of three combining characters which does not normalize into a pre-composed character (sic). In order for full phrase access to function, the author must make sure the audience is aware of this in advance or full phrase access must fail. <paragraph break>

It is customary to distinguish between simulation of calligraphy in late fifteenth century typography and simplification of typography in the middle years of the sixteenth century. Programmable splines which support dynamic diacritic attachment, table-based mappings between coded characters and composed glyphs, and further table-based mappings between incomplete coded characters which combine into complete coded characters for presentation purposes are far, far more complex than Gutenberg's composing case with its three hundred sorts. <paragraph break>

For Early Modern English and Modern English this is of no consequence, but for the other official writing systems of the European Union there are consequences of the above, and also of the transliteration into English and the character set of English for access points to the catalogue of the universal character set. Although Italian does not depend on diacritics and digraphs, aggregate language skills in Italy are low and acceptance of non-Italian access points to the catalogue likely to be equally low. <paragraph break>

Kind regards, <paragraph break>

Henrik Holmegaard

# Henrik Holmegaard, technical writer, mag.scient.soc. on Monday, November 27, 2006 11:31 AM:

Dear Michael, <paragraph break>

Character-glyph transforms and color-colorant transforms take the following types, roughly : pre-evaluated versus evaluatable (i.e. with output rendering options), non-invertible and invertible (i.e. with mapping back from input to output back to input, whether character code points or color code points), and evaluatable transforms in which the value added logic is in the transformation file versus evaluatable transforms in which the value added logic is in the application. <paragraph break>

Unicode, like PostScript, was conceived before pagination models abandoned the PostScript concept of streaming access with page-dependence caused to macro dictionaries. Unicode and PostScript data processing models are capable of an evaluatable transform, but Unicode, like PostScript, is not capable of an invertible transform. The Unicode character model combines graphic characters (what Joseph Beckers calls 'fragment glyphs') into graphemes from which the audience cannot deduce the constituent characters for full phrase softcopy document retrieval. <paragraph break>

In other words, the Unicode data processing model is device dependent in and device dependent out. The parallel is the PostScript data processing model for object level color-colorant transformation, the Color Space Array for input to the PostScript color connection space CIEXYZ and the Color Rendering Dictionary for output to colorants. <paragraph break>

The PostScript color-colorant data processing model has been called 'the photographer's and lithographer's harakiri' because the Color Rendering Dictionary is device independent in and device dependent out which makes it impossible in principle to know what the press will print : no inversion, no proofing -- ouch. Thus a CRD is prohibited as OutputIntent in PDF/X-3. The Unicode character-glyph data processing model could by the same token be called 'the typographer's harakiri', because the author and the audience must establish a prior agreement on which graphemes reference which character sequences. <paragraph break>

There is no evidence, whether explicit or implicit, that the Unicode Consortium has clarified that its character-glyph model is mutex with full phrase document retrieval for the Latin, Greek and Cyrillic scripts, or that the Unicode Consortium has clarified in collaboration with national libraries across the European Union and the United States how to bring about a situation in which the coding of softcopy documents for submission as legal deposits can apply the same principles as the coding of softcopy documents which will only be submitted as hardcopy documents, the softcopy being flushed from the memory of the raster image processor at print time. This issue is being taken before the national bodies of the International Organization for Standardization for prereading at present, and after prereading the issue will be taken before the Commission. The person on copy for Microsoft Corporation is Michael Stokes who was co-architect of Apple ColorSync, founding editor of the ICC Specification, and co-architect of sRGB. I have a lot of respect for Michael Stokes. <paragraph break>

Kind regards,

Henrik Holmegaard

iccabc project, third edition

# Henrik Holmegaard on Monday, November 27, 2006 5:01 PM:

Dear Michael,

Again, it is good to see that commercial censorship does not apply either at Apple Computer or at Microsoft Corporation, as it does at Adobe Systems.

Kind regards,

Henrik Holmegaard

# Jack on Tuesday, October 16, 2007 1:34 PM:

I am trying to find examples in text of diphthongs oe OE with little success (I have ae AE) so that I can paste them into text. I can find them printed in dictionaries but never in on-line versions. Any ideas?

# Henrik Holmegaard on Sunday, October 21, 2007 5:55 PM:

"I am trying to find examples in text of diphthongs oe OE with little success (I have ae AE) so that I can paste them into text. I can find them printed in dictionaries but never in on-line versions. Any ideas?"

The same glyph Æ, æ, represents a ligated digraph pronounced as a diphthong in French and a ligated digraph pronounced as a monophthong in Danish, Færoese, Icelandic, and Norwegian.

The French diphthong has no independent place in French collation, but the Nordic monophthong has an independent place in Nordic collations.

In French the glyph Œ, œ, represents a ligated digraph pronounced as a diphthong which has no independent place in French collation.

Æ, æ, Œ, œ, are called digraphs and are considered digraphs in the Xerox Coded Character Set of 1981, and in the Apple and Adobe character sets that followed.

Æ, æ, are called letters in ISO-IEC 8859-1:1987, but as Œ, œ, are considered ligatures by all concerned they are not included in ISO-IEC 8859-1:1987.

In ISO-IEC 10646-1:1993, Æ, æ, are considered ligatures. Denmark and Norway submitted a Defect Report and in 1995 Æ, æ, became letters.

In Apple TrueType 2.0:1994 and in Microsoft OpenType:1997, the unsettled situation led to inclusion of a Diphthong Ligature rendering intent that composed Æ, æ, Œ, œ, as type stylistics when AE, ae, OE, oe, were coded as text semantics.

Meanwhile, LATIN SMALL LIGATURE OE and LATIN CAPITAL LIGATURE OE have been included in ISO-IEC 10646 as one and only one code point each, and although they are called ligatures in the standard they operate as letters.

Why is this important? When a server publishes a coded character string with or without floating format markup formatting, the application that maps the character codes into glyph codes does so using the CMAP Character Map table of TrueType 1.0:1990 and higher.

The CMAP Character Map table provides a rendering from character codes to default glyph codes, whereas ligation is supported by secondary tables, the Apple MORX Metamorphosis tables and the Microsoft GSUB Glyph Substitution tables.

To the extent that ligation is intended at the level of type stylistics, it is not supported in electronic mail and in HyperText Markup Language. But since Æ, æ, Œ, œ, is intended at the level of text semantics, it is supported in electronic mail and in HTML.

Confused -:).

Best wishes,

Henrik Holmegaard

# Henrik Holmegaard on Sunday, October 21, 2007 6:04 PM:

Michael wrote:

"I can't be the only person who sees the irony in the above ..."

There are two problems, the first is that language teachers will not accept diæresis when what they want is umlaut / omljud, and the second is that combining sequences of alphabetic character codes and accent character codes tend to be called in one way or another as workaround. The best bet seems to be to take it to http://www.eurfedling.org/ and see what they make of it, once they understand the implications of the separation of semantics from stylistics, and the further implications which are that what we commit our intentions to in writing is the character code, not the key code and not the glyph code.


Henrik Holmegaard

# Michael S. Kaplan on Sunday, October 21, 2007 6:12 PM:

If I type a document, they will accept it. Not sure what teachers would refuse to accept this?

# Henrik Holmegaard on Monday, October 22, 2007 11:29 AM:

Michael wrote:

"If I type a document, they will accept it. Not sure what teachers would refuse to accept this?"

The use case is interactivity for invertible Unicode imaging in class. The teacher shows the class the glyph, and shows the class how to invert from the glyph code, which is a stylistic entity, to the character code, which is a semantic entity.

The teacher explains that the nice thing about computer composition is that alternative stylistics can be applied to the same semantics, whether for informal 1:1 typography or for formal 1:1, 1:1, or x:1 typography in which spelling and searching is preserved.

The teacher further explains the simple set theory of the separation of semantics from stylistics, for instance, as defined by the discussions before and after the Apple World Wide Developer Conference in May 1992.

The teacher uses e.g. the Character Palette in OS X in which there is a command, Show Character Selected in Application. Works for CMAP Character Map inversion, but not for MORX or GSUB inversion which is also not necessary for informal typography.

The teacher - whether in primary or a secondary school - knows that umlaut / omljud is not the same as diaeresis / trema. And her colleagues in the staff room know that umlaut / omljud is not the same as diaeresis / trema.

What is she supposed to do? Tell the students that there is no difference? That won't work as if one does not know the difference one is not likely to be able to sit successfully for a paper at university level in preparation for a secondary school teaching post.

Economies are shifting from authoring, archiving and accessing documents by composed stylistics to authoring, archiving and accessing documents by coded semantics, and education systems should follow suit.

Curricula should, consequently, include introductions to Unicode imaging.


Henrik Holmegaard

# Michael S. Kaplan on Monday, October 22, 2007 11:34 AM:

And yet education in Germany did not collapse with 8859-1 out there, or in the many years of Unicode.

Please prove that people cannot do their jobs here, and if you can try to do it with smaller comments that ordinary folks such as myself can read. You have not convinced me that the solution suggested by Unicode is in any way whatsoever inadequate, in fact as far as I can tell you are simply defending the difference that they already acknowledged and suggested a solution for!

# Henrik Holmegaard on Tuesday, October 23, 2007 2:07 AM:

Michael wrote:

"Please prove that people cannot do their jobs here, and if you can try to do it with smaller comments that ordinary folks such as myself can read. You have not convinced me that the solution suggested by Unicode is in any way whatsoever inadequate"

The burden of proof is on the salesman, not on the customer. Microsoft Corporation has no model for interactive invertible Unicode imaging.

Let the Ministries of Education in Germany, Sweden, Finland ... do surveys among primary school and secondary school staff, so there is a solid foundation for discussion.

Moreover, Finnish Standards are of the opinion that a Windows decomposed keyboard is a solution for input methods in the multilingual European Union.

The argument advanced by Finnish Standards does not consider inversion at all, nor do the endless arguments by John Hudson, Microsoft consultant.

Since what is searched is the character code, not the key codes and not the glyph code, searching becomes device dependent.

http://www.typophile.com/node/16229 :

"Keyboard layouts frequently make the encoding model invisible to the user: he presses a key and gets Ẹ́ and that gets rendered using one of several possible character->glyph options. He may neither know nor care whether that diacritic is encoded as one, two or three characters."

Hiding the fact that repurposing involves multiple character codes per algorithmic grapheme is in terms of product declaration to be termed what?

What is the difference between what Adobe did to invertible imaging in 1990 and what Microsoft is doing to type invertible imaging today?

Paul Nelson, Program Manager Fonts and Globalization, Microsoft Corporation, posted the project in q2 2004 that OpenType was never intended for inversion.

Paul Nelson also pointed the project to the architecture for supporting invertible glyph runs in XML - an architecture Microsoft has patented.

Come on, get real - there are no surveys, there are no open guides to configuring for compatible assumptions, there is only massive US marketing.

Henrik Holmegaard,

technical writer, mag.scient.soc.

# Michael S. Kaplan on Tuesday, October 23, 2007 2:09 AM:


# Henrik Holmegaard on Tuesday, October 23, 2007 8:47 AM:

Michael wrote :

"as far as I can tell you are simply defending the difference that they already acknowledged and suggested a solution for."

I have had two telephone conversations with Herr Heuvelmann of Die Deutsche Bibliothek who is a prereading participant, like Gregory Hitchcock and Michael Stokes of Microsoft Corporation, the Executive Committee of the European federation of language councils, and standards bodies in North Europe.

In the first telephone call I checked whether Herr Heuvelmann had a concept of the difference between arbitrary accent attachment for a hardcopy document which is authored and accessed by its composed stylistics and a softcopy document which is authored and accessed by its coded semantics. He only intended the solution for internal Library use.

I reported this in late 2006 to standards prereaders, and I have recently reported this again as the prereading moves from the monolingual English character naming model to the indeterminate character coding model which the American National Standards Institute supported in ISO DIS 10646.

ABC Local Letters, available to prereading participants in the prereading archive, includes the objection against an indeterminate imaging model lodged by the European Computer Manufacturers Association against Unicode 1.0. The original of the objection is in the academic archives of the University of Virginia.

This is not a game.

Best wishes,

Henrik Holmegaard

# Michael S. Kaplan on Tuesday, October 23, 2007 9:59 AM:


None of this answers the question I gave here about the actual solution given.

I give up.

# Henrik Holmegaard on Tuesday, October 23, 2007 12:39 PM:

Michael wrote,

"You have not convinced me that the solution suggested by Unicode is in any way whatsoever inadequate, ..."

You have not convinced me that the solution is adequate for a class context in which students and staff are involved - in the European Union, not in the United States.

(1) The character naming model is English only,

(2) there is no searchable character name component for 'umlaut' in Unicode itself, and

(3) the concept of decomposability, in so far as it is involved, is culturally incorrect.

A non-graphic control character to distinguish diaeresis which means diaeresis from diaeresis which means umlaut is the next hurdle.

If this were the scenario for a technical writing contract I'd pass, because I could not conceivably come up with a palatable approach.

Best wishes,

Henrik Holmegaard

technical writer, mag.scient.soc.

# Michael S. Kaplan on Tuesday, October 23, 2007 12:51 PM:


There are many cases and many languages (some in the EU, even!) where a letter has multiple pronunciations. Yet there is no mechanism to support a separate encoding for each pronunciation.

And in both ISO 8859-1 and in Unicode (until that recommendation) there has been no distinction between umlaut and diaeresis, and plenty of textbooks were written, even in Germany, and printed -- either with no distinction or by using a separate font -- treating it as a TYPOGRAPHIC distinction.

Now, after a request came in for it, a method to distinguish them in plain text has been provided. It is now done and anyone can use it. Not a single person has come to either WG2 or to the UTC complaining it is inadequate. I just have you inflicting words on me that I do not understand that make me want to disable comments to this post forever!

Unicode is not designed to do what you want, and neither is ISO 10646. This is what they provide, and how they work. I am sorry if you don't like that, but don't you have to take that up with them, not with the people who agree with them and have already moved on, such as myself?

# Henrik Holmegaard on Wednesday, October 24, 2007 3:03 AM:


1. Unicode character codes are interactively available on Mac OS X and on Microsoft Windows (ISO/IEC 8859-1 was not interactively available until the current version of OS X).

2. So, since encoding of softcopy should be correct, the difference between umlaut / omljud and diaeresis / trema should be configured by the softcopy author.

3. So, German and Swedish authors should use diaeresis when they mean umlaut / omljud, and other authors should use a combining control character plus a supporting SFNT when they mean diaeresis / trema.

4. How, in practice, would a Microsoft OpenType implementation look? Please, in your reply provide a step-by-step description of how to set up the GSUB tags in order to support the solution arrived at by the Unicode Technical Committee - we will implement your reply.

Herr Heuvelmann accepted a solution which was internal to the machine readable cataloguing of Die Deutsche Bibliothek, forgetting that with random access pagination models such as PDF and XPS, and softcopy cataloguing solutions by Apple and Google, the issue is not cataloguing of bibliographic information for hardcopy documents (for which the PostScript pagination was flushed from memory when the page is rendered), but softcopy documents in which the text semantics encoded by the author is the turning point of full phrase cataloguing.

Please, my name is Henrik.

Best wishes,

Henrik Holmegaard

technical writer, mag.scient.soc.

# Michael S. Kaplan on Wednesday, October 24, 2007 6:48 AM:

I am not going to provide you with the step-by-step of making a font (this is hardly what I do), though I can suggest the OpenType list for that kind of detailed knowledge.

However, I do know it is possible and even easy given the way fonts work. For someone who does work with fonts, making character+CGJ+diacritic have a different glyph than character+diacritic is an easy operation, and it works just fine in PDF, which actually works with the resulting glyph ID values....
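To make the two sequences concrete at the plain-text level, here is a minimal Python sketch. It assumes the convention discussed in this thread (bare U+0308 for umlaut, CGJ followed by U+0308 for trema); the helper names `umlaut`, `trema`, `nfc`, and `code_points` are made up for illustration, and nothing here touches fonts or glyph substitution.

```python
import unicodedata

CGJ = "\u034F"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def umlaut(base):
    """Hypothetical helper: base letter + bare diaeresis (read as umlaut)."""
    return base + DIAERESIS

def trema(base):
    """Hypothetical helper: base letter + CGJ + diaeresis (read as trema)."""
    return base + CGJ + DIAERESIS

def nfc(text):
    """Apply Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)

def code_points(text):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

# The bare sequence composes to the precomposed letter under NFC,
# while the CGJ sequence stays as three separate code points.
print(code_points(nfc(umlaut("a"))))
print(code_points(nfc(trema("a"))))
```

Because the CGJ sits between the base letter and the diaeresis, normalization does not fold the trema form into the precomposed letter, so the distinction survives in stored text and a font is free to map the longer sequence to a different glyph.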

# Henrik Holmegaard on Wednesday, October 24, 2007 8:59 AM:


The intention was to be polite and let Microsoft specify the method of making the SFNT.

There are people on the prereading list who make type, and the project has proposed production of both an Apple TrueType and a Microsoft OpenType implementation.

For the OpenType approach, if the UTC approach applies, then according to the above it applies to authors who should encode diaeresis with a combining character sequence.

What language councils and ministries of education think of this is an open question. Of course, Keld Simonsen and Erkki Kolehmainen who are cited in the UTC minutes are on copy.

Best wishes,

Henrik Holmegaard

# Michael S. Kaplan on Wednesday, October 24, 2007 9:45 AM:

Understood -- as long as you keep in mind that Microsoft is over 80,000 people so asking me for something that is outside my area when there are so many others around might make the kind of sense of walking up to a security guard or janitor in some random MS building and asking them for the same thing. :-)

Given what is accomplished today in fonts, I imagine all that is left is better communicating the requirement before you would see the update in lots of different fonts....

# Henrik Holmegaard on Sunday, February 24, 2008 11:30 AM:

On 12 August 2007 a proposal was put forward by a participant in the discussion, with a copy to Apple and Microsoft. The proposal was in ISO 14651 to equate base plus combining e above with base plus combining diæresis above and the precomposed forms for diæresis, when the meaning of the diacritic is to change the way the vowel is sounded rather than the way the vowel is stressed. While this solves the problem at a higher level of searching, it does not solve the problem at a lower level where the enduser selects characters by character names. It is not possible to sit for a paper in linguistics, if you can't tell the difference between diæresis and umlaut.
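The equivalence described above can be sketched as a toy search key in Python. This is only an illustration of the idea and not ISO 14651 or any real collation tailoring: the function name `search_key` is hypothetical, and it assumes that a trema is marked with a combining grapheme joiner (U+034F) before the diaeresis so that it stays distinct.

```python
import unicodedata

COMBINING_E = "\u0364"  # COMBINING LATIN SMALL LETTER E
DIAERESIS = "\u0308"    # COMBINING DIAERESIS

def search_key(text):
    """Hypothetical folding: treat combining e above like a diaeresis."""
    # NFD splits precomposed letters such as U+00E4 into base + U+0308.
    decomposed = unicodedata.normalize("NFD", text)
    # Fold the older "e above" spelling of umlaut onto the diaeresis.
    return decomposed.replace(COMBINING_E, DIAERESIS)

# Three spellings of umlauted a produce one key.
same = search_key("\u00e4") == search_key("a\u0364") == search_key("a\u0308")
print(same)
# A CGJ-marked trema keeps its own key in this toy model.
print(search_key("a\u034f\u0308") == search_key("\u00e4"))
```

As the comment says, a folding of this kind only helps at the searching level; it does nothing for a user who picks characters by name.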

The issue is the assumption that metadata models must be in English translation and in English transliteration. Joseph Becker touches on this assumption in relation to ANSI X3.4 American Standard Code for Information Interchange in his Unicode proposal of 1988. As noted elsewhere, an IBM Selectric, a Monotype composition system, a Linotype composition system, and a Heidelberg printing press do not involve issues of character interaction and character identification, because the author and the audience merely interact with the mark, not the meaning of the mark. A conceptual layer that is not conceived of in traditional telegraphy, and that is certainly not conceived of in traditional typography, is needed.




referenced by

2008/03/22 *Insert a pun involving the word TREMA here*
