by Michael S. Kaplan, published on 2008/06/14 16:19 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/06/14/8598294.aspx
Over in the Suggestion Box, regular reader Jan Kučera asks:
The other day, I was thinking... what would have Michael done differently, if there was no ASCII yet? If he could design encoding from scratch given all he knows today?
I still have some doubts whether I should ask that but let's try to place the question... so any thoughts?
Wow, I am sure there are many of the Unicode elders who have some strong opinions about this one!
If we were starting from scratch and it were up to me, I don't think I'd try to show too much imagination -- the requirements of being compatible with every country and company out there might still be around.
For all of the various ways of doing things in Unicode, whether one is thinking about normalization or combining classes or canonical equivalence or casing or properties -- really anything -- you can look at Unicode and see many of the wear marks as you see things done one way in some cases, and in another way in others.
Or maybe Jan is really asking even before that -- literally before there was even ASCII?
That is even scarier -- I have a hard time believing that I (or really anyone) would be able to convince all of the people to think ahead to the need to support every script in the world, past and present. Sure, I could make the arguments, but I doubt anyone would be willing to listen.
So in the end, I think I would let things unfold as they did but concentrate on all of the weird cases where people now look at Unicode and just point out things that could have been done better -- and just do those things better from the start. With the benefit of hindsight before things have happened, encoding would probably be a lot cleaner, if nothing else. :-)
You could probably fill in computer language names or runtime library implementations or API definitions or many other items here too for this kind of hypothetical -- would someone starting Win32 from scratch do it all the same? Probably not -- many of the "warts" that come from the project having been around so long and worked on by so many people would be able to be fixed -- and then we would just wait for the new warts to form once we run out of hindsight!
Once I ran out of hindsight, I (and everyone else in this "secret knowledge" club) would be required to retire or prove that we deserve to stay. Because being able to look at a problem in retrospect and discern the best solution is a great skill, but it is NOT the same skill as that involved in making new decisions that have no precedent, and assuming that the rock stars in one space will even be competent in the other is likely to hasten the creation of the new warts.
Lest we forget, it is our rock stars we have now who wrote most of the existing warts -- along with a lot of good work, too!
All of the characters in Unicode, owing their existence to Unicode existing in its current form, have jointly agreed to not sponsor this post. The latest attempt to create a "Character's Union" was narrowly defeated and everyone breathed a sigh of relief as the Cyrillic Local AFL-CIO chapter was not able to be formed...
orcmid on 15 Jun 2008 12:35 AM:
Having lived in a time before ASCII, I have some perspective on that.
What you need to appreciate was that ASCII wasn't even a full 7-bit character set originally, and it was driven in large part by the need to clean up teletypewriter codes.
You also need to take into consideration the limitations of media at that time. The 8-bit byte was not the norm yet, and most binary computers used 6-bit codes for characters (and not all of the code points had available graphics - 48 printable characters were considered an abundance). Also, memories were *small*. 32k 36-bit words was a large memory, and telling people to go to 8-bit codes or, worse, 16-bit codes for only a small number of usable code points would have made no sense at all.
More importantly, magnetic tape tended to use 6-bit frames (with parity), telecommunication was largely serial in 6-7 bits, and printers and displays did not handle large character sets.
So the opportunity was not present. Also, the emergence of the IBM System/360 computer messed things up by introducing a sparsely-populated 8-bit EBCDIC code with a version of ASCII relegated to a rarely-used "compatibility" bit in the program state.
This basically spoiled the opportunity to establish ASCII until the end-run by minicomputers (often equipped with ASCII-using teletypewriters and, later, the DEC VT52 ASCII-based alphanumeric display and other ASCII-based clone displays). These were not bitmapped devices and the character codes and their glyphs tended to be wired in, although burning your own character-generator ROMs was not unheard of.
Microprocessors generally used ASCII (and 8-bit bytes) but it was the IBM PC being an ASCII-based computer that foretold the ultimate opportunity to migrate to Unicode, large fonts, and so on, but only after the pain of code pages and families of ISO 8-bit codes. After all, we were starting with machines where the high-end products had 64 kilobytes.
The serious availability of economical laser printers was also important, and the HP LaserJet did not reach the market until 20 years after ASCII, and even then font capabilities were limited. Dot-matrix printers had been around a long time but they didn't have the resolution. Today's low-cost, high-quality color inkjet printer was unimaginable in the early reign of ASCII.
So circumstances would not have allowed for Unicode much before it actually arrived and I am not sure that today's hindsight would have been understandable without going through the process that got us where we are today.
There were a large number of technologies that had to advance together for us to arrive at a place where we could be complaining about the defects of Unicode and how we could do it over better given the chance (an opportunity I do not anticipate).
orcmid on 15 Jun 2008 11:16 AM:
An afterthought: There were some early efforts with wide character codes though I don't know how that helped fertilize Unicode, for good or ill. Joe Becker, who was the Mr. Unicode Software Scientist at Xerox XSoft when I arrived there in 1992, would have far more insight into at least one case. (I'm not sure I ever met Joe, who was operating in quiet isolation as far as I could tell. A missed opportunity there.)
I suspect that those who worked on Xerox Interpress and Adobe PostScript could also provide perspective on how larger character repertoires arose in the context of printing/publishing systems.
I'm not sure that we've achieved enough of a common perspective on code points, character sense, glyphs and culture that one could claim there is a consensus on what a clean approach to Unicode would be. (I must dig up that incomplete post that yours inspired some time last year.)
John Cowan on 15 Jun 2008 5:11 PM:
In the alternative universe of Ill Bethisad, computing arises in trilingual Ireland (Irish, Brithenig, English) rather than in North America, and although the initial character set is 8-bit, the concept of combining characters is present from the start, although they precede their bases (in the manner of dead keys) rather than following them. The 8-bit code is extended without too much pain to a 16-bit code that handles all the scripts of the world in a similarly Cleanicode-ish way (e.g. Han ideographs are expressed in the style of IDSes).
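For contrast with the mark-before-base ordering described above, here is a minimal Python sketch of the ordering Unicode actually uses, where combining marks follow their base. The helper names `clusters` and `marks_first` are hypothetical, and `marks_first` only imitates the dead-key-style ordering for illustration; it is not a real encoding. The grouping is deliberately naive (it is not full grapheme cluster segmentation).

```python
import unicodedata
from typing import List


def clusters(text: str) -> List[str]:
    """Group each base character with the combining marks that follow it.

    Naive sketch: any character with a non-zero canonical combining class
    is attached to the preceding character.
    """
    out: List[str] = []
    for ch in text:
        if unicodedata.combining(ch) and out:
            out[-1] += ch  # mark attaches to the base before it
        else:
            out.append(ch)  # a new base character starts a new group
    return out


def marks_first(text: str) -> str:
    """Hypothetical reordering: emit each group's marks before its base."""
    return "".join(group[1:] + group[0] for group in clusters(text))


decomposed = "e\u0301a"  # 'e', U+0301 COMBINING ACUTE ACCENT, 'a'
print(clusters(decomposed))
print([f"U+{ord(c):04X}" for c in marks_first(decomposed)])
print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")
```

In Unicode order the accent is stored after the "e" it modifies, so a reader of the stream cannot know a base is finished until it sees the next base; in the prefix ordering the marks announce themselves first, much as a dead key does on a keyboard.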
Jan Kučera on 28 Jun 2008 12:11 PM:
Thanks, Michael, for the thoughts, and orcmid for the valuable comments. Well, I didn't live in those times, but the one thing that sticks in my mind is that memory was small and expensive.
I wasn't expecting anything concrete from this topic, but you have noted some good points.
The other thing specific to this area is that you would need a lot of people from a lot of places around the world to be able to collect everything you would like (or need) to know. And that was not easy either.
I also agree that convincing all the people to support every script, past, present and maybe future as well, is indeed not really likely, but at least now you can point to the past when designing new things and say, hey, these guys didn't believe it either! :)
Volker Hetzer on 13 Oct 2009 4:30 AM:
I was studying when Unicode was still young and our professors complained bitterly about the variable-length encodings.
Basically, in order to access the n-th character you have to go through all n-1 characters before it.
It kills character-based hashes and makes scanners and parsers more difficult.
About the only thing the Unicode designers thought of was rendering and displaying, with everything else falling off the table.
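A minimal sketch of the indexing cost Volker describes, assuming the variable-length encoding in question is UTF-8 (the comment does not say which one). The function name `nth_code_point_utf8` is hypothetical. It finds the n-th code point by walking the bytes from the start and counting lead bytes, and the fixed-width UTF-32 lookup is shown alongside for comparison.

```python
def nth_code_point_utf8(data: bytes, n: int) -> str:
    """Return the n-th (0-based) code point of valid UTF-8 data.

    There is no way to jump straight to it: every byte before the target
    has to be inspected to count how many code points have gone by.
    """
    if n < 0:
        raise IndexError("negative index")
    index = -1    # index of the code point currently being scanned
    start = None  # byte offset where the n-th code point begins
    for pos, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; anything else starts
        # a new code point.
        if byte & 0xC0 != 0x80:
            index += 1
            if index == n:
                start = pos
            elif index == n + 1:
                return data[start:pos].decode("utf-8")
    if start is None:
        raise IndexError("code point index out of range")
    return data[start:].decode("utf-8")


def nth_code_point_utf32(data: bytes, n: int) -> str:
    """Fixed-width lookup: the n-th code point sits at byte offset 4 * n."""
    chunk = data[4 * n:4 * n + 4]
    if n < 0 or len(chunk) != 4:
        raise IndexError("code point index out of range")
    return chunk.decode("utf-32-le")


text = "na\u00efve caf\u00e9 \u65e5\u672c\u8a9e"
print(nth_code_point_utf8(text.encode("utf-8"), 11) == text[11])
print(nth_code_point_utf32(text.encode("utf-32-le"), 11) == text[11])
```

The trade-off is the one orcmid's comment hints at: the fixed-width form buys constant-time indexing at four bytes per code point, while UTF-8 keeps ASCII text at one byte each and pays with a linear scan. Note that even the fixed-width form indexes code points, not user-perceived characters, once combining marks are involved.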
Michael S. Kaplan on 13 Oct 2009 9:07 AM:
Patently untrue, and there are hundreds of gigabytes of documents that speak to the other issues, much more per capita than those professors have published.
Though, if I may say so, those professors only like theoretical problems that are easily solved -- and real-world problems, which in this case affect the whole world, they run from under the "those who can, do; those who can't, teach" doctrine.
I'm not an apologist (and glad that in this case no apologist is required), but some of those professors should apologize to their students for their utter lack of rigor....