On disliking Emoji, disrespecting code pages, and not looking past dogma

by Michael S. Kaplan, published on 2010/12/28 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/12/28/10108950.aspx

It will come as no surprise to people who know my "World Readiness" persona that I am not so fond of the Emoji (like those added to Unicode 6.0).

For those who read this blog, it has come up here before.

I don't like them, because they feel to me like such a betrayal of some of the core principles that I used to espouse.

And I still believe that.

But I like enough of what Unicode does that I can see past that.

I mean, here at Microsoft, for every Kin there's a Windows Phone 7, for every Vista there's a Windows 7, for every Bob there's also a Kinect, and so on. Overall I like the stuff that comes out of Microsoft, even though the working for them part feels more ordinary these days.

And I still believe that (both of those thats).

Now I think about the time I was on the NLS team (back when they were actually still called the NLS team) where everyone pushed so hard to get people off of legacy code pages and into using Unicode.

And everyone has been quite consistently on message for that.

I still believe that message is a correct one, and I believe there is no place for new code pages for languages.

But when I think of the Emoji, when I know that some people will probably want to use and support them (like the phone and instant messaging and email and so on), and when I think of files like Unicode's EmojiSources.txt from the Unicode Character Database, I wonder if maybe my belief system, and the belief system of all of those who have been espousing the above, might be missing out on something obvious.

From that file's header:

# EmojiSources-6.0.0.txt
# Date: 2010-04-24, 00:00:00 GMT [MS]
# Unicode Character Database
# Copyright (c) 1991-2010 Unicode, Inc.
# For terms of use, see http://www.unicode.org/terms_of_use.html
# For documentation, see http://www.unicode.org/reports/tr44/
# This file provides mappings between Unicode code points and sequences on one hand
# and Shift-JIS codes for cell phone carrier symbols on the other hand.
# Each mapping is symmetric ("round trip"), for equivalent Unicode and carrier
# symbols or sequences. This file does not include best-fit ("fallback")
# mappings to similar but not equivalent symbols in either mapping direction.
# Note: It is possible that future versions of this file will include
# additional data columns providing mappings for additional vendors.
# Format: Semicolon-delimited file with a fixed number of fields.
# The number of fields may increase in the future.
# Fields:
# 0: Unicode code point or sequence
# 1: DoCoMo Shift-JIS code
# 2: KDDI Shift-JIS code
# 3: SoftBank Shift-JIS code
# Each field 1..3 contains a code if and only if the vendor character set
# has a symbol which is equivalent to the Unicode character or sequence.

If the Japanese telcos or those products I mentioned or whoever needs to map from their various proprietary mappings to and from Unicode, then that is essentially what code pages are all about.
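To make that concrete, here is a minimal sketch in Python of reading a file in the format the header describes and turning it into a per-vendor lookup table, which is roughly what a code page amounts to. The function names (parse_emoji_sources, build_decoder) and the vendor keys are my own for illustration, and the sample rows are made up to match the documented field layout; they are not actual entries from EmojiSources.txt.

```python
# Hypothetical vendor labels for fields 1..3, in the order the header lists them.
VENDORS = ("docomo", "kddi", "softbank")


def parse_emoji_sources(lines):
    """Map each Unicode sequence (tuple of code points) to {vendor: Shift-JIS code}."""
    table = {}
    for raw in lines:
        # Drop comments and blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        # Field 0: one or more space-separated hex code points.
        seq = tuple(int(cp, 16) for cp in fields[0].split())
        codes = {}
        # Fields 1..3: a hex code, or empty if the vendor has no equivalent.
        # zip() ignores any extra fields a future version of the file might add.
        for vendor, field in zip(VENDORS, fields[1:]):
            if field:
                codes[vendor] = int(field, 16)
        table[seq] = codes
    return table


def build_decoder(table, vendor):
    """Invert the table for one vendor: Shift-JIS code -> Unicode string."""
    return {
        codes[vendor]: "".join(chr(cp) for cp in seq)
        for seq, codes in table.items()
        if vendor in codes
    }


# Made-up rows in the documented format (NOT real data from the file).
SAMPLE = [
    "# illustrative only",
    "0041 0042;F001;;F003",
    "0043;;F102;",
]

table = parse_emoji_sources(SAMPLE)
docomo = build_decoder(table, "docomo")
```

Because the header says each mapping is a round trip, inverting the table per vendor loses nothing, and the same table could be walked in the other direction to encode.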

Perhaps being dogmatically against code page support is really not such a good idea.

Perhaps the focus should have been on language support (which really does require Unicode), with an acknowledgment that Microsoft has to support things like GB 18030 anyway, rather than on being so against the concept of code pages for mappings that can still make sense.

Like the vendor mappings between Emoji and Unicode, whatever they may be.

I mean, every time one of my friends with an iPhone (and there are a lot of them) sends a tweet via Twitter and their smiley face emoticons are private use area characters, I know that Microsoft is not alone -- Apple is making the exact same stupid mistake that Microsoft is, albeit in a different way.

In my opinion, there should be symbol mappings added to a brand new code page (or if necessary multiple code pages) to support this key scenario.

Claiming that Emoji are crucial (which many people do claim) and not providing the proper mappings between them and the random crap that people are using throughout the world because of a frenzied dogmatic belief that code pages are evil and so no new code pages should be supported is a really bad product decision coming out of a really bad belief system.

With that said, I have minimal say in what these product groups do. Many of them read here and listen to what I say, shortly before they ignore it and do what they had already decided to do.

In that way I am like a not-as-well-paid version of Ray Ozzie, whose thoughts on issues such as privacy and career stage profiles and the Cloud are brilliant and deserve better than to be discounted by the huge percentage of MS discounting them....

So I would be truly surprised to see things supported this way when the time comes to do the work to support Emoji. No one wants to admit they took a belief too far, since that would mean implicitly admitting one was wrong about something (and who wants to do that when the next review is in their minds?).

I'm no better, mind you; I doubt I'd be writing this particular blog if I was still on the NLS team.

Perhaps it is time to move on to a good idea, instead. That would be much better than forcing everyone to roll their own.....

John Cowan on 28 Dec 2010 1:55 PM:

Unified ideographs are actually defined the same way as emoji: with reference to the union of particular existing de facto or de jure standards, aka character sets.  The answer to "What is U+4E00?" is "It is GB 5027, JIS 1676, KSC 7673, ..."

Michael S. Kaplan on 28 Dec 2010 5:20 PM:

Indeed. And we actually have those mappings for one and not the other,  for the reasons given....

Random832 on 28 Dec 2010 5:55 PM:

I think of them as ideographs in a very early stage of evolution. No, I'm serious... Some characters that we would never think of as being "not text" are still pictures, more or less, of what they mean. I'm reminded of a recent blog post by Raymond Chen that mentioned in passing that 車 is basically a top view of a car. From a certain point of view, you can still see the ox head in the latin capital letter A. And in 123 the graphical relationship to tally marks can still almost be seen - it's more clear in ١٢٣, even more in 一二三.

In a way, emoji - and the 'emoticons' haphazardly built up out of normal text elements that came before them - are a natural consequence of the typewriter. Before, even in written communication, tone could be conveyed by the little details of handwriting. "Plain" text wasn't so plain, and good luck encoding _that_ in Unicode. Now all that's left to us is the choice between Times Roman and Comic Sans. In many places even that is gone.

Miguel Sousa on 29 Dec 2010 12:51 AM:

And every time someone sends me a smiley face using Outlook what I get is an uppercase J... blogs.msdn.com/.../10033725.aspx

Mihai on 29 Dec 2010 3:19 PM:

My understanding is that the various Japanese phones are using the JIS equivalent of PUA.

So before adding new code pages we should take a look at the stuff coming out of the Japanese handsets.

Is the text even tagged somehow? Is there a way to know if the text comes from a SoftBank or DoCoMo phone?

If not, then adding code pages is pretty much pointless, because there is no way to know the "JIS flavor".

My feeling about Emoji is similar to yours (don't like the idea).

They might be useful, but in reality the Japanese vendors don't care about them.

No, strike that: they don't want them. Because (I think) they don't want interoperability

(if they wanted that, they would have tried to achieve it in the "JIS PUA" first)

But as it is, they can encourage social networks to stay in the same vendor ("see, if your friend is not with us, you can't use Emoji")

And having more (and cooler) Emoji is a selling point, so adding more (incompatible) symbols is an almost sure thing.

So adding stuff to Unicode, or trying to add "standard" code pages is just trying to play catch up with an industry that does not want interoperability or standards.

jmdesp on 29 Dec 2010 4:27 PM:

This in some ways is really worrisome. Additional code pages would only partially solve the problem, since I can imagine each of those extensions will regularly be tagged as simply SJIS, and not Docomo-SJIS, KDDI-SJIS or SoftBank-SJIS.

Yuhong Bao on 30 Dec 2010 2:00 AM:

"This in some ways is really worrisome. Additional code pages would only partially solve the problem, since I can imagine each of those extensions will regularly be tagged as simply SJIS, and not Docomo-SJIS, KDDI-SJIS or SoftBank-SJIS."

Yep, it is already a problem, as Shift_JIS-2004 and codepage 932 both use the same user-defined character range.
