>50% of the web is Unicode? Meh, I say. Meh.

by Michael S. Kaplan, published on 2010/09/19 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/19/10064647.aspx


Back at the end of August, president of the Unicode Consortium, Google engineer, and all-around nice guy Mark Davis broke radio silence on his Twitter account and tweeted the following:

That link is to the older article on the Google Blog (Unicode nearing 50% of the web) from January of this year.

Now everyone repeated this and you can see the "25 retweets" annotation. It sounds like really big news, and very cool.

I've decided that I am not really all that impressed.

Sorry, Mark.

I should explain why I am not impressed.

First, I'll grab another tweet with one of those stats of the type you hear all the time from Google exec types, this time tweeted by nomad411, someone else I follow:

Now one can obviously presume that Google has everything on the Internet indexed. They say they do.

And obviously this quote isn't specifically talking about the Internet (though similar quotes from Google execs that this quote reminds me of often do!).

So, in an environment that claims the amount of data will literally more than double in 48 months, it really took seven months to go from "just under 50%" to "just over 50%". By these rough numbers, that time frame would mean perhaps a ~14% increase in the data on the Internet, so the fact that we only moved a few percentage points on the Unicode side seems worrying.
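To put rough numbers on that estimate, here is a minimal sketch. It assumes the "more than doubled in 48 months" claim is read as exactly 2x, which is a lower bound, and the function names are made up for illustration. Prorating the doubling linearly over seven months gives about 14.6%; treating the growth as compounding gives about 10.6%. Either way it is a double-digit increase in data.

```python
# Rough growth estimate: how much more data after `months`,
# if the total doubles every `doubling_months`?
# Assumption: "more than doubled in 48 months" is treated as exactly 2x.

def linear_growth(months: float, doubling_months: float = 48.0) -> float:
    """Fractional increase if the doubling is spread evenly (linear proration)."""
    return months / doubling_months


def compound_growth(months: float, doubling_months: float = 48.0) -> float:
    """Fractional increase if growth compounds at a constant rate."""
    return 2.0 ** (months / doubling_months) - 1.0


if __name__ == "__main__":
    months = 7  # "just under 50%" to "just over 50%"
    print(f"linear:   {linear_growth(months):.1%}")
    print(f"compound: {compound_growth(months):.1%}")
```

The linear figure is the back-of-the-envelope one; the compound figure is what you get if the data really grows exponentially.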

Why is so much of the data being created today on the Internet not in Unicode?

Note that ASCII is UTF-8, etc.
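The "ASCII is UTF-8" point can be checked directly: any byte sequence made up only of values below 0x80 decodes to the same text as ASCII and as UTF-8. The sketch below is only an illustration of how one might bucket content by what its bytes actually are rather than by what its label claims. The `classify` helper and its three bucket names are hypothetical, not how Google or anyone else actually counts.

```python
# Bucket a byte string by content, not by its declared charset label.
# `classify` and the bucket names are illustrative assumptions.

def classify(data: bytes) -> str:
    # Pure ASCII: every byte is below 0x80, so it is also valid UTF-8.
    if all(b < 0x80 for b in data):
        return "ascii"
    # Non-ASCII bytes present: see whether they form valid UTF-8.
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


if __name__ == "__main__":
    samples = {
        "plain English": b"hello",
        "UTF-8 text": "caf\u00e9".encode("utf-8"),
        "Latin-1 text": "caf\u00e9".encode("latin-1"),
    }
    for name, raw in samples.items():
        print(f"{name}: {classify(raw)}")
```

A check like this says nothing about the label on the page, so a document can land in one bucket by content while claiming another in its header.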

This seems a much more interesting question for Google to spend time on, so they can move more of the web to Unicode -- why not a blog pointing out the nature of the new data that isn't using Unicode?


John Cowan on 19 Sep 2010 8:01 AM:

The figure for UTF-8 excludes pure ASCII.  Pretty much what we've got now, according to Google, is 50% UTF-8, 20% pure ASCII, 20% Latin-1 (including Windows-1252, 8859-15, etc.), and 10% Other.

I expect a lot of that new data is mainly digits in relational databases.

Michael S. Kaplan on 19 Sep 2010 8:30 AM:

I'd love to see details on the other data, rather than assuming/expecting what it may be....

Michael S. Kaplan on 19 Sep 2010 4:53 PM:

Also, this is one of those cases where labels vs. content can be interesting -- knowing who is at least claiming to be using UTF-8 helps one know what % of the web is using Unicode without punishing the people who use English....

greenlight on 20 Sep 2010 3:53 AM:

Well, by "data" they could mean video and images, which obviously won't be in Unicode, and which also take up much more "space" and would count for a lot more than text (even if that text were in UTF-16 ;))

parkrrrr on 20 Sep 2010 6:49 AM:

I want to know what fraction of the web is really UTF-8 but has a header that claims it's ISO 8859-1. Based on my personal experience, I'll bet it's pretty high.

Michael S. Kaplan on 20 Sep 2010 7:00 AM:

That's what I mean -- the Google Blog article is a puff piece. Why not answer useful, interesting questions? :-)

kasey on 11 Nov 2010 8:55 PM:

Is there a way I can use Unicode to use certain letters/symbols, variations to construct my own language and be able to (type it out on word document) *** I know a lot about linguistics, its my career, lol, but when it comes to technology and/or computers I basically have 1% knowledge about them (comps) pls email me if u have the answer or any other solution, I can make my own letters easy, but I wan2 b able to use the comp, billy742001 at yahoo dot kom


referenced by

2012/03/01 >60% of the web is Unicode? Meh, I say. Meh.

2011/02/07 You might almost say that Gmail got þwned
