>50% of the web is Unicode? Meh, I say. Meh.

by Michael S. Kaplan, published on 2010/09/19 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/09/19/10064647.aspx

Back at the end of August, president of the Unicode Consortium, Google engineer, and all-around nice guy Mark Davis broke radio silence on his Twitter account and tweeted the following:

Now everyone repeated this and you can see the "25 retweets" annotation. It sounds like really big news, and very cool.

First, I'll grab another tweet that has one of those stats of the type you hear all the time from Google exec types, this time mentioned by nomad411, someone else I follow in a tweet:

Now one can obviously presume that Google has everything on the Internet indexed. They say they do.

And obviously this quote isn't specifically talking about the Internet (though similar quotes from Google execs that this quote reminds me of often do!).

So, in an environment that claims the amount of data will be literally more than doubled in 48 months, it really took seven months to go from "just under 50%" to "just over 50%". By these rough numbers, that time frame would mean there is perhaps a ~14% increase in the data on the Internet, so the fact that we only moved a few percentage points on the Unicode side seems worrying.
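For anyone who wants to check the rough numbers, here is a minimal sketch of the arithmetic. It assumes an exact doubling every 48 months and a seven-month window, both taken from the figures above; the function names are made up for illustration. The ~14% figure matches a simple linear estimate, while compounding the same doubling rate gives a somewhat smaller number. Either way the point stands.

```python
# Rough growth over a 7-month window, assuming data doubles every 48 months.
DOUBLING_MONTHS = 48  # assumption: treat "more than doubled in 48 months" as an exact doubling
WINDOW_MONTHS = 7     # the "just under 50%" to "just over 50%" interval


def linear_growth(months, doubling=DOUBLING_MONTHS):
    """Fraction of growth if the 100% increase is spread evenly over the period."""
    return months / doubling


def compound_growth(months, doubling=DOUBLING_MONTHS):
    """Fraction of growth if the data grows exponentially at the doubling rate."""
    return 2 ** (months / doubling) - 1


print(f"linear:   {linear_growth(WINDOW_MONTHS):.1%}")    # linear:   14.6%
print(f"compound: {compound_growth(WINDOW_MONTHS):.1%}")  # compound: 10.6%
```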

This seems a much more interesting question for Google to spend time on, so they can help move more of the web to Unicode -- why not a blog pointing out the nature of the new data that isn't using Unicode?

The figure for UTF-8 excludes pure ASCII. Pretty much what we've got now, according to Google, is 50% UTF-8, 20% pure ASCII, 20% Latin-1 (including Windows-1252, 8859-15, etc.), and 10% Other.

I expect a lot of that new data is mainly digits in relational databases.

I'd love to see details on the other data, rather than assuming/expecting what it may be....

Also, this is one of those cases where labels vs. content can be interesting -- knowing who is at least claiming to be using UTF-8 helps one know what % of the web is using Unicode without punishing the people who use English....

Well, by "data" they could mean video and images, which obviously won't be in Unicode, and which also take up much more "space" and would count for a lot more than text (even if that text were in UTF-16 ;))

I want to know what fraction of the web is really UTF-8 but has a header that claims it's ISO 8859-1. Based on my personal experience, I'll bet it's pretty high.
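One rough way to look at the label vs. content question is to ignore the declared charset and test the bytes themselves. The sketch below is only a heuristic, not how Google or anyone else actually measures this, and the function names and the list of Latin-1-style labels are assumptions made up for illustration. Any byte sequence decodes as Latin-1, so content that is non-ASCII and still decodes cleanly as UTF-8 is probably UTF-8 whatever the header says, though a very short string can pass by coincidence.

```python
def sniff_encoding(data: bytes) -> str:
    """Classify raw bytes as 'ascii', 'utf-8', or 'other' by content alone."""
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


# Hypothetical set of labels treated as "Latin-1 family" for this sketch.
LATIN1_LABELS = {"iso-8859-1", "latin-1", "latin1", "windows-1252", "cp1252", "iso-8859-15"}


def is_mislabeled_utf8(data: bytes, declared: str) -> bool:
    """True when the label is Latin-1 family but the bytes are valid non-ASCII UTF-8."""
    return declared.strip().lower() in LATIN1_LABELS and sniff_encoding(data) == "utf-8"


page = "café".encode("utf-8")
print(sniff_encoding(page))                    # utf-8
print(is_mislabeled_utf8(page, "ISO-8859-1"))  # True
```

Pure ASCII pages are deliberately not flagged, since they are valid under either label.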

That's what I mean -- the Google Blog article is a puff piece. Why not answer useful, interesting questions? :-)

Is there a way I can use Unicode to use certain letters/symbols, variations to construct my own language and be able to (type it out on word document) *** I know a lot about linguistics, its my career, lol, but when it comes to technology and/or computers I basically have 1% knowledge about them (comps) pls email me if u have the answer or any other solution, I can make my own letters easy, but I wan2 b able to use the comp, billy742001 at yahoo dot kom
