>60% of the web is Unicode? Meh, I say. Meh.

by Michael S. Kaplan, published on 2012/03/01 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/03/01/10275202.aspx

Back on September 19, 2010, in ">50% of the web is Unicode? Meh, I say. Meh.", I wrote about Mark Davis's report that, as of August 9, 2010, Google found that > 50% of the text it was indexing was Unicode.

Now, a mere 19 months later, a new Google blog entry by the same Mark Davis, titled "Unicode over 60 percent of the web", explains that the new number is > 60%.

So, in an environment where the amount of data is claimed to literally more than double every 48 months, it really took nineteen months to go from "just over 50%" to "just over 60%". By these rough numbers, the data on the Internet grew by ~40% over that time frame, so the fact that the Unicode share only moved ten percentage points seems worrying.

What is all of that new data that isn't using Unicode? That seems a much more interesting question for Google to spend time on, so they can help move more of the web over. Why not a blog entry pointing out the nature of the new non-Unicode data?

To me, that graph mostly just makes it obvious both how little non-English content there is on the web, and (surprisingly!) how little non-English content providers care about Unicode (note how little of the non-Latin, non-ASCII share was eaten by UTF-8). I expect a lot of the rise of UTF-8 was simply WordPress (et al.) "smartening" formatting characters in posts, meaning pages incorrectly tagged as Latin-* pretty quickly show up as broken. Or does WordPress try to output a matching character encoding declaration? I don't know PHP well enough to know how it deals with encodings.

I don't for a moment believe that the number of web pages in Google's indexes is growing anywhere near as fast as the total amount of data recorded anywhere by anyone for any purpose, so those numbers simply aren't comparable.

It's clear, looking at the graphs, that the growth of UTF-8+ASCII share is almost entirely at the expense of Latin-1-style encodings (including Win-1252, 8859-9, 8859-15, etc.), which have shrunk from a 2006 peak of 35% to today's 10%. Note that Google counts the encoding actually in use, not the declared encoding. Mark mentioned to me that the figure for UTF-16-style encodings, which might be thought to be a confounding factor, is less than 0.1%.
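
To illustrate what counting "the encoding actually in use" might look like, here is a minimal Python sketch. It is not Google's method, which the post does not describe; the function name `classify_actual_encoding` and the three-way bucket are assumptions made purely for illustration. It ignores the declared charset entirely and looks only at the bytes.

```python
def classify_actual_encoding(raw: bytes) -> str:
    """Bucket a byte string by what it actually contains (hypothetical helper).

    Returns "ascii" if every byte is below 0x80, "utf-8" if the bytes
    decode as strict UTF-8, and "other" for everything else.
    """
    if raw.isascii():
        return "ascii"
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: some legacy encoding such as Latin-1 or Big5.
        return "other"
    return "utf-8"


print(classify_actual_encoding(b"hello"))
print(classify_actual_encoding("café".encode("utf-8")))
print(classify_actual_encoding("café".encode("latin-1")))
```

A page declared as Latin-1 whose bytes are all below 0x80 lands in the "ascii" bucket here, which is one way a declared-versus-actual count can differ.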

John, I wouldn't say they aren't comparable -- one is a microcosm of the other. So more like they are apples vs. apple orchards. :-)

Since more than half of the Internet is Unicode, how about finally changing the default Notepad encoding from "ANSI" to UTF-8?

3/5 of users are apparently actively changing it already.

The other 2/5 of users (or at least 1/5) are likely ignorant of encoding anyway.

I often visit a few telnet-based BBS forums that use Big5 internally. The forums also support web and NNTP access. The encoding used for these two other access paths is, you guessed it, also Big5.

They could have used iconv to do the conversion, but I really can't blame them for not doing so. If their internal datastore (I can't tell whether they use a database or some arbitrary storage scheme) stores data in Big5, it's a bit pointless to convert if their users don't need it at all.
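
For reference, the kind of conversion iconv would perform is small. The sketch below uses Python's built-in `big5` codec rather than iconv itself, and `big5_to_utf8` is a hypothetical helper name. It assumes the stored bytes are plain Big5; text that relies on vendor extensions may need a different codec.

```python
def big5_to_utf8(raw: bytes) -> bytes:
    """Re-encode Big5 bytes as UTF-8 (hypothetical helper).

    Undecodable byte sequences are replaced with U+FFFD instead of
    raising, so one bad post does not break a whole page.
    """
    text = raw.decode("big5", errors="replace")
    return text.encode("utf-8")


stored = "中文".encode("big5")      # stand-in for bytes read from the datastore
served = big5_to_utf8(stored)       # what a UTF-8 web front end would send
print(served.decode("utf-8"))
```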

(trying again, seems my comment didn't make it through yesterday)

I'm not sure you're interpreting the numbers right. For me they say that the Unicode part grows faster than the total amount of data on the internet. Calculating based on your numbers, during those 19 months the non-Unicode part grew by ~12% while the Unicode part grew by ~68%; that looks pretty good to me...
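
The arithmetic behind those two percentages can be checked with a short Python sketch. The inputs are the post's rough figures (about 40% total growth, Unicode share going from 50% to 60%); the function name `segment_growth` is hypothetical.

```python
def segment_growth(total_growth, share_before, share_after):
    """Return (unicode_growth, non_unicode_growth) as fractions.

    total_growth: relative growth of all data (0.40 means +40%).
    share_before, share_after: Unicode share of the total, before and after.
    """
    total_before = 1.0
    total_after = total_before * (1.0 + total_growth)

    unicode_before = share_before * total_before
    unicode_after = share_after * total_after

    other_before = total_before - unicode_before
    other_after = total_after - unicode_after

    return (unicode_after / unicode_before - 1.0,
            other_after / other_before - 1.0)


unicode_growth, other_growth = segment_growth(0.40, 0.50, 0.60)
print(f"Unicode: {unicode_growth:.0%}, non-Unicode: {other_growth:.0%}")
# prints "Unicode: 68%, non-Unicode: 12%"
```

In absolute terms, out of every 100 units of data at the start, Unicode goes from 50 to 84 while everything else goes from 50 to 56.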

Very nice catch, Gert-Jan! Unicode pages ARE growing nicely, by ~68%! That's just fine if non-Unicode pages only increased by ~12% over the same 19 (?) month interval. A rate of growth in Unicode pages that exceeds that of the web overall during the stated time interval is great.

Not-meh, I say. Not-meh at all.
