by Michael S. Kaplan, published on 2012/03/01 03:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/03/01/10275202.aspx
Back on September 19, 2010, in ">50% of the web is Unicode? Meh, I say. Meh.", I wrote about the report from Mark Davis that as of August 9, 2010, Google found that > 50% of the text it was indexing was Unicode.
Now, a mere 19 months later, a new Google blog entry by the same Mark Davis, titled "Unicode over 60 percent of the web", explains that the new number is > 60%.
I'm still feeling kind of meh about this.
I'm just stuck back on that same point:
So, in an environment that claims that the amount of data will literally more than double in 48 months, it really took nineteen months to go from "just over 50%" to "just over 60%". By these rough numbers, the data on the Internet increased by ~40% over that time frame, so the fact that we only moved ten percentage points on the Unicode side seems worrying.
Why is so much of the data being created today on the Internet not in Unicode?
Note that ASCII is UTF-8, etc.
This seems a much more interesting question for Google to spend time on, so they can move more of the web to Unicode -- why not a blog post pointing out the nature of the new data that isn't using Unicode?
What are the ~20% of new pages that aren't in Unicode?
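To make the note that ASCII is UTF-8 concrete, here is a minimal Python sketch. The sample strings and the helper name is_valid_utf8 are made up for illustration; the point is only that pure ASCII bytes are already valid UTF-8, while Latin-1 bytes for an accented character are not.

```python
# Pure ASCII text produces the same bytes under ASCII and UTF-8.
sample = "Plain ASCII text, no accents."
ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")
same_bytes = ascii_bytes == utf8_bytes

# One non-ASCII character is enough to make the encodings diverge.
word = "caf\u00e9"  # "café"
latin1_bytes = word.encode("latin-1")    # single byte 0xE9 for the e-acute
utf8_word_bytes = word.encode("utf-8")   # two bytes 0xC3 0xA9 for the e-acute


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(same_bytes)                      # ASCII bytes equal UTF-8 bytes
print(is_valid_utf8(ascii_bytes))      # ASCII is valid UTF-8
print(is_valid_utf8(latin1_bytes))     # Latin-1 "café" is not valid UTF-8
```

This is why a page that never leaves the ASCII range can be counted on the Unicode side no matter what its author intended.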
Simon Buchan on 1 Mar 2012 4:14 AM:
To me, that graph mostly just makes it obvious both how little non-English content there is on the web, and (surprisingly!) how little non-English content providers care about Unicode (note how little of the non-Latin, non-ASCII was eaten by UTF-8). I expect a lot of the rise of UTF-8 was simply WordPress (et al) "smartening" formatting characters on posts, meaning pages incorrectly tagged as Latin-* pretty quickly show up as broken. Or does WordPress try to output matching Character-Encoding? I don't know PHP well enough to know how it deals with encodings.
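A rough illustration of the "show up as broken" effect, as a Python sketch: a "smartened" apostrophe is written out as UTF-8, then read back under a Latin-style label. Windows-1252 stands in here for the mistagged Latin-* encoding, and the sample text is invented; this says nothing about what WordPress itself actually does.

```python
# A "smart" apostrophe (U+2019) in otherwise plain English text.
smart_text = "It\u2019s"

# The page is stored as UTF-8: the apostrophe becomes three bytes, E2 80 99.
stored_bytes = smart_text.encode("utf-8")

# A browser trusting a Windows-1252 label decodes each of those bytes separately.
displayed = stored_bytes.decode("cp1252")

print(displayed)  # the apostrophe turns into three unrelated characters
```

The plain ASCII letters survive untouched, which is why such pages look fine until the first curly quote appears.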
John Cowan on 1 Mar 2012 10:26 AM:
I don't for a moment believe that the number of web pages in Google's indexes is growing anywhere near as fast as the total amount of data recorded anywhere by anyone for any purpose, so those numbers simply aren't comparable.
It's clear, looking at the graphs, that the growth of UTF-8+ASCII share is almost entirely at the expense of Latin-1-style encodings (including Win-1252, 8859-9, 8859-15, etc.), which have shrunk from a 2006 peak of 35% to today's 10%. Note that Google counts the encoding actually in use, not the declared encoding. Mark mentioned to me that the figure for UTF-16-style encodings, which might be thought to be a confounding factor, is less than 0.1%.
Joshua on 1 Mar 2012 11:22 AM:
I don't write Unicode until the file in question contains a non-ISO-8859-1 character.
Michael S. Kaplan on 1 Mar 2012 4:00 PM:
John, I wouldn't say they aren't comparable -- one is a microcosm of the other. So more like they are apples vs. apple orchards. :-)
Craig on 1 Mar 2012 5:49 PM:
Since more than half of the Internet is Unicode, how about finally changing the default Notepad encoding from "ANSI" to UTF-8?
3/5 users are apparently actively changing it already.
The other 2/5 (at least 1/5) users are likely ignorant of encoding anyway.
cheong00 on 1 Mar 2012 6:12 PM:
I often visit a few telnet-based BBS forums that use Big5 internally. The forums also support web and NNTP access. The encoding supported in these two "other ways" is, you guessed it, also Big5.
They could have used iconv to do the conversion, but I really couldn't blame them for not doing this. If their internal datastore (I can't tell if they use a database or some arbitrary storage scheme) stores data in Big5, it's a bit pointless to do conversion if their users don't need it at all.
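For reference, the kind of conversion iconv would perform can be sketched in Python. This uses Python's built-in big5 codec as a stand-in for iconv, and the sample string and the function name big5_to_utf8 are assumptions for illustration, not anything taken from those forums.

```python
def big5_to_utf8(data: bytes) -> bytes:
    """Hypothetical helper: re-encode Big5 bytes as UTF-8."""
    # Decode to Unicode text first, then encode to the target encoding.
    return data.decode("big5").encode("utf-8")


# Made-up sample: two common Chinese characters, as a Big5 datastore would hold them.
original_text = "\u4e2d\u6587"  # "中文"
big5_bytes = original_text.encode("big5")

utf8_bytes = big5_to_utf8(big5_bytes)

print(len(big5_bytes), len(utf8_bytes))  # Big5 uses 2 bytes per character here, UTF-8 uses 3
```

The conversion itself is one line; the cost is in doing it on every read and write when nobody on the site is asking for anything but Big5.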
Gert-Jan on 2 Mar 2012 7:26 AM:
(trying again, seems my comment didn't make it through yesterday)
I'm not sure you're interpreting the numbers right. For me they say that the Unicode part grows faster than the total amount of data on the internet. Calculating based on your numbers, during those 19 months the non-Unicode part grew by ~12% while the Unicode part grew by ~68%; that looks pretty good to me...
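The arithmetic behind those two figures can be checked with a short Python sketch. It takes the post's rough numbers at face value (~40% total growth, Unicode share going from 50% to 60%); the function name growth_by_part and the baseline of 100 units are just illustrative choices.

```python
def growth_by_part(total_growth, share_before, share_after):
    """Return (unicode_growth, non_unicode_growth) as fractions.

    total_growth: overall growth of the data set (0.40 means +40%)
    share_before, share_after: Unicode share before and after
    """
    before_total = 100.0  # arbitrary baseline; only the ratios matter
    after_total = before_total * (1 + total_growth)

    unicode_before = before_total * share_before
    unicode_after = after_total * share_after
    other_before = before_total - unicode_before
    other_after = after_total - unicode_after

    return (unicode_after / unicode_before - 1,
            other_after / other_before - 1)


unicode_growth, non_unicode_growth = growth_by_part(0.40, 0.50, 0.60)
print(f"Unicode: {unicode_growth:.0%}, non-Unicode: {non_unicode_growth:.0%}")
# prints: Unicode: 68%, non-Unicode: 12%
```

In units: 50 Unicode and 50 non-Unicode become 84 and 56 out of 140.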
EllieK on 3 May 2012 5:34 PM:
Very nice catch, Gert-Jan! Unicode pages ARE growing nicely, by ~68%! That's just fine if non-Unicode pages only increased by ~12% over the same 19 (?) month interval. A rate of growth in Unicode pages that exceeds that of the web overall during the stated time interval is great.
Not-meh, I say. Not-meh at all.