No UTF-8 in a VARCHAR column

by Michael S. Kaplan, published on 2007/07/27 02:51 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/07/26/4075990.aspx

This is actually quite a common request but the answer is not one that will make Francisco happy -- there isn't such a way.

I will forward the feedback on to folks on the SQL Server team (or they might just read about the request here since I think some of them are regulars now!), though it is worth pointing some things out:

First of all, for pure ASCII text, UTF-8 will always be smaller, but for any other text in Unicode it will be the same size or bigger, which really hurts the size argument.

The above is imperfect since it includes unassigned code values and it also includes high/low surrogates in the three byte group when they are not legal there, but you get the point that UTF-8 is not much of a space saver unless you stay in ASCII, and if you do stay in ASCII then any old code page would work....

Second of all, even if there were a way to make UTF-8 text work, it would not be free since the underlying engine and collation tables use UTF-16 for Unicode, there would be an associated performance hit due to the conversions. Now conversion is optimized as much as it can be, but it is still a non-zero cost.

And finally, though we do not change code pages, there have really on a regular basis been updates to UTF-8 to conform to the latest Unicode Standard guidelines around conformant UTF-8 conversion, which means that if such a feature existed there would be worries about index corruption if faulty, non-conformant UTF-8 was being stored. This would have to be a real concern, enough to make sure that invalid data was never stored in the database (suggesting a validation pass, which means additional work that would have to happen).

It is a common scenario to have most of the data in the ASCII range, while only having Unicode occasionally. In such cases, UTF-8 has a significant size advantage over UTF-16.

This is the reason many prefer UTF-8 as the Unicode serialization format. Consider the waste of space if everyone saved their .txt files in UTF-16 - most text is plain ASCII.

UTF-8 is compelling when you only have non-ASCII characters occasionally and you don't want to deal with code pages and the like - that's so 80's :)

Indeed, UTF-8 is the natural choice for Europe. We've got Latin-1, Latin-2 and the Euro sign €, but 90% of a random text will be < U+0080. For your typical XML, the choice is even more obvious. They always contain a lot of ASCII characters, and often tons of whitespace.

Free tip: Be green and strip useless whitespace from XML. You save on storage, bandwidth and processing. Profile it if you don't believe me, we were suprised.

> This is the reason many prefer UTF-8 as the Unicode serialization format.

But a database is not usually the place for "serialization" -- a database usually contains *raw* text. When you're talking XML or HTML then, sure, most of that is actual markup and hence pure ASCII. But for raw textual data, you're only going to have "mostly ASCII" when your primary language is English. And if your primary language is English, then you don't need Unicode anyway.

>> But for raw textual data, you're only going to have "mostly ASCII" when your primary language is English. And if your primary language is English, then you don't need Unicode anyway.

Dean, this is not strictly true.

Consider a directory listing of drive C: on the computer of a Hebrew speaking user. The vast majority of files are system files and the characters in their filenames are expected to be in the ASCII range, while a minority of files (those under My Documents, etc.) would be in Hebrew characters. It is clear that it is compelling to store such a listing in UTF-8 encoding.

This is not an example of something you'd put in a database, but there might be other things that you would that would have the same characteristic - the fact the user's native language is not English does not have to mean most of the data won't be ASCII.

Dean, the real world just doesn't work like that.

Here's a simple example, a test corpus I'm very familiar with, the US census TIGER data. It's primarily English, in that it's a survey of the (mostly man-made) geography of the US. However since it includes Puerto Rico it is not "pure ASCII" and any attempt to store it in SQL Server as ASCII is hopelessly ill-conceived.

The "regular basis" that Michael's talking about consists of one update brought about by a change in policy from Unicode itself, arbitrarily limiting the range of valid characters forever to those that can be expressed in UTF-16, and an earlier update to prohibit behaviour that had previously been implementation specific and was now considered unacceptable as a result of security considerations. Neither changed the meaning of any existing non-test data, and neither should have required any changes to well written software, for much the same reason that no-one loses sleep about the "Hangul mess". Whether SQL Server is in fact "well written" is an exercise for the reader.

Microsoft's own UTF-8 implementations have changed a lot more than the description above would imply, not because the standard is unclear or incomplete, but just because their original attempts diverged so much from the actual standard. Until relatively recently Internet Explorer would fail trivial UTF-8 test suites because whoever implemented the decoding not only failed to pay attention to the section labelled "Security Considerations" but apparently didn't even read the original napkin sketch algorithm for decoding UTF-8. Did this really have security implications for IE users? Probably, but finding serious security bugs in a web browser is like shooting fish in a barrel, the more serious issue is that this implementation never obeyed the standard, and no-one cared enough to even test it.

So in a way maybe the lack of UTF-8 in SQL Server is a blessing in disguise, if they did implement it, would you trust them to have got it right ?

Should only support UTF-8. Direct mapping to UTF-16. Always more efficient, except for some obscure Indian languages.

Cost of storing and retrieving nulls dwafs cost of conversion. In any case, to be correct conversion needs to be done to UTF-32, not UTF-16 -- there are much more than 64K characters. UTF-16 should die. Varchar with code pages should definitely die.

Postgresql gets this right.

I just use varchar for performance, and loose I18N.

Back on planet earth, backcompat breaks of the scale this would cause are unprecedented and in light of the SQL Server team's feelings on this matter highly improbable.

Also, 3-byte characters in UTF-8 is a lot more than just "obscure Indian languages", so you must not be doing very much language stuff....

On an OS that uses UTF-16 LE for everything, the most efficient is NEVER going to be UTF-8.