The disunification of Norwegian and Danish sorting ( SQL Server 2008 Edition!)

by Michael S. Kaplan, published on 2008/03/27 07:01 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2008/03/27/8338664.aspx


Content of Michael Kaplan's personal blog not approved by Microsoft (see disclaimer)! 

The story behind today's blog post started in Why do we call w 'double u' -- doesn't it look more like a 'double v'?, where I talked about the Swedish Academy's change to the way the letters W and V were to be handled in collation, and the impact on Microsoft software when this change eventually makes it to the point where it needs to be integrated.

In particular I discussed the implications for software such as Jet [Red] and SQL Server, which unify Swedish and Finnish collations together into one single collation, when the two collations become different. Obviously there are far-reaching implications here that have to be carefully considered.

And that one day this "theoretical" issue that is a punch line in a blog post from Raymond or I would have far reaching design consequences....

A couple of days later, I noted in The disunification of Norwegian and Danish sorting that there was actually already such a change in Vista -- a change in Norwegian collation that was not happening in Danish (two collations which had also been unified in Jet [Red] and SQL Server), a change that had already gone into Vista.

Obviously with each of these changes it is only a matter of time before the issue has to actually be addressed, but neither is theoretical -- like pregnancy, at some point the situation will force itself on people and cannot be ignored. And to continue the ridiculous pregnancy analogy for just one this one more sentence you are reading now, the Norwegian "baby" is much further along than the Swedish one....

Now on top of all of this, factor in Not all in sync quite yet (aka SQL and the CLR and Windows and .NET), which spent some time talking about the consequences of these various collation solutions that move further and further apart. And in particular looked more specifically at the Danish/Norwegian disunification issue and SQL Server, at the collations covering the two languages in SQLS 2005, and what to worry about going forward....

Now of course if you care about SQL Server, you probably noticed (though when I say this it is with full knowledge that at least one regular reader to whom it is crucial missed it!) my blog On changing the world, or at least the way people order things in it, where the Windows Server 2008 collation support in SQL Server 2008 feature had achieved line item status, and obviously the whole Danish/Norwegian disunification issue must have been solved!

And indeed it has.

If you take the query I used in this post:

SELECT
    name,
    COLLATIONPROPERTY(name, 'CodePage') as CodePage,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'LCID')) as LCID,
    CONVERT(binary(4), COLLATIONPROPERTY(name, 'ComparisonStyle')) as ComparisonStyle,
    description
FROM ::fn_helpcollations()
WHERE
    COLLATIONPROPERTY(name, 'LCID') = 1030 OR
    COLLATIONPROPERTY(name, 'LCID') = 1044 OR
    COLLATIONPROPERTY(name, 'LCID') = 2068

(this query that returns 20 rows in SQL Server 2005) and run it that CTP of SQL Server 2008 to see what it says.

And here it is:

name codepage lcid comparisonstyle description
Danish_Greenlandic_100_BIN 1252 0x00000406 0x00000000 Danish-Greenlandic-100, binary sort
Danish_Greenlandic_100_BIN2 1252 0x00000406 0x00000000 Danish-Greenlandic-100, binary code point comparison sort
Danish_Greenlandic_100_CI_AI 1252 0x00000406 0x00030003 Danish-Greenlandic-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Danish_Greenlandic_100_CI_AI_WS 1252 0x00000406 0x00010003 Danish-Greenlandic-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Danish_Greenlandic_100_CI_AI_KS 1252 0x00000406 0x00020003 Danish-Greenlandic-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Danish_Greenlandic_100_CI_AI_KS_WS 1252 0x00000406 0x00000003 Danish-Greenlandic-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Danish_Greenlandic_100_CI_AS 1252 0x00000406 0x00030001 Danish-Greenlandic-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Danish_Greenlandic_100_CI_AS_WS 1252 0x00000406 0x00010001 Danish-Greenlandic-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Danish_Greenlandic_100_CI_AS_KS 1252 0x00000406 0x00020001 Danish-Greenlandic-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Danish_Greenlandic_100_CI_AS_KS_WS 1252 0x00000406 0x00000001 Danish-Greenlandic-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Danish_Greenlandic_100_CS_AI 1252 0x00000406 0x00030002 Danish-Greenlandic-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Danish_Greenlandic_100_CS_AI_WS 1252 0x00000406 0x00010002 Danish-Greenlandic-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Danish_Greenlandic_100_CS_AI_KS 1252 0x00000406 0x00020002 Danish-Greenlandic-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Danish_Greenlandic_100_CS_AI_KS_WS 1252 0x00000406 0x00000002 Danish-Greenlandic-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Danish_Greenlandic_100_CS_AS 1252 0x00000406 0x00030000 Danish-Greenlandic-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Danish_Greenlandic_100_CS_AS_WS 1252 0x00000406 0x00010000 Danish-Greenlandic-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Danish_Greenlandic_100_CS_AS_KS 1252 0x00000406 0x00020000 Danish-Greenlandic-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Danish_Greenlandic_100_CS_AS_KS_WS 1252 0x00000406 0x00000000 Danish-Greenlandic-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_BIN 1252 0x00000406 0x00000000 Danish-Norwegian, binary sort
Danish_Norwegian_BIN2 1252 0x00000406 0x00000000 Danish-Norwegian, binary code point comparison sort
Danish_Norwegian_CI_AI 1252 0x00000406 0x00030003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CI_AI_WS 1252 0x00000406 0x00010003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CI_AI_KS 1252 0x00000406 0x00020003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CI_AI_KS_WS 1252 0x00000406 0x00000003 Danish-Norwegian, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_CI_AS 1252 0x00000406 0x00030001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CI_AS_WS 1252 0x00000406 0x00010001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CI_AS_KS 1252 0x00000406 0x00020001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CI_AS_KS_WS 1252 0x00000406 0x00000001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_CS_AI 1252 0x00000406 0x00030002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CS_AI_WS 1252 0x00000406 0x00010002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CS_AI_KS 1252 0x00000406 0x00020002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CS_AI_KS_WS 1252 0x00000406 0x00000002 Danish-Norwegian, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Danish_Norwegian_CS_AS 1252 0x00000406 0x00030000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Danish_Norwegian_CS_AS_WS 1252 0x00000406 0x00010000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Danish_Norwegian_CS_AS_KS 1252 0x00000406 0x00020000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Danish_Norwegian_CS_AS_KS_WS 1252 0x00000406 0x00000000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Norwegian_100_BIN 1252 0x00000414 0x00000000 Norwegian-100, binary sort
Norwegian_100_BIN2 1252 0x00000414 0x00000000 Norwegian-100, binary code point comparison sort
Norwegian_100_CI_AI 1252 0x00000414 0x00030003 Norwegian-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Norwegian_100_CI_AI_WS 1252 0x00000414 0x00010003 Norwegian-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Norwegian_100_CI_AI_KS 1252 0x00000414 0x00020003 Norwegian-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Norwegian_100_CI_AI_KS_WS 1252 0x00000414 0x00000003 Norwegian-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Norwegian_100_CI_AS 1252 0x00000414 0x00030001 Norwegian-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Norwegian_100_CI_AS_WS 1252 0x00000414 0x00010001 Norwegian-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Norwegian_100_CI_AS_KS 1252 0x00000414 0x00020001 Norwegian-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Norwegian_100_CI_AS_KS_WS 1252 0x00000414 0x00000001 Norwegian-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive
Norwegian_100_CS_AI 1252 0x00000414 0x00030002 Norwegian-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive
Norwegian_100_CS_AI_WS 1252 0x00000414 0x00010002 Norwegian-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive
Norwegian_100_CS_AI_KS 1252 0x00000414 0x00020002 Norwegian-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive
Norwegian_100_CS_AI_KS_WS 1252 0x00000414 0x00000002 Norwegian-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive
Norwegian_100_CS_AS 1252 0x00000414 0x00030000 Norwegian-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive
Norwegian_100_CS_AS_WS 1252 0x00000414 0x00010000 Norwegian-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive
Norwegian_100_CS_AS_KS 1252 0x00000414 0x00020000 Norwegian-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive
Norwegian_100_CS_AS_KS_WS 1252 0x00000414 0x00000000 Norwegian-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive
SQL_Danish_Pref_CP1_CI_AS 1252 0x00000406 0x00030001 Danish-Norwegian, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data
SQL_EBCDIC277_CP1_CS_AS 1252 0x00000406 0x00030000 Danish-Norwegian, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive for Unicode Data

So the old Danish_Norwegian collation still exists, but two new 10.0 collations (Danish_Greenlandic_100 and Norwegian_100) were also added, bringing the total number of rows this query returns to 56.

Now accidentally, the SQLS folks managed to work around an interesting conceptual backwards compatibility issue, too.

I'll explain. :-)

Now not every version of Windows will necessarily have an updated Unicode version that would require an updated default table. Which means that in theory had there not needed to be a new default table that as simple Norwegian collation would have been needed to be added.

If you look at the Danish_Norwegian collation, you will see how it is associates with LCID 0x00000406, aka MAKELANGID(LANG_DANISH, SUBLANG_DANISH_DENMARK), so the new collation could be added with the different LCID and the old one left right in place -- thus no change in behavior would be required for the existing collation!

But I imagine that the name (Danish_Norwegian) was likely based on the random chance of alphabetical order, as was the LCID they chose. So there was a 50% chance that this behavior-preserving change would not have been possible and that the same LCID would have required a behavior change while the newly added LCID wouldn't....

Luckily, Vista took a long time to ship and some pretty hefty Unicode updates made it in. So it was a non-issue, since two new collations ended up being needed irregardless of this issue.

Of course as it turns out they will get lucky again on the Finnish/Swedish disunification whenever that happens, since Finnish_Swedish uses 0x0000040b, aka MAKELANGID(LANG_FINNISH, SUBLANG_FINNISH_FINLAND) -- so the Swedish change in some unknown future version would just be a single additional set of collations, rather than two sets.

Two near misses, though. Though perhaps this is a good time to bring up that it may not have been the best design in the world to do all of this unification work, given how things can disunify over time. And perhaps one day the disunification will fall on the wrong side of random chance....

 

This blog brought to you by å (U+00e5, aka LATIN SMALL LETTER A WITH RING ABOVE)


# Ben Bryant on Thursday, March 27, 2008 1:04 PM:

irregardless? I don't have a problem with it, just a bit surprised...

# Michael S. Kaplan on Thursday, March 27, 2008 1:16 PM:

It is a word I use from time to time to tweak language mavens. :-)


referenced by

2010/03/03 Nordic duck duck goose -- Bokmål, Bokmål, Bokmål, Nynorsk!

2009/02/18 In search of the Swedish Tipping Point....

go to newer or older post, or back to index or month or day