by Michael S. Kaplan, published on 2006/06/02 21:44 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/02/615509.aspx
Since the time I posted Why do we call w 'double u' -- doesn't it look more like a 'double v'?, I have had several people ask me what the difference is between v and w in Swedish/Finnish, or more specifically how Microsoft implements it.
Now I have talked in the past about how distinctions work, in posts such as this one.
As always, we will head back to the sort keys for our answer!
v U+0076 0e a2 01 01 01 01 00
w U+0077 0e a2 01 03 01 01 01 00
Let's compare that to how things are in the US:
v U+0076 0e a2 01 01 01 01 00
w U+0077 0e a4 01 01 01 01 00
So, in English there is a primary distinction between the two letters, while in Swedish/Finnish there is only a secondary difference -- what is often called a diacritic difference, and one that can be ignored by passing the NORM_IGNORENONSPACE flag to CompareString, etc. In that case, the two letters would be considered equal....
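To make the primary vs. secondary distinction concrete, here is a minimal Python sketch that takes the sort key dumps above and splits them into weight fields. It assumes the fields are separated by 0x01 and the key ends in 0x00, with the first field holding the primary weights and the second the diacritic weights; that layout is read off the dumps in this post. The names `split_sort_key` and `compare_keys` are made up for illustration, and dropping the second field only models what NORM_IGNORENONSPACE does. It does not call CompareString.

```python
def split_sort_key(hex_key):
    """Split a sort key (written as hex bytes) into its 0x01-separated fields."""
    data = bytes.fromhex(hex_key)
    if not data.endswith(b"\x00"):
        raise ValueError("expected a 0x00 terminator")
    # Assumption: weight bytes never use 0x01, so it is safe to split on it.
    return data[:-1].split(b"\x01")


def compare_keys(a, b, ignore_diacritics=False):
    """Return -1, 0 or 1, optionally skipping the diacritic field (index 1)."""
    fields_a, fields_b = split_sort_key(a), split_sort_key(b)
    if ignore_diacritics:
        fields_a = fields_a[:1] + fields_a[2:]
        fields_b = fields_b[:1] + fields_b[2:]
    return (fields_a > fields_b) - (fields_a < fields_b)


# Sort keys copied from the dumps above.
SV_V = "0e a2 01 01 01 01 00"
SV_W = "0e a2 01 03 01 01 01 00"
EN_V = "0e a2 01 01 01 01 00"
EN_W = "0e a4 01 01 01 01 00"

print(compare_keys(SV_V, SV_W))                          # v sorts before w
print(compare_keys(SV_V, SV_W, ignore_diacritics=True))  # the two compare equal
print(compare_keys(EN_V, EN_W, ignore_diacritics=True))  # still different
```

In the Swedish/Finnish keys the only difference sits in the second field, so skipping that field makes the keys compare equal. In the US English keys the difference is in the first field, so it survives.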
Now note what this "almost equal" thing does to a list, e.g. the following list (in English):
becomes in Swedish and Finnish:
Ok, so the above is all very well known and straightforward.
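The effect of "almost equal" on a list can be modeled with a toy two-level key in Python. This is only a sketch: the strings are made up for illustration (they are not the list from this post), the function `toy_swedish_key` is hypothetical, it only handles lowercase v and w, and it is not how the Windows tables are built.

```python
def toy_swedish_key(word):
    """Two-level key: v and w are equal at level 1 and differ only at level 2."""
    primary = word.replace("w", "v")  # level 1: treat w as v
    secondary = tuple(1 if ch == "w" else 0 for ch in word)  # level 2: which were w
    return (primary, secondary)


words = ["wb", "va", "wa", "vb"]  # made-up sample strings

english_like = sorted(words)                       # v and w differ at level 1
swedish_like = sorted(words, key=toy_swedish_key)  # v and w differ at level 2

print(english_like)  # ['va', 'vb', 'wa', 'wb']
print(swedish_like)  # ['va', 'wa', 'vb', 'wb']
```

Because the whole string is compared at the first level before the second level is consulted, the w entries end up interleaved with the v entries instead of following them.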
Let's talk about some bugs now. :-)
First we will look at another letter in the Swedish/Finnish table -- ŵ, a.k.a. LATIN SMALL LETTER W WITH CIRCUMFLEX. Let's look at the two normalization forms and see what we get:
ŵ U+0175
0e a2 01 12 01 01 01 00
ŵ U+0077 U+0302
0e a2 01 13 01 01 01 00
Ok, looks like a small mismatch there (luckily fixed in Vista when all the work to line up both normalization forms was done).
But now let's look at a grey area -- other W-like things that are not seen in Swedish or Finnish often enough to have been suggested in the language-specific tables:
ẘ U+1e98
0e a4 01 1a 01 01 01 00
ẘ U+0077 U+030a
0e a2 01 1b 01 01 01 00
It makes sense, though. We did move the w which is a piece of the Form D character, but not the ẘ which was not actually moved anywhere.
Do we fare better in Vista?
ẘ U+1e98
0e a4 01 1a 01 01 01 00
ẘ U+0077 U+030a
0e a2 01 1b 01 01 01 00
Not at present, it looks like (Beta 2).
But if you think about it, this perhaps makes some sense, especially if the character itself is not used even in loan words. Trying to take every W-ish thing and move it, even when it is not used in the language, and then doing the same across all letters and all languages, could become a pretty huge issue in rather short order -- there are a lot of letters.
Graham Asher talks about this issue in his document entitled Better Collation Rule Markup: a critique of Locale Definition Markup Language (he points it out for Turkic languages), though he actually suggests a very different scheme to address this sort of problem, one that he is clearly in favor of seeing addressed.
Personally, I tend to fall back on the "pass appropriate strings" argument, myself. People who feel very strongly about this sort of thing can normalize their text....
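As a rough illustration of the "normalize your text" approach, the sketch below uses Python's standard `unicodedata` module to bring both spellings of ẘ to a single form before they would be handed to any comparison. This is a stand-in for the idea only, not the Windows normalization API, and the helper name `to_form_c` is made up. Which form to pick is a judgment call; Form C is used here simply so that every string arrives in the same shape.

```python
import unicodedata


def to_form_c(text):
    """Return text in Unicode Normalization Form C (precomposed where possible)."""
    return unicodedata.normalize("NFC", text)


precomposed = "\u1e98"  # LATIN SMALL LETTER W WITH RING ABOVE
decomposed = "w\u030a"  # w followed by COMBINING RING ABOVE

# The raw strings differ, but after normalization they are identical,
# so they can no longer pick up two different sort keys.
print(precomposed == decomposed)                        # False
print(to_form_c(precomposed) == to_form_c(decomposed))  # True
```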
In other words, this other issue is also only of secondary importance, in the non-"collation pun" sense. :-)
This post brought to you by ẘ (U+1e98, a.k.a. LATIN SMALL LETTER W WITH RING ABOVE)