It is only of SECONDARY importance

by Michael S. Kaplan, published on 2006/06/02 21:44 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/06/02/615509.aspx


Since the time I posted Why do we call w 'double u' -- doesn't it look more like a 'double v'?, I have had several people ask me what the difference is between v and w in Swedish/Finnish, or more specifically how Microsoft implements it.

Now I have talked in the past about how distinctions work, in posts such as this one.

As always, we will head back to the sort keys for our answer!

v   U+0076   0e a2 01    01 01 01 00

w   U+0077   0e a2 01 03 01 01 01 00

Let's compare that to how things are in the US:

v   U+0076   0e a2 01    01 01 01 00

w   U+0077   0e a4 01    01 01 01 00

So, in English there is a primary distinction between the two letters, while in Swedish/Finnish there is a secondary difference -- what is often called a diacritic difference and that can be ignored with the NORM_IGNORENONSPACE flag to CompareString, etc. In that case, the two letters would be considered equal....
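To make the primary/secondary distinction concrete, here is a minimal sketch in Python that takes the four sort keys shown above apart. It is a toy model, not the Win32 API: the helper names `split_levels` and `compare_keys` are made up for illustration, and it assumes only what the dumps themselves suggest, namely that levels are separated by 01 bytes and the key ends with 00.

```python
def split_levels(sort_key_hex):
    """Split a hex-dumped sort key into its weight levels.

    Assumption: levels are separated by 0x01 and the key ends with 0x00,
    as in the dumps above. Level 0 is the primary (alphabetic) weight,
    level 1 the diacritic weight.
    """
    key = bytes.fromhex("".join(sort_key_hex.split()))
    if not key.endswith(b"\x00"):
        raise ValueError("sort key should end with 00")
    return key[:-1].split(b"\x01")


def compare_keys(a_hex, b_hex, ignore_nonspace=False):
    """Return -1, 0 or 1; optionally skip the diacritic level (index 1)."""
    a, b = split_levels(a_hex), split_levels(b_hex)
    if ignore_nonspace:
        a = a[:1] + a[2:]
        b = b[:1] + b[2:]
    return (a > b) - (a < b)


# Sort keys copied from the post.
SV_V = "0e a2 01    01 01 01 00"   # v, Swedish/Finnish
SV_W = "0e a2 01 03 01 01 01 00"   # w, Swedish/Finnish
EN_V = "0e a2 01    01 01 01 00"   # v, US English
EN_W = "0e a4 01    01 01 01 00"   # w, US English

print(compare_keys(SV_V, SV_W))                        # -1: v before w
print(compare_keys(SV_V, SV_W, ignore_nonspace=True))  # 0: considered equal
print(compare_keys(EN_V, EN_W, ignore_nonspace=True))  # -1: still different
```

The last line is the interesting one: in the US table the difference lives in the primary weight, so dropping the diacritic level changes nothing.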

Now note what this "almost equal" thing does to a list, e.g. the following list (in English):

becomes in Swedish and Finnish:

Ok, so the above is all very well known and straightforward.

Let's talk about some bugs now. :-)

First we will look at another letter in the Swedish/Finnish table -- ŵ, a.k.a. LATIN SMALL LETTER W WITH CIRCUMFLEX. Let's look at the two normalization forms and see what we get:

ŵ   U+0175
0e a2 01 12 01 01 01 00

ŵ   U+0077 U+0302
0e a2 01 13 01 01 01 00

Ok, looks like a small mismatch there (luckily fixed in Vista when all the work to line up both normalization forms was done).
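For anyone who wants to see the two forms side by side, here is a small sketch using Python's standard `unicodedata` module (not the Windows APIs discussed here). It only shows that the two spellings of ŵ are canonically equivalent even though they are different code point sequences; the `code_points` helper is just for display.

```python
import unicodedata

precomposed = "\u0175"   # LATIN SMALL LETTER W WITH CIRCUMFLEX
decomposed = "w\u0302"   # w + COMBINING CIRCUMFLEX ACCENT


def code_points(text):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


print(code_points(unicodedata.normalize("NFD", precomposed)))  # U+0077 U+0302
print(code_points(unicodedata.normalize("NFC", decomposed)))   # U+0175
print(precomposed == decomposed)                               # False
```

So a collation table has to handle both spellings explicitly if it wants them to get the same weights.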

But now let's look at a grey area -- other W-like things that are not seen in Swedish or Finnish often enough to have been suggested in the language-specific tables:

ẘ   U+1e98
0e a4 01 1a 01 01 01 00

ẘ   U+0077 U+030a
0e a2 01 1b 01 01 01 00

It makes sense, though. We did move the w, which is a piece of the Form D character, but not the ẘ, which was not actually moved anywhere.

Do we fare better in Vista?

ẘ   U+1e98
0e a4 01 1a 01 01 01 00

ẘ   U+0077 U+030a
0e a2 01 1b 01 01 01 00

Not at present, it looks like (Beta 2).

But if you think about it, this perhaps makes some sense, especially if the character itself is not used even in loan words. Trying to take every W-ish thing and move it, even if it is not used in the language, and then spreading this across all letters and all languages, could become a pretty huge issue in rather short order -- there are a lot of letters.
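To get a rough feel for how many "W-ish things" there are, here is a small sketch that asks Python's `unicodedata` module for every character whose canonical decomposition starts with a given base letter. The function name `precomposed_forms_of` is made up for this example, and the result depends on the Unicode version your Python ships with, so treat the output as indicative only.

```python
import unicodedata


def precomposed_forms_of(base):
    """Return characters whose canonical decomposition starts with base."""
    found = []
    for cp in range(0x110000):
        ch = chr(cp)
        nfd = unicodedata.normalize("NFD", ch)
        # len > 1 excludes the base letter itself.
        if len(nfd) > 1 and nfd[0] == base:
            found.append(ch)
    return found


for ch in precomposed_forms_of("w"):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Run the same loop over every base letter and multiply by the number of language-specific tables, and the size of the problem becomes clear.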

Graham Asher talks about this issue in his document entitled Better Collation Rule Markup: a critique of Locale Definition Markup Language (he points it out for Turkic languages), though he actually does suggest a very different scheme to attempt to address this sort of problem, one that he is clearly in favor of seeing addressed.

Personally, I tend to fall back on the "pass appropriate strings" argument, myself. People who feel very strongly about this sort of thing can normalize their text....
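A minimal sketch of what "pass appropriate strings" could look like in practice, again in Python rather than against the Windows APIs: wrap whatever key function you use so that it only ever sees text in one normalization form. Both `normalized` and `code_point_key` are hypothetical names, and `code_point_key` is only a stand-in for a real collation key function.

```python
import unicodedata


def normalized(key_func, form="NFC"):
    """Wrap key_func so it always sees text in one normalization form."""
    def wrapper(text):
        return key_func(unicodedata.normalize(form, text))
    return wrapper


def code_point_key(text):
    # Hypothetical stand-in for a real collation key function.
    return tuple(ord(ch) for ch in text)


key = normalized(code_point_key)

ring_c = "\u1e98"    # LATIN SMALL LETTER W WITH RING ABOVE
ring_d = "w\u030a"   # w + COMBINING RING ABOVE

print(code_point_key(ring_c) == code_point_key(ring_d))  # False
print(key(ring_c) == key(ring_d))                        # True
```

Whether Form C or Form D is the better choice depends on which form the tables you are using handle well; the point is just to pick one and stick with it.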

In other words, this other issue is also only of secondary importance, in the non-"collation pun" sense. :-)

 

This post brought to you by ẘ (U+1e98, a.k.a. LATIN SMALL LETTER W WITH RING ABOVE)


# Serge Wautier on 3 Jun 2006 5:59 PM:

FWIW, it is called 'double-v' in french (and pronounced as a v).

We, French speaking Belgians, are somewhat inconsistent: We call it double-v but pronounce it the English way, which sounds more like 2 u's than 2 v's.

referenced by

2006/10/29 SQL Server: compatibility collations vs. Window collations

2006/06/02 Je, for sure, from Sweden.
