I swear the Latvian bug is fixed; it was fixed 4.5 years ago!

by Michael S. Kaplan, published on 2010/07/13 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/07/13/10037608.aspx

It had been happening for quite literally years.

People were complaining about it for years, I mean.

It was to do with the Latvian collation.

In Windows, in SQL Server, in .Net, in Office applications in general, in Access in particular.

Which basically means they didn't like the collation in Windows, since the other products were getting their data from Windows.

Now this had been going on for a long time.

Like since NT4 at least. Maybe even before that.

Actually I just checked. It was first this incorrect result back in NT 3.51, which is when the locale was added to Windows (it was not in NT 3.1 or NT 3.5).

It was therefore wrong in the locale since its birth in Windows.

Now some time around middle of December of 2005 (just before the point when half the company stops working for the holidays to use up their vacation time they are about to lose), one of these reports found its way to me.

The report was essentially as follows:

...pointed me to you for help with an issue we’ve found with Latvian sorting. The string sorting currently treats accented characters the same as unaccented ones, while he believes they should be separated. For example, the following strings currently get sorted like this:

a
ā
ābols
ala
auns
āzis
b

While he believes the order should be:

a
ala
auns
ā
ābols
āzis
b

Is this correct? Thanks!

Now as it turns out, this report was entirely correct.

While many other of the characters with diacritics on them that in Latvian would be expected to sort with primary distinction, the A Macron letters were not.

I indirectly talked about this very case around the end of the month in a blog entitled What's a secondary distinction?. Though I did not identify the language at the time.

It was Latvian.

Soon after, this bug was fixed in Vista. That fix eventually made its way into SQL Server 2008 with their new Latvian collation and .Net 4.0 with their collation update, etc.

The fix was simple enough and was based on data retrieved on the larger set of requirements than the one letter -- treat A Macron and E Macron the way C Hacek, G Cedilla, K Cedilla, L Cedilla, N Cedilla, S Hacek, and Z Hacek were being treated -- give them unique primary (alphabetic) weight, rather than treat them as being only "diacritic-ally" different from their unencumbered forms.

And then all was good, everyone was happy, everyone was pleased with Microsoft, and no one was complaining about the Latvian collation anymore.

Except for the fact that it wasn't. And they weren't. And they weren't. So they still were.

Now the complaints after the change in Vista was available fell into to some specific buckets:

Some people complained that our results were different from prior versions;
Some people complained that the new collation did not work though it turned out they were not using the new collation;
Some people complained about the new results but did so in ways that were either internally contradictory or did not match our known experts and existing standards;
Some people complained that it did not work right without giving details;
Some people complained that we were not conformant to the Latvian National Standard LVS 24: 1993 which you can look at here if you want to purchase it. It is in Latvian and English if that helps.

Now the first four categories required no changes though the reasons for that fact (and thus the response given) for each was different.

But that fifth case is quite interesting.

You see, at this point (starting with the new Vista results), we were almostentirely conformant to LVS 24:1993.

But there were four differences:

LVS 24: 1993 puts "O macron" and "o macron" in a unique alphabetic weight after "O" and "o".
LVS 24: 1993 puts "R cedilla" and "r cedilla" in a unique alphabetic weight after "R" and "r".
LVS 24: 1993 puts "Y" and "y" just after "I" and "i", kind of like Lithuanian does.
LVS 24: 1993 puts capitals come before small letters (like the Hungarian Technical Sort and unlike every linguistic sort we have for Latin script letters).

Now the first three differences in this list relate to old characters from the orthography of Latvian prior to some reforms that date back about June of 1946 (see here for more info on that).

Which just leaves the last one, and it is one that kind of puts Latvian in the same category as that Hungarian sort, described in blogs like Technically it *is* a hungarian sort and Coloring outside the lines in the a-ness of the Hungarian Technical Sort.

But note this is one that by report no one else does differently either (I have not tried every product in the world, but no one has yet claimed a product that works differently -- if you know of one, please share!). And the collation is consistent with our model in general.

And although people occasionally bring that up, usually they jump back on the A Macron/E Macron bandwagon and say those are wrong.

Pretty much universally they talk about Windows 2003 or Windows XP or Access 2003 or Access 2007 or .NET 1.0/1.1/2.0/3.0/3.5 or SQL Server's Latvian_CS_AS/SQL_Latvian_CP1257_CS_AS collations, all of which would be expected to show the wrong results.

Now I do not blame them here -- they are seeing incorrect results. In their shoes, I'd be unhappy too.

But it isn't my fault either (I not only did not introduce the original issue; I am also the guy who fixed it in Windows!).

And I have no control over the other products or when they get the fix. Or how.

Some test code you can try yourself:

using System;
using System.Text;
using System.Runtime.InteropServices; public class Test {
    public static void Main() {
        Console.Write("\u0061 vs. \u0101 w/NORM_IGNORESPACE on LANG_ENGLISH: ");
        Console.WriteLine(CompareStringW(0x0409, NORM_IGNORENONSPACE, "\u0061", -1, "\u0101", -1) - 2); 
        Console.Write("\u0061 vs. \u0101 w/NORM_IGNORESPACE on LANG_LATVIAN: ");
        Console.WriteLine(CompareStringW(0x0426, NORM_IGNORENONSPACE, "\u0061", -1, "\u0101", -1) - 2); 
        DumpSortKey("\u0061", 0x0000000, 0X0409); DumpSortKey("\u0101", 0x0000000, 0X0409); 
        DumpSortKey("\u0061", 0x0000000, 0X0426); DumpSortKey("\u0101", 0x0000000, 0X0426);
    }

    [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true, CallingConvention=CallingConvention.StdCall, SetLastError=true)]
    private static extern int LCMapStringW(uint Locale, uint dwMapFlags, string lpSrcStr, int cchSrc, byte[] lpDestStr, int cchDest); [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true, CallingConvention=CallingConvention.StdCall, SetLastError=true)]
    private static extern int LCMapStringW(uint Locale, uint dwMapFlags, string lpSrcStr, int cchSrc, StringBuilder lpDestStr, int cchDest); [DllImport("kernel32.dll", CharSet=CharSet.Unicode, ExactSpelling=true, CallingConvention=CallingConvention.StdCall, SetLastError=true)]
    private static extern int CompareStringW(uint Locale, uint dwCmpFlags, string lpstr1, int cch1, string lpstr2, int cch2); private static uint LCMAP_SORTKEY               = 0x00000400;
    private static uint NORM_IGNORENONSPACE         = 0x00000002;

    public static void DumpSortKey(string stIn, uint flags, uint lcid) {
        int siz = LCMapStringW(lcid, LCMAP_SORTKEY | flags, stIn, stIn.Length, (byte[])null, 0);
        if(siz > 0) { Console.Write(lcid.ToString("x8") + " "); DumpString(stIn); byte[] rgbyt = new byte[siz]; siz = LCMapStringW(lcid, LCMAP_SORTKEY | flags, stIn, stIn.Length, rgbyt, rgbyt.Length); DumpSortKey(rgbyt); } } public static void DumpSortKey(byte[] rgbyt) { foreach(byte byt in rgbyt) { Console.Write(byt.ToString("x2"));
            Console.Write(" ");
        }
        Console.WriteLine(); } public static void DumpString(string stIn) { Console.Write(stIn + ' ');
        foreach(char ch in stIn) { Console.Write("U+");
            Console.Write(((ushort)ch).ToString("x4"));
            Console.Write(" ");
        }
    }
}

That code will return the following in Vista, Server 2008, Windows 7, Server 2008 R2, and beyond:

a vs. ā w/NORM_IGNORESPACE on LANG_ENGLISH: 0
a vs. ā w/NORM_IGNORESPACE on LANG_LATVIAN: -1
00000409 a U+0061 0e 02 01 01 01 01 00
00000409 ā U+0101 0e 02 01 17 01 01 01 00
00000426 a U+0061 0e 02 01 01 01 01 00
00000426 ā U+0101 0e 03 01 01 01 01 00

You can try it yourself if you don't believe me. :-)

So anyway, for the record here is the actual data of all the letters reported as meaningful parts of the alphabet in modern Latvian (in Latvia):

0x0061 14 2 2 2 ;a
0x0041 14 2 2 18 ;A
0x0101 14 2 23 2 ;a Macron
0x0100 14 2 23 18 ;A Macron
0x0101 14 3 2 2 ;a Macron
0x0100 14 3 2 18 ;A Macron
0x0062 14 9 2 2 ;b
0x0042 14 9 2 18 ;B
0x0063 14 10 2 2 ;c
0x0043 14 10 2 18 ;C
0x010d 14 10 20 2 ;c Hacek
0x010c 14 10 20 18 ;C Hacek
0x010d 14 11 2 2 ;c Hacek
0x010c 14 11 2 18 ;C Hacek
0x0064 14 26 2 2 ;d
0x0044 14 26 2 18 ;D
0x0065 14 33 2 2 ;e
0x0045 14 33 2 18 ;E
0x0113 14 33 23 2 ;e Macron
0x0112 14 33 23 18 ;E Macron
0x0113 14 34 2 2 ;e Macron
0x0112 14 34 2 18 ;E Macron
0x0066 14 35 2 2 ;f
0x0046 14 35 2 18 ;F
0x0067 14 37 2 2 ;g
0x0047 14 37 2 18 ;G
0x0123 14 37 28 2 ;g Cedilla
0x0122 14 37 28 18 ;G Cedilla
0x0123 14 38 2 2 ;g Cedilla
0x0122 14 38 2 18 ;G Cedilla
0x0068 14 44 2 2 ;h
0x0048 14 44 2 18 ;H
0x0069 14 50 2 2 ;i
0x0049 14 50 2 18 ;I
0x012b 14 50 23 2 ;i Macron
0x012a 14 50 23 18 ;I Macron
0x012b 14 51 2 2 ;i Macron
0x012a 14 51 2 18 ;I Macron
0x006a 14 53 2 2 ;j
0x004a 14 53 2 18 ;J
0x006b 14 54 2 2 ;k
0x004b 14 54 2 18 ;K
0x0137 14 54 28 2 ;k Cedilla
0x0136 14 54 28 18 ;K Cedilla
0x0137 14 71 2 2 ;k Cedilla
0x0136 14 71 2 18 ;K Cedilla
0x006c 14 72 2 2 ;l
0x004c 14 72 2 18 ;L
0x013c 14 72 28 2 ;l Cedilla
0x013b 14 72 28 18 ;L Cedilla
0x013c 14 73 2 2 ;l Cedilla
0x013b 14 73 2 18 ;L Cedilla
0x006d 14 81 2 2 ;m
0x004d 14 81 2 18 ;M
0x006e 14 112 2 2 ;n
0x004e 14 112 2 18 ;N
0x0146 14 112 28 2 ;n Cedilla
0x0145 14 112 28 18 ;N Cedilla
0x0146 14 123 2 2 ;n Cedilla
0x0145 14 123 2 18 ;N Cedilla
0x006f 14 124 2 2 ;o
0x004f 14 124 2 18 ;O
0x0070 14 126 2 2 ;p
0x0050 14 126 2 18 ;P
0x0072 14 138 2 2 ;r
0x0052 14 138 2 18 ;R
0x0073 14 145 2 2 ;s
0x0053 14 145 2 18 ;S
0x0161 14 145 20 2 ;s Hacek
0x0160 14 145 20 18 ;S Hacek
0x0161 14 151 2 2 ;s Hacek
0x0160 14 151 2 18 ;S Hacek
0x0074 14 153 2 2 ;t
0x0054 14 153 2 18 ;T
0x0075 14 159 2 2 ;u
0x0055 14 159 2 18 ;U
0x016b 14 159 23 2 ;u Macron
0x016a 14 159 23 18 ;U Macron
0x016b 14 160 2 2 ;u Macron
0x016a 14 160 2 18 ;U Macron
0x0076 14 162 2 2 ;v
0x0056 14 162 2 18 ;V
0x007a 14 169 2 2 ;z
0x005a 14 169 2 18 ;Z
0x017e 14 169 20 2 ;z Hacek
0x017d 14 169 20 18 ;Z Hacek
0x017e 14 171 2 2 ;z Hacek
0x017d 14 171 2 18 ;Z Hacek

The entries crossed out in grey are the ones from the default table that are overridden in Latvian, and the ones in red are the ones that do the overriding. And the one in black are the ones that do not change between the default sort and the Latvian one.

Perhaps one day when all of the various products are updated with this data, for once and all this bug can be laid to rest.

But in the meantime, I fully expect keep getting dozens of emails a year (at a minimum) for a bug that was first added to Windows in the mid-90's and which was fixed in Windows in very late 2005/early 2006....

The bug is fixed.

Michael S. Kaplan on 13 Jul 2010 7:13 AM:

You can find yet another example of this report from customers here:

www.sqlmonster.com/.../Sorting-Latvian-text-alphabetically

and really all over the place....

Andris on 28 Oct 2010 7:54 AM:

Paldies!

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.

referenced by

2011/06/08 Wait til you see my Õ (Ō), Latvian edition

2010/12/13 I swear the Romanian bug is fixed; it was fixed 4.5 years ago!

go to newer or older post, or back to index or month or day