Surrogate pairs and binary (Ordinal) comparisons

by Michael S. Kaplan, published on 2005/02/11 08:08 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/11/371010.aspx

If you look at the description of the Ordinal sort in the CompareOptions enumeration in the .NET Framework, it seems pretty clear:

Indicates that the string comparison must be done using the Unicode values of each character, which is a fast comparison but is culture-insensitive. A string starting with "U+xxxx" comes before a string starting with "U+yyyy", if xxxx is less than yyyy. This flag cannot be combined with other flags and must be used alone.

Now when looking at Unicode characters on the Basic Multilingual Plane (BMP) this seems straightforward enough -- every character from U+0000 to U+FFFF is included and the binary comparison is done on the actual code point values. But what about the rest of Unicode?

All of the characters in Unicode from U+10000 to U+10FFFF must also be in some kind of order (as I mentioned in Comparison confusion: INVARIANT vs. ORDINAL, every code point is given weight in an ordinal comparison, even if it is not yet assigned in the standard). And there are two possible ways to handle these characters:

1. Characters can be sorted by their absolute Unicode code point, even when they are actually two UTF-16 code units deep down (a surrogate pair), or
2. The UTF-16 code units can be sorted themselves, so that the two code units that make up a surrogate pair are treated as two separate values to sort
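
The difference between the two options can be sketched as a pair of sort keys. This is a minimal Python illustration, not .NET code: the function names are hypothetical, and encoding to big-endian UTF-16 is used here only as a convenient way to get at the individual code units.

```python
def code_point_key(s: str) -> list[int]:
    """Option 1: compare by Unicode code point."""
    return [ord(ch) for ch in s]


def utf16_code_unit_key(s: str) -> list[int]:
    """Option 2: compare by UTF-16 code unit; a surrogate pair counts as two values."""
    raw = s.encode("utf-16-be")
    # Each code unit is two bytes, most significant byte first.
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


bmp_char = "\ufffd"           # U+FFFD, a single code unit near the top of the BMP
supplementary = "\U00010400"  # U+10400, stored in UTF-16 as the pair D801 DC00

# Option 1 puts U+FFFD first; option 2 puts U+10400 first, because D801 < FFFD.
print(code_point_key(bmp_char) < code_point_key(supplementary))            # True
print(utf16_code_unit_key(bmp_char) < utf16_code_unit_key(supplementary))  # False
```

The two keys agree for any pair of BMP characters outside the surrogate range; they only disagree when a supplementary character is compared with a BMP character above U+DFFF.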

The text in the definition of CompareOptions.Ordinal is not very clear on which would be expected, and the definition seems to suggest that Option #1 is what has in fact been done. How else would one read

A string starting with "U+xxxx" comes before a string starting with "U+yyyy", if xxxx is less than yyyy.

if U+xxxxx and U+yyyyy were not handled the same way?

Unfortunately, this guess is incorrect -- the raw UTF-16 code unit values are used. So if one looks at a list like the following:

A U+0041 (LATIN CAPITAL LETTER A)
̣ U+0323 (COMBINING DOT BELOW)
� U+fffd (REPLACEMENT CHARACTER)
U+ffff (not a character in Unicode)
𐐀 U+10400 (DESERET CAPITAL LETTER LONG I)
𡔳 U+21533 (EXTENSION B IDEOGRAPH)

and sorts them using CompareOptions.Ordinal, they will be sorted as the UTF-16 code units dictate; thus the order would be:

A U+0041 (LATIN CAPITAL LETTER A)
̣ U+0323 (COMBINING DOT BELOW)
𐐀 U+d801 U+dc00 (a.k.a. U+10400, DESERET CAPITAL LETTER LONG I)
𡔳 U+d845 U+dd33 (a.k.a. U+21533, EXTENSION B IDEOGRAPH)
� U+fffd (REPLACEMENT CHARACTER)
U+ffff (not a character in Unicode)
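
The surrogate values and the resulting order above can be checked with a short sketch. It is written in Python rather than .NET, so it only imitates an ordinal comparison by ordering on UTF-16 code units; the helper name is made up for this illustration.

```python
def to_utf16_units(code_point: int) -> list[int]:
    """Return the UTF-16 code unit(s) for one Unicode code point."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits -> low surrogate
    return [high, low]


code_points = [0x0041, 0x0323, 0xFFFD, 0xFFFF, 0x10400, 0x21533]

# Imitate the ordinal comparison: order by UTF-16 code units.
by_code_unit = sorted(code_points, key=to_utf16_units)

for cp in by_code_unit:
    units = " ".join(f"{u:04x}" for u in to_utf16_units(cp))
    print(f"U+{cp:04X} -> {units}")
```

Sorting the same list by code point instead (a plain `sorted(code_points)`) leaves the two supplementary characters at the end, which is the order the documentation's wording would lead one to expect.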

This probably could be clearer in the documentation, though it is not likely to affect people, since an Ordinal comparison is by its very nature useful not so much for its actual ordering as for having an unambiguous order....

This post brought to you by "𡔳" (a.k.a. U+21533, an Extension B ideograph)

# Serge Wautier on 11 Feb 2005 6:31 AM:

It could have been worse: A rough WORD sort breaking the surrogate pairs !

# Dmitry Jemerov on 13 Feb 2005 11:19 PM:

Looks like you found a third bug by your blogging...
http://blogs.jetbrains.com/yole/archives/000035.html

If it's really a bug in .NET BCL, could you please send it to the appropriate place?

# Michael Kaplan on 13 Feb 2005 11:24 PM:

What I reported was not a bug (well, maybe a doc bug!).

What you report is a different issue, and if true it definitely sounds like a bug -- I will forward it on. :-)
