by Michael S. Kaplan, published on 2007/12/10 10:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/12/10/6719043.aspx
Previous posts in this series:
Thus far, I have almost entirely (well, ignoring the second post, which this one expands way beyond the 1-2 character difference implied there) been not focusing on the fact that there are issues beyond the default table.
Now in the last post, regular reader John Cowan asked:
Thorn, what happened to Thorn? It's properly the 27th basic letter of the Latin alphabet. And of course Yogh should sort with G, not with Z and Ezh; the whole point of Yogh was to discriminate between the G-like and Z-like forms.
If you look at the code points cited, there are only Ezh ones there -- but in the source file it happens to be calling U+0292 and U+01b7 the small and capital Yogh rather than the Ezh. I didn't even look at the code points, I just quickly characterized the comments when I looked at the list.
And of course if you look way up higher in the order the real Yogh (small and capital appear more like where one would expect them).
Very weird, yes -- luckily the letters themselves are right -- as was the gadget that gave me character names and links! :-)
The Thorn is by default treated as being an EXPANSION and it becomes the letters "TH" or "th" -- so try not to make that the last letter of a range, unless it is run under Icelandic or something....
But in the end, the issue here is the way characters are added to the table. I talked about this in How does Microsoft assign new collation weights?, and the delicate balance between taking letters not used by the languages that use the default table between:
does not really show a great deal of consistency across the last decade or so that it has been going on (in fact the only consistent rule when picking between these three choices is that if a choice does not apply it is not used!). Luckily as long as the letters are not used by locales that use the default table, this particular inconsistency won't ever make a difference for users (as long as the right collation is used -- if not, then a whole bunch of other stuff is broken too!).
The guarantees for letters that are used in the languages covered by the table they are in is much more predictable and consistent.
But when you think about language differences and how a whole bunch of language will take letters like Å, Ä, and Ö and place them at the end of the Latin script after Z, once again you are left with the overall foolishness of trying to to cover the alphabet with any kind of expression involving literals or a range.
If you ignore this one, then changing the collation will suddenly lose all kinds of characters!
Now I am almost ready to swoop in with the \w solution here, but I'll wait until tomorrow to do that, when I also explain why it won't work either. :-)
This post brought to you by Þ (U+00de, aka LATIN CAPITAL LETTER THORN)
John Cowan on 10 Dec 2007 12:45 PM:
An expansion? Eww. Gross. If you are dealing with a text that randomly veers from þ to th, you might as well just transliterate every þ to th to start with, rather than looking at a list of words with the þs mixed in with the th's.
It's not like there's any real conflict here, unlike the situation with the umlauted vowels. Icelandic is the only modern language that uses þ at all, and its place in that alphabet is firmly established: right after z (not at the end, actually). Historically both y and z were also additions to the original Latin alphabet, and they too were placed after the last existing letter, which was u at the time (the split between u and v came much later).
There are late mediaeval manuscripts that write þ and y exactly the same, and the convention is to transliterate it all to y, on the grounds that ye for the (as in "Ye Olde Shoppe") is much more readable than mþkþll for mykyll (modern dialect mickle 'much'). But transliteration rules are not the same as rules for sorting the letter þ itself.
(Note: Not having italics makes this post much less readable, I'm afraid.)
Michael S. Kaplan on 10 Dec 2007 1:05 PM:
I emboldened you, hope that helps. :-)
The longstanding behavior of U+00de/U+00fe is one I would not be eager to change after all this time since it is pretty widespread, even to the point of seeing it move a bit in languages that do something special with the th (e.g. Albanian, Vietnamese, Welsh).
The only collation that puts it anywhere different than that is Icelandic (ICELANDIC_CS_AS in SQL Server, for example).
Would it be fair to say this is kind of a thorn in your side? :-)
go to newer or older post, or back to index or month or day