by Michael S. Kaplan, published on 2011/03/01 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/03/01/10134580.aspx
The question I got via email:
Hi Michael,
A random question for you. :D
There are some Greek characters that have the same weight in unisort.txt, like the case below, and LCMapString actually returns the same sort key.
0x1f12 15 10 85 2 ;Greek Small Epsilon With Psili And Varia
0x1f15 15 10 85 2 ;Greek Small Epsilon With Dasia And Oxia
What’s the story behind these chars?
It seems there are noticeably many more instances of these among Greek characters.
Thank you
Interesting, huh?
If you look at the two characters, you will get a hint of what is going on here.
Let's take a look:
ἒἕ
They look kind of the same, or similar, depending on the font you use. Let's zoom in to see differences more clearly:
ἒἕ
Yes, some differences there, obviously.
One suggestion I have heard before is that they actually represent slightly different traditions of showing the same Polytonic Greek text, and that treating them as equal in collation is beneficial: when you look for a Greek word written in one tradition, you will have an easier time finding it written in the other.
I have even parroted it once or twice over the years myself, this idea that GREEK SMALL LETTER EPSILON WITH PSILI AND VARIA and GREEK SMALL LETTER EPSILON WITH DASIA AND OXIA are connected that way.
It may be true, what with all the polytonic etc. stuff. It may be a reasonable idea, but it is not the original basis for the weights being the same.
For insight into the actual reason, let's take these letters and put them in Unicode Normalization form D, which will give us:
U+1f12 --> U+03B5 U+0313 U+0300
U+1f15 --> U+03B5 U+0314 U+0301
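If you want to reproduce that decomposition yourself, here is a minimal sketch using Python's standard unicodedata module. The helper name nfd_code_points is just something made up for illustration, and this is not what the Windows collation code does internally; it only shows what Normalization Form D produces for the two characters.

```python
import unicodedata

def nfd_code_points(ch):
    """Return the NFD decomposition of ch as a list of 'U+XXXX' strings."""
    decomposed = unicodedata.normalize("NFD", ch)
    return [f"U+{ord(c):04X}" for c in decomposed]

# The two precomposed characters from the question.
for ch in ("\u1f12", "\u1f15"):
    print(f"U+{ord(ch):04X} -->", " ".join(nfd_code_points(ch)))
```

In both cases the result is the base letter epsilon followed by a breathing mark and then an accent.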
Let's look at the weights behind these characters:
0x03b5 15 10 2 2 ;Greek Small Epsilon
0x0300 1 0 13 0 ;Non-Spacing Grave Accent
0x0301 1 0 12 0 ;Non-Spacing Acute Accent
0x0313 1 0 70 0 ;Non-Spacing Comma Above
0x0314 1 0 71 0 ;Non-Spacing Reversed Comma Above
Do you see what's going on here? Simple addition!
2 + 13 + 70 == 2 + 12 + 71
If these weights didn't work out the same, then different Unicode Normalization forms wouldn't be treated like they are equal to each other....
Now perhaps these two characters shouldn't be considered the same, but it isn't like these are the only four combining characters -- there are hundreds of them. Can you imagine being able to detect every single one of these cases across every language? These two were probably largely by accident -- an accident of applying canonical equivalence, since U+1f12 and U+1f15 weren't even in the tables before Vista/Server 2008 and before that only existed in their "normalization form D" constructs....
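The addition can be checked with a small sketch. It assumes the third column of the table entries quoted above is the diacritic weight and simply sums it over each decomposed sequence; the names DIACRITIC_WEIGHT and combined_diacritic_weight are hypothetical, and the real sorting code is certainly more involved than this.

```python
# Diacritic weights (third column) copied from the table entries quoted above.
DIACRITIC_WEIGHT = {
    0x03B5: 2,   # Greek Small Epsilon
    0x0300: 13,  # Non-Spacing Grave Accent
    0x0301: 12,  # Non-Spacing Acute Accent
    0x0313: 70,  # Non-Spacing Comma Above
    0x0314: 71,  # Non-Spacing Reversed Comma Above
}

def combined_diacritic_weight(code_points):
    """Sum the diacritic weights of a decomposed sequence (illustrative only)."""
    return sum(DIACRITIC_WEIGHT[cp] for cp in code_points)

psili_varia = [0x03B5, 0x0313, 0x0300]  # NFD form of U+1F12
dasia_oxia = [0x03B5, 0x0314, 0x0301]   # NFD form of U+1F15

print(combined_diacritic_weight(psili_varia))
print(combined_diacritic_weight(dasia_oxia))
```

Both sums come out to 85, which is the value that shows up in the third column of the unisort.txt entries for 0x1f12 and 0x1f15 in the original question.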
Canonical equivalence helped make the weights what they were here, though.
So ...are they actually different? I have no idea, man. They're all greek to me.
John Cowan on 1 Mar 2011 10:37 AM:
In Ancient Greek the grave (varia) is just a variant of the acute (oxia) used in the last syllable of a word when another word follows. The breathings were used only on the first syllable of a word. So ἒ would only be useful in a monosyllabic word beginning with epsilon. The Greek dictionary lookup at Perseus shows no such words. The (obsolete) polytonic orthography of Modern Greek might have some, I don't know.
Nick Nicholas on 1 Mar 2011 6:22 PM:
The case where it will make a difference is eta: ἥ is the fem.nom.sg relative pronoun, ἤ (and by extension ἢ) is "or".
That said, in indexes, where this is most critical, grave is always normalised to acute (because they're words in isolation, and grave indicates that a word is followed by another word, rather than punctuation.) You'll *almost* never see a grave in a dictionary headword.
So while it's wrong, the default use case for sorting words is not affected by it.
Nick Nicholas on 1 Mar 2011 6:31 PM:
... But while the grave and acute are not by default contrastive, rough and smooth breathing definitely are. So if it weren't for the fact that indexes normalise graves to acutes, you would be in trouble.
This applies to pre-1960 polytonic for Modern Greek; around 1960 Modern Greek dropped graves completely.
There are one or two word pairs differentiated by being acute-only vs unaccented in Ancient Greek: τίς "who?" (which always has an acute) vs τις "someone". In at least one dictionary, I've seen them differentiated at the headword level as τίς vs. τὶς, acute vs grave as opposed to acute vs unaccented. (The grave actually represented a neutral pitch grammatically, and it was originally marked on all unaccented syllables.) As far as I know, no such pairs involve an initial breathing: to Unicode's stupendous good fortune, the feminine definite article can only be ἡ, and not ἣ.
Michael S. Kaplan on 1 Mar 2011 8:17 PM:
Grave and acute alone are different -- it is only these unique combinations that lose the distinction....
Nick Nicholas on 15 Mar 2011 9:10 AM:
Posted more about this at my blog. Michael, feel free to correct misrepresentations of Microsoft... :-)
Michael S. Kaplan on 15 Mar 2011 9:20 AM:
No, I think you covered it quite nicely. Thanks for the thorough answer to my question! :-)