by Michael S. Kaplan, published on 2008/03/17 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/03/17/8273915.aspx
Please read disclaimer; content of Michael Kaplan's blog not approved by Microsoft!
Pronunciation-based sorting for Traditional Chinese is commonly requested, and the requests fall into two broad categories:
Cantonese pronunciations -- The requests have been for Pinyin-style orderings based on the Cantonese pronunciations of the ideographs. As I mentioned in The Cantonese IME, however, there is no single accepted transliteration standard that maps ideographs to Cantonese pronunciations. Adopting a scheme like Jyutping would leave the system of use to only some people, and it is difficult to assess whether the benefits would outweigh the costs.
Mandarin pronunciations -- Again the requests have been for a Pinyin-style ordering, but one based on pronunciations of Traditional Chinese ideographs. The data for this does not exist either: we currently only have Bopomofo pronunciation data for ~48,566 ideographs, based on a Taiwanese standard that could be missing ideographs used in Hong Kong and Macao. And as I mentioned in Is it Macau or is it Macao?, it is unclear how close the pronunciations provided by China (for the 62,289 ideographs they covered) are to the expected results in these other markets. Mandarin is Mandarin, obviously -- but it is unlikely that there is enough in the way of standards here to guide what the expected pronunciations should be.
Given the problems that seem to exist for stroke-based sorts (ref: How bad does it need to be in order to be not good enough, anyway?), perhaps I can be forgiven my skepticism for the ability of the PRC-provided Mandarin pronunciations to match what they would expect in Taiwan or Macao....
There are, interestingly, even some people in Taiwan who have expressed interest in a Mandarin-style ordering. This is theoretically easier to do, by mapping Bopomofo to Pinyin, if one can come up with agreed-upon ways to do that mapping. For example, take the first twelve entries in the Bopomofo table, which all have a Bopomofo pronunciation of ㄅㄚ (BOPOMOFO LETTER B + BOPOMOFO LETTER A), with the PRC-provided Pinyin given for reference and to prove that using the data as is may not match expectations:
| Pinyin |
Now obviously one could take all 48,566 ideographs and use this information to produce Pinyin-esque letters. Further down the table other Bopomofo readings appear, many of which include tone marks:
|first||U+02c9||MODIFIER LETTER MACRON||ˉ|
|second||U+02ca||MODIFIER LETTER ACUTE ACCENT||ˊ|
|fourth||U+02cb||MODIFIER LETTER GRAVE ACCENT||ˋ|
In the table of data, the first tone is never included, so mapping as follows is easy enough:
ㄅㄚ ---> ba1
ㄅㄚˊ ---> ba2
ㄅㄚˇ ---> ba3
ㄅㄚˋ ---> ba4
ㄅㄚ˙ ---> ba5
Thus it seems like it would be quite easy to do the transformation.
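The tone mapping above can be sketched in a few lines of Python. This is only an illustration, not how any shipping collation table is built: the function name `to_numbered_pinyin` and the one-entry `SYLLABLES` table are hypothetical, and a real table would need an agreed-upon Bopomofo-to-Pinyin mapping for every syllable. It also assumes the tone mark, when present, is the last character of the reading.

```python
# Tone marks that can follow a Bopomofo syllable, mapped to Pinyin tone numbers.
TONE_NUMBERS = {
    "\u02c9": "1",  # ˉ first tone (never present in the data, kept for completeness)
    "\u02ca": "2",  # ˊ second tone
    "\u02c7": "3",  # ˇ third tone
    "\u02cb": "4",  # ˋ fourth tone
    "\u02d9": "5",  # ˙ neutral tone
}

# Hypothetical, deliberately tiny syllable table (assumption for illustration).
SYLLABLES = {
    "\u3105\u311a": "ba",  # ㄅㄚ
}


def to_numbered_pinyin(reading: str) -> str:
    """Convert a Bopomofo reading such as 'ㄅㄚˊ' to numbered Pinyin ('ba2')."""
    tone = "1"  # an unmarked reading is treated as first tone
    if reading and reading[-1] in TONE_NUMBERS:
        tone = TONE_NUMBERS[reading[-1]]
        reading = reading[:-1]
    # Raises KeyError for syllables missing from the (tiny) table.
    return SYLLABLES[reading] + tone


if __name__ == "__main__":
    for mark in ("", "\u02ca", "\u02c7", "\u02cb", "\u02d9"):
        print(to_numbered_pinyin("\u3105\u311a" + mark))
```

The tone handling really is mechanical; the hard part is filling in the syllable table and, as the rest of this post suggests, deciding whose pronunciation data to trust.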
So at least in Taiwan the resultant Pinyin-esque sort might be just what people are looking for, and whether that matches what other people expect would have to be determined to see if it would be a useful ordering in Macao or Hong Kong or parts of China where Traditional Chinese is preferred.
The overall problem of applicability in Hong Kong and Macao is really just another piece of the same puzzle that came up before -- without information it is hard to fathom how relevant the data would be.
Plus there are the political issues inherent in providing a Pinyin-esque sort for Taiwan: even with some people thinking it a good idea, there may be just as many who fear the long-term consequences of such a thing. Not to mention that it would bring differences between the PRC data and the Taiwan data into much sharper focus!
I think it might be interesting to have some people in Macao and Hong Kong look at all of the differences between the PRC Pinyin data and the transformed Bopomofo (such as the four examples I gave) and determine which one they would expect to be the most common pronunciation. If their expectations veered toward the Taiwan choice then a sensible Traditional Chinese Pinyin pronunciation could emerge for any/all countries using Traditional Chinese, though the characters needed for Mandarin (if any) in those other countries would probably also have to be added, too....
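A comparison like the one suggested here could start from something as small as the sketch below. It takes the twelve ㄅㄚ sample entries, using the code points and PRC-provided Pinyin as they are quoted in the comments to this post, and lists the ones where the PRC reading differs from the reading derived from the Bopomofo. The names `SAMPLE`, `DERIVED_READING` and `find_mismatches` are made up for illustration, and treating the unmarked ㄅㄚ as "ba1" is the same assumption as in the mapping above.

```python
# (code point, PRC-provided Pinyin) for the twelve sample entries whose
# Bopomofo reading is ㄅㄚ, as quoted in the comments to this post.
SAMPLE = [
    (0x516B, "ba1"),
    (0x4EC8, "ba1"),
    (0x5DF4, "ba1"),
    (0x6252, "ba1"),
    (0x53ED, "ba1"),
    (0x738C, "qiu2"),
    (0x6733, "ba1"),
    (0x3EA9, "jiu4"),
    (0x3EAB, "qiu2"),
    (0x5427, "ba5"),
    (0x5C9C, "ba1"),
    (0x23CA9, "ba1"),
]

# Assumption: ㄅㄚ with no tone mark is read as first tone.
DERIVED_READING = "ba1"


def find_mismatches(rows, derived):
    """Return (code point label, character, PRC reading) where the readings differ."""
    return [
        (f"U+{cp:04X}", chr(cp), prc)
        for cp, prc in rows
        if prc != derived
    ]


if __name__ == "__main__":
    for label, char, prc in find_mismatches(SAMPLE, DERIVED_READING):
        print(label, char, "PRC:", prc, "Bopomofo-derived:", DERIVED_READING)
```

A list like this is only the raw material; deciding which reading people in each market actually expect still takes human reviewers.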
At this point I remind myself that I don't work on collation in Windows anymore, which does make pie-in-the-sky speculation about future version support a less than useful endeavor....
It is fun to think about potential solutions to the problems from time to time, though.
This blog brought to you by ㄅㄆㄇㄈ (U+3105 U+3106 U+3107 U+3108, aka BOPOMOFO LETTERS B P M F)
# John Cowan on 17 Mar 2008 12:07 PM:
I'd be more inclined to trust Bopomofo pronunciation data than any other kind, for two reasons: (a) Bopomofo is widely used in education, where people care about orthoepy, and (b) it has never been politicized -- nobody has ever suggested replacing hanzi with Bopomofo, and all the Chinese-speaking governments agree on what the letters mean and how they are to be used.
In addition, Bopomofo is a complete and unambiguous representation of the phonemic distinctions of Mandarin, so you can mechanically transliterate it to correct hanyu pinyin, tongyong pinyin, Wade-Giles, Gwoyeu Romatzyh, Yale Mandarin, or whatever.
# Michael S. Kaplan on 17 Mar 2008 12:46 PM:
The part that is still politicizable (or at least variable between markets and regions) is when you look at Han with multiple possible pronunciations, choosing the most common one.
Well, that and the fact that if one moves to something Pinyin-esque, one has to choose which of those systems to use! :-)
Even though "everyone knows Bopomofo", generally speaking no one is trying to use it outside of Taiwan for a collation at this point....
# Andrew West on 17 Mar 2008 7:26 PM:
"I'd be more inclined to trust Bopomofo pronunciation data than any other kind..."
My experience is that most Taiwan data tends to be dismally unreliable, and that appears to be borne out by the short sample that Michael provides, where the ㄅㄚ "ba" readings for U+738C, U+3EA9 and U+3EAB look very suspect to me. If you add one more column to the table, showing Mandarin readings from Unihan (kMandarin field), you can see that the "ba" readings for these characters are not supported by the standard dictionaries:
U+516B 八 ㄅㄚ ba1 BA1
U+4EC8 仈 ㄅㄚ ba1 BA1
U+5DF4 巴 ㄅㄚ ba1 BA1
U+6252 扒 ㄅㄚ ba1 BA1 PA1 PA2
U+53ED 叭 ㄅㄚ ba1 BA1 BA5
U+738C 玌 ㄅㄚ qiu2 QIU2
U+6733 朳 ㄅㄚ ba1 BA1
U+3EA9 㺩 ㄅㄚ jiu4 JIU4 SE4
U+3EAB 㺫 ㄅㄚ qiu2 QIU2
U+5427 吧 ㄅㄚ ba5 BA5 BA1
U+5C9C 岜 ㄅㄚ ba1 BA1
U+23CA9 𣲩 ㄅㄚ ba1 --
There may be a few characters that have different Mandarin pronunciations in Taiwan compared with PRC (the pronunciation of 和 in the sense "and" as "he" in PRC but as "han" in Taiwan springs to mind), but I strongly doubt that U+738C, U+3EA9 and U+3EAB have a special Taiwan Mandarin pronunciation of ba.
But given that very many characters have more than one pronunciation, I wonder how useful a phonetic sort is. Take U+6A02 樂 (乐) for example, it has two equally common meanings with very different readings (along with a few more less common readings), either "happy" pronounced "le4" or "music" pronounced "yue4". Whichever pronunciation you sort it under it's bound to be wrong for about half of the sort items.
# Michael S. Kaplan on 17 Mar 2008 7:49 PM:
Hi Andrew --
Yes, this is the limitation of table based sort without an underlying dictionary to differentiate between multiple pronunciations. It is therefore not suspect so much as potentially misleading to people looking for the wrong one.
Though the sort is thought well of by the people I know from Taiwan, FWIW. I just don't know how much that would apply to people in Hong Kong and Macao....
# Daniel Cheng on 17 Mar 2008 10:55 PM:
Hong Kong and Macao use Cantonese in daily life, and nobody takes transliteration seriously (among all transliteration methods, the Hong Kong government "standard" sucks most... it is inconsistent). Even if you can pick a "standard", nobody really cares.
# Jim Kay on 19 Mar 2010 8:49 PM:
Bopomofo is complete (at least when used in Taiwan) but it certainly is not UNAMBIGUOUS! A single Bopomofo sequence, including tone mark, can easily map to over 100 different characters. (My current project is to develop a machine-readable Bopomofo index for my Far East dictionary so I'm looking for a mapping table.)
# Michael S. Kaplan on 21 Mar 2010 2:15 AM:
It is unambiguous in relation to sound, though not (as you point out) to what Han it may refer to (many Han can have multiple pronunciations too!)....