The PUA? P.U. !
by Michael S. Kaplan, published on 2005/09/26 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/26/473738.aspx
Not everything is in Unicode.
I mean, it is close, but there are still lots of things that are not there. They fall into two categories:
- Items that are not appropriate for encoding within Unicode, such as already encoded characters, logos, and other such things;
- Items that make sense to encode but have not been encoded yet.
There are some areas of the Unicode code space that have been set aside for supporting such situations. The name for them is the Private Use Area. From the Unicode Glossary:
- Private Use. Refers to designated code points in the Unicode Standard or other character encoding standards whose interpretations are not specified in those standards and whose use may be determined by private agreement among cooperating users.
- Private-Use Code Point. Code points in the ranges U+E000..U+F8FF, U+F0000..U+FFFFD, and U+100000..U+10FFFD. (See definition D12 in Section 3.5, Properties.) These code points are designated in the Unicode Standard for private use.
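To make those three ranges concrete, here is a minimal Python sketch that tests whether a code point is private use. The names `PRIVATE_USE_RANGES`, `is_private_use` and `private_use_count` are made up for illustration; only the ranges themselves come from the glossary entry above.

```python
import unicodedata

# Private-use ranges exactly as listed in the glossary entry quoted above.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),      # the PUA on the Basic Multilingual Plane
    (0xF0000, 0xFFFFD),    # plane 15
    (0x100000, 0x10FFFD),  # plane 16
)


def is_private_use(cp: int) -> bool:
    """Return True if code point cp falls inside one of the private-use ranges."""
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def private_use_count() -> int:
    """Total number of private-use code points across all three ranges."""
    return sum(hi - lo + 1 for lo, hi in PRIVATE_USE_RANGES)


if __name__ == "__main__":
    for cp in (0x0041, 0xE000, 0xF8FF, 0xF900, 0xFFFFD, 0xFFFFE, 0x10FFFD):
        # General category "Co" is how the character database marks private use.
        category = unicodedata.category(chr(cp))
        print(f"U+{cp:04X} private use: {is_private_use(cp)} (category {category})")
    print("private-use code points in total:", private_use_count())
```

The range check and the `Co` general category reported by Python's `unicodedata` module should agree; the second is just the same information read out of the character database.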
One of the most important points is in that first definition: "whose use may be determined by private agreement among cooperating users"
This would seem to be no better than the hack font solutions that often go with non-Unicode solutions out there. And to be honest, it is not really any better at all, being basically a shtetl (or, if one prefers, ghetto) for characters that are not in Unicode at this time.
Now that I think about it, I guess it is a little better than a font hack in that it does act like a shtetl and keeps the custom stuff out of pure character data that uses assigned Unicode code points.
But it also means that none of the Unicode property data, or fonts/font linking, or shaping engines can be used on big operating systems, since they do not know what the characters are. Without arguing or even attempting to argue the semantic content of a term like "private use", it is just plain common sense that if it is in Windows, then it is hardly private....
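As a rough illustration of what "no property data" looks like in practice, the sketch below asks Python's `unicodedata` module about a PUA character and about an ordinary letter. This is only a stand-in for what an operating system's tables hold, and the helper name `describe` is hypothetical.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few character properties; name is None when the database has none."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, None),             # None instead of raising ValueError
        "category": unicodedata.category(ch),           # "Co" means private use
        "decomposition": unicodedata.decomposition(ch), # empty string when there is none
    }


if __name__ == "__main__":
    print(describe("A"))        # an assigned character: has a name and a real category
    print(describe("\uE000"))   # first BMP private-use code point: no name, category Co
```

The database can say that the code point is private use, and nothing else: no name, no decomposition, nothing a shaping engine or font linker could act on.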
And this is why (for example) PUA code points are not considered "sortable" according to the IsNLSDefinedString function or the CompareInfo.IsSortable method, even though they are (in fact) given sort weights.
Web standards tend to think along the same lines, and PUA characters are not considered acceptable for use in identifiers in many contexts, such as XML. They are definitely second-class citizens.
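For the XML case specifically, here is a hedged sketch of a name check. The ranges are transcribed from the NameStartChar and NameChar productions as I understand them from XML 1.1 and XML 1.0 (Fifth Edition), so verify them against the specification before relying on this; the function names are made up for illustration.

```python
# Ranges transcribed from the XML NameStartChar production (an assumption to
# verify against the spec). Note the gap between U+D7FF and U+F900, and the
# upper bound of U+EFFFF: all three private-use ranges are left out.
NAME_START_RANGES = (
    (0x3A, 0x3A), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
)

# Extra characters that NameChar allows after the first position.
NAME_EXTRA_RANGES = (
    (0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7), (0x300, 0x36F), (0x203F, 0x2040),
)


def _in_ranges(cp: int, ranges) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_name_start_char(ch: str) -> bool:
    """True if ch may begin an XML name under the ranges above."""
    return _in_ranges(ord(ch), NAME_START_RANGES)


def is_name_char(ch: str) -> bool:
    """True if ch may appear anywhere in an XML name under the ranges above."""
    cp = ord(ch)
    return _in_ranges(cp, NAME_START_RANGES) or _in_ranges(cp, NAME_EXTRA_RANGES)


def is_xml_name(s: str) -> bool:
    """True if s is non-empty, starts with a name-start char, and continues with name chars."""
    return bool(s) and is_name_start_char(s[0]) and all(is_name_char(c) for c in s[1:])


if __name__ == "__main__":
    for candidate in ("item", "item\uE000", "\uF8FFlogo", "x" + chr(0x100000)):
        print([f"U+{ord(c):04X}" for c in candidate], is_xml_name(candidate))
```

Under these ranges, any name containing a private-use code point is rejected, whether it comes from the BMP or from planes 15 and 16.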
With that said, they will work. You have to build your own font, you cannot hope to have the shaping done for you by Uniscribe, and you have to build your own keyboard with a tool like MSKLC. But if you do all of that and arrive with no unrealistic expectations (and of course if you distribute your fonts and keyboards and so on to users yourself!) then things should work well for you.
But never forget that you are in the "do it yourself" or the "some assembly required" area of Unicode. The area where the only thing that is done for you is to give your characters a segregated space where they will not be mistaken for other, valid characters....
This post brought to you by "" (U+f8ff, a.k.a. the last PUA character on the Basic Multilingual Plane)
# Vorn on 26 Sep 2005 3:11 AM:
Linked image shows what I see on Safari.
I think it's kinda funny.
# Michael S. Kaplan on 26 Sep 2005 11:18 AM:
Hmmmm.... the PUA, taking a bite out of Apple? :-)
I wonder what font sticks *that* glyph into the PUA?
# Mihai on 26 Sep 2005 5:40 PM:
"I wonder what font sticks *that* glyph into the PUA?"
I think all Apple fonts.
If you check the Apple Roman code page (ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/APPLE/ROMAN.TXT)
you will see this comment:
# The following corporate zone Unicode character is used in this mapping:
# 0xF8FF Apple logo
and in the mapping table:
0xF0 0xF8FF # Apple logo
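The same mapping can be observed from Python, whose built-in `mac_roman` codec follows this table. A small sketch; the helper names `decode_mac_roman_byte` and `encode_to_mac_roman` are made up for illustration.

```python
def decode_mac_roman_byte(byte_value: int) -> str:
    """Decode a single Mac OS Roman byte to its Unicode character."""
    return bytes([byte_value]).decode("mac_roman")


def encode_to_mac_roman(ch: str) -> int:
    """Encode a single character to its Mac OS Roman byte value."""
    return ch.encode("mac_roman")[0]


if __name__ == "__main__":
    ch = decode_mac_roman_byte(0xF0)
    # Byte 0xF0 lands on the last BMP private-use code point, as in the table above.
    print(f"0xF0 -> U+{ord(ch):04X}")
    print(f"U+F8FF -> 0x{encode_to_mac_roman(chr(0xF8FF)):02X}")
```

So text converted from Mac OS Roman can legitimately contain U+F8FF, and only a font that follows Apple's private agreement will draw it as the logo.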
# Tom Van Hauwaert on 29 Sep 2005 4:29 AM:
I am computerising a symbol script that is used in dance. The script itself has existed for many years, but users are still struggling with bitmaps and pen tablets when it comes down to using the symbols on a computer.
I agree that the PUA can be seen as the Unicode version of hacked fonts. But the PUA helps me do business right now, as the hacked fonts did before Unicode was supported.
Developers need some kind of sketch area. Since the Unicode Consortium is the only body that can move private scripts out of the PUA and into another plane, I first need to prove to them that the script is well defined, understood and agreed upon by the user community.
In the meantime I am stuck with the PUA. Not that I regret it, because moving my script out of the BMP would bring me lots of new problems. For example, font development. I did develop a font that supports my script and that also contains the basic Latin set. This allows users to open documents in Notepad (single font for the whole document) and still see all (Latin) characters and symbols. When the symbols are moved out of the BMP, this will no longer be possible, because a TrueType font is designed with only one plane in mind. The cmap table contains 0xffff elements and there is no way to tell the font that some glyphs are in plane 0 and other glyphs in plane 1. I am not sure whether OpenType supports multiple planes; I didn't do much reading about it.
At first glance, the PUA was the perfect solution for me. Later, when I tried to input my PUA characters into Microsoft applications, I thought PUA? PU! Because nothing that works for normal characters works for PUA characters: ALT+NUM KEYS (automatic switch to the SimSun font), the Office international symbols add-in (no support for the PUA and private fonts), charmap (because my symbols are too large to fit in the available space), etcetera.
Luckily, the keyboard layout creator doesn't complain about the PUA and works perfectly. Unfortunately, my script contains more than 1000 symbols and therefore I will end up with 10 keyboard layouts. Not the perfect solution, but it is a start, and in the meantime I keep working on the TSF, which is so much harder to get working.