The PUA outside of Unicode

by Michael S. Kaplan, published on 2007/05/26 11:55 -04:00, original URI:

Colleague Aldo Donetti asked me:

Hi Michael, I was investigating a bug and it turns out that this character (U+E843) is in the Private use range but it is also part of the Chinese 936 codepage.

The issue is whether to consider characters in the private use area as valid characters in Identifiers (e.g. in VB/C#/WebService names/…) – I would not allow them but I’m not too familiar with that range so I’m double checking with you. At present it is not allowed (as weird as it may seem).


Now the Private Use Area is a part of Unicode that I have discussed before (ref: previous posts). In particular, I have talked about the relationship between the PUA and EUDC (End User Defined Characters) like in this post.

But an important thing to keep in mind is that the PUA is not just a Unicode thing.

In fact, all of the East Asian code pages have areas set aside for private use, specifically intended for the kind of characters that EUDC is intended for. The various ranges used (shown in the registry at HKLM\SYSTEM\CurrentControlSet\Control\Nls\CodePage\EUDCCodeRange) are:

Looking at U+E843 (which is definitely in the Unicode PUA, covered in the defined range above) and its code page 936 mapping to 0xFE7E, it just kind of makes sense that the various ranges map to each other -- where else could they really map to if not to each other?

But the behavior that does not allow them in identifiers sounds like a very good one, and it should not change. Whether one is in the Unicode PUA or the PUA of a code page, one is not looking at good candidates for identifiers....
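To make that rule concrete, here is a minimal sketch in Python of what such a check might look like. It is an illustration only, not how VB, C#, or any web service stack actually validates names: the helper names has_private_use and is_acceptable_identifier are hypothetical, and the identifier rules used are Python's own (str.isidentifier), which I am assuming as a stand-in. The one piece it leans on from Unicode itself is the general category Co, which is assigned to private use code points.

```python
import unicodedata


def has_private_use(text: str) -> bool:
    """Return True if any character in text has general category Co (private use)."""
    return any(unicodedata.category(ch) == "Co" for ch in text)


def is_acceptable_identifier(name: str) -> bool:
    """Hypothetical rule: a valid Python identifier containing no private-use characters."""
    return name.isidentifier() and not has_private_use(name)


# U+E843 is the character from the question; it sits in the Unicode PUA.
candidate = "name\ue843"

print(has_private_use(candidate))        # True
print(candidate.isidentifier())          # False
print(is_acceptable_identifier("name"))  # True
```

Python's own identifier rules already reject U+E843, so the explicit Co test is redundant here; it is spelled out so the intent survives if the check is ported to a language with looser identifier rules.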


This post brought to you by (U+E843, a code value in the Unicode Private Use Area)
