Code units vs. code points

by Michael S. Kaplan, published on 2005/08/12 17:39 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/08/12/451043.aspx


Dmitri suggested:

You mentioned somewhere in the blog that the length of Unicode strings passed to Windows API calls is to be counted in code points. MSDN however usually mentions 'number of TCHARs' or 'number of wide characters' (WideCharToMultiByte) or 'number of WORDs' (TextOut). This looks like code units to me in fact. To prove that, I did some testing by converting a string containing surrogate pairs to UTF-8 using WideCharToMultiByte (on W2K and 2003) and checking how many characters it consumed (which is seen from what it returns). It has clearly shown that it counts the input string in code units. Any comments? And is there any API at all that operates with code points?

Yes, this was sloppy language on my part, although in my defense it is because I do not particularly care for the Unicode terminology here:

Code Point. Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D4b in Section 3.4, Characters and Encoding.)

Code Unit. The minimal bit combination that can represent a unit of encoded text for processing or interchange. The Unicode Standard uses 8-bit code units in the UTF-8 encoding form, 16-bit code units in the UTF-16 encoding form, and 32-bit code units in the UTF-32 encoding form. (See definition D28a in Section 3.9, Unicode Encoding Forms.)
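To make the two definitions concrete, here is a small Python sketch (not part of the original post; the helper name `code_unit_counts` is made up for illustration). It counts the code points in a string and the code units the same text needs in each encoding form. It relies on Python's `str` being indexed by code point, and it assumes the text contains no lone surrogates.

```python
def code_unit_counts(text: str) -> dict:
    """Count code points, and code units per Unicode encoding form."""
    return {
        # Python's str is a sequence of code points.
        "code_points": len(text),
        # UTF-8 uses 8-bit code units, so one byte is one code unit.
        "utf8_units": len(text.encode("utf-8")),
        # UTF-16 uses 16-bit code units (2 bytes each); "-le" avoids a BOM.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # UTF-32 uses 32-bit code units (4 bytes each).
        "utf32_units": len(text.encode("utf-32-le")) // 4,
    }


# U+0041 (in the BMP) followed by U+1D11E MUSICAL SYMBOL G CLEF
# (a supplementary character, so a surrogate pair in UTF-16).
sample = "A\U0001D11E"
print(code_unit_counts(sample))
```

For this sample there are two code points, but five UTF-8 code units, three UTF-16 code units, and two UTF-32 code units. Only in UTF-32 do the two counts always agree.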

If I say "UTF-16 code points" that ought to be good enough. :-)

But you are correct, Dmitri -- they are technically code units, not code points...
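As a rough illustration of the difference, the Python sketch below takes a string as a sequence of UTF-16 code units (the unit in which, per Dmitri's test, the Windows length parameters are counted) and derives the number of code points by treating each high/low surrogate pair as one. This is not a Windows API. The function names are hypothetical, and counting an unpaired surrogate as one code point is an assumption made here for simplicity.

```python
import struct


def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def count_code_points(units: list) -> int:
    """Count code points in a sequence of UTF-16 code units."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        # A high surrogate followed by a low surrogate encodes one code point.
        # Anything else, including an unpaired surrogate, is counted as one
        # (an assumption of this sketch).
        i += 2 if (is_high and has_low) else 1
        count += 1
    return count


units = utf16_code_units("A\U0001D11E")
print([hex(u) for u in units])   # the surrogate pair shows up as two units
print(len(units))                # length in code units
print(count_code_points(units))  # length in code points
```

The length in code units is 3, while the length in code points is 2. That gap is what Dmitri saw when WideCharToMultiByte reported how much of the input it had consumed.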

 

