Getting exactly ONE Unicode code point out of UTF-8

by Michael S. Kaplan, published on 2005/05/21 05:45 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/21/420708.aspx

Now this is a question that I would make into an interview question, if only there were some way to do all the setup work in time.

Unfortunately, unless the candidate is very knowledgable about internationalization on the way in, there is no way to get them all of the information to solve the problem in such a short time.

I figure that if it can't make a good interview question, it might at least make a good blog post! :-)

Anyway, the question is simple -- how can you get a single UTF-8 code point out of a stream of UTF-8 data?

There are many times you may want to do this -- like if you are looking for a UTF-8 BOM. But the real question is how can you do it....

(If you are going to test yourself then read no further and try to work out a solution, then come back to look at the rest of the post!)

Now legal (valid) UTF-8 will be between one and four bytes per code point. Thus if converting four bytes of UTF-8 to UTF-16, you could end up with between one and four UTF-16 code points. Using the distribution I pointed out yesterday:

now right away this may point out one method you can use -- just take four bytes and convert them into a four-WCHAR scratchpad with the MultiByteToWideChar function. Then ignore it all except for the first WCHAR, like this:

Now if this were an interview I'd expect that the candidate (who would have impressed me if she thought of this answer) would take care of the obvious issues like making sure that the UTF-8 buffer represented by lpsz was at least four bytes in size.

And I would then point out that a call to lstrlenA is not the best answer since that would walk the whole string when you only needed to walk at most four bytes of it. The easiest solution then would be to just write a mini lstrlenN-esque function or just walk a few bytes right there.

And then (if this were an interview) I would ask him to perhaps build this approach into a function that would handle the generic case of a function whose job is to "nibble" that first code point out.

But first I would ask her about how she would conditionally increment the pointer past that one character.

There might be an attempt at a blind alley as he tries to figure out how to determine how to determine the answer with one or two more calls to MultiByteToWideChar. I'd quickly hint him away from that idea, and point him back at that range table:

I can't beat bit-counting instructions on a processor that supports it, but let's assume that we also want portability! Another criterion is of course readability. CornedBee also doesn't cover the case where we start on a trail byte - you're only incrementing by 1 rather than finding the beginning of the next code point. Nor do you check for the null terminator.

If we're just considering the first nibble (were you hinting at this, Michael?) there are only 16 possibilities, much of which can be done in a lookup table or a switch statement.

My first cut did a switch on ( psz[0] & 0xF0 ), but the compiler didn't optimise that too well - 163 bytes, 74 instructions. My second try instead used ( psz[0] >> 4 ) which it actually did a rather better job of - 91 bytes, 40 instructions. These used a simple unrolled loop to find the next lead byte/single byte if we start on a trail byte.

I thought I'd try a different tactic - a table lookup. This came out even better at 63 bytes, 26 instructions, so I'm including it here:

const char* MoveToNextUTF8_3( const char* psz )
{
static const ptrdiff_t nextOffset[] =
{
/* 0000 */ 1,
/* 0001 */ 1,
/* 0010 */ 1,
/* 0011 */ 1,
/* 0100 */ 1,
/* 0101 */ 1,
/* 0110 */ 1,
/* 0111 */ 1,
/* 1000 */ -1,
/* 1001 */ -1,
/* 1010 */ -1,
/* 1011 */ -1,
/* 1100 */ 2,
/* 1101 */ 2,
/* 1110 */ 3,
/* 1111 */ 4,
};

// Special case - if NULL return same position
if ( *psz == '\0' )
return psz;

ptrdiff_t offset = nextOffset[ psz[0] >> 4 ];

if ( offset < 0 )
{
if ( ( psz[1] & 0xC0 ) != 0x80 )
return psz + 1;
if ( ( psz[2] & 0xC0 ) != 0x80 )
return psz + 2;

//if ( ( psz[3] & 0xC0 ) != 0x80 )
return psz + 3;
}
else
{
return psz + offset;
}
}

Compiled with VS.NET 2003, /Oxs, extracted using dumpbin:

mov ecx,dword ptr [esp+4]
mov al,byte ptr [ecx]
test al,al
jne 0000010B
mov eax,ecx
ret
0000010B:
movsx eax,al
sar eax,4
mov eax,dword ptr [eax*4]
test eax,eax
jge 0000013A
lea eax,[ecx+1]
mov dl,byte ptr [eax]
and dl,0C0h
cmp dl,80h
jne 0000013C
lea eax,[ecx+2]
mov dl,byte ptr [eax]
and dl,0C0h
cmp dl,80h
jne 0000013C
lea eax,[ecx+3]
ret
0000013A:
add eax,ecx
0000013C:
ret

It would probably be smaller if I hadn't unrolled the loop. Yep, goes down to 42 bytes, 18 instructions, and incidentally copes a little better with malformed UTF-8 (i.e. where too many trail bytes follow a lead byte). The change:

if ( offset < 0 )
{
do
{
++psz;
}
while ( ( psz[0] & 0xC0 ) == 0x80 );

return psz;
}

A shifting solution uses 26 instructions and 52 bytes and will cope with lead bytes that have more than 4 bits set. I made a number of errors while writing it, though... Final code:

const char* MoveToNextUTF8_5( const char* psz )
{
// Special case - if NULL return same position
if ( *psz == '\0' )
return psz;

if ( ( psz[0] & 0xC0 ) == 0x80 )
{
++psz;

while ( ( psz[0] & 0xC0 ) == 0x80 )
{
++psz;
}

return psz;
}

// while top bit is 1, increment offset

if ( ( psz[0] & 0x80 ) == 0 )
return psz + 1;

ptrdiff_t offset = 0;
char ch = psz[0];

while ( ( ch & 0x80 ) == 0x80 )
{
++offset;
ch <<= 1;
}

return psz + offset;
}

Here is my modification to Mike's code. It basically rearranges the test cases to make sure the cases that happen most of the time, use less CPU cycles than the cases than do not happen often. Moreover, I believe \0 is a valid one-octet codepoint Unicode, so I took out Mike's first line.

const char* MoveToNextUTF8(const char * psz ) {
const char * opsz = psz;
++psz; // at least one octet
if (*opsz & 0x80) {
++psz; // at least two octets
if (*opsz & 0x40) {
if (*opsz & 0x20) {
++psz; // at least three octets
if (*opsz & 0x10) {
++psz; // four octets
}
// only top 3 msb's are 1, three octets
}
// only top 2 msb's are 1, two octets
}
else { // only top 1 msb's is 1, scan for the next valid start
while ((*(++opsz) & 0xc0) == 0x80);
return opsz;
}
}
return psz; // msb is 0, one octet
}

Here is the listing generated by VC++ 2005

00000 8b 54 24 04 mov edx, DWORD PTR _psz$[esp-4]
00004 8b c2 mov eax, edx
; Line 9
00006 8a 08 mov cl, BYTE PTR [eax]
00008 42 inc edx
00009 84 c9 test cl, cl
0000b 79 12 jns SHORT $LN3@MoveToNext
; Line 10
0000d 42 inc edx
; Line 11
0000e f6 c1 40 test cl, 64 ; 00000040H
00011 74 0f je SHORT $LL2@MoveToNext
; Line 12
00013 f6 c1 20 test cl, 32 ; 00000020H
00016 74 07 je SHORT $LN3@MoveToNext
; Line 13
00018 42 inc edx
; Line 14
00019 f6 c1 10 test cl, 16 ; 00000010H
0001c 74 01 je SHORT $LN3@MoveToNext
; Line 15
0001e 42 inc edx
$LN3@MoveToNext:
; Line 26
0001f 8b c2 mov eax, edx
; Line 27
00021 c3 ret 0
$LL2@MoveToNext:
; Line 22
00022 40 inc eax
00023 8a 08 mov cl, BYTE PTR [eax]
00025 80 e1 c0 and cl, 192 ; 000000c0H
00028 80 f9 80 cmp cl, 128 ; 00000080H
0002b 74 f5 je SHORT $LL2@MoveToNext
; Line 27
0002d c3 ret 0

Ron, you can't omit the test for the null terminator! If you do this you run off the end of the string. Yes, 0 is a valid UTF-8 value, but it decodes to U+0000, the null terminator.

I wasn't sure what the termination condition should be, but CharNext returns the current position if the current position is pointing to the null terminator. For consistency we'll go with that.

I miscounted - my table-based solution, with the loop to find the next character, only uses 17 instructions. It's 18 lines in the disassembly because one of the instructions wraps in dumpbin's output. There's a slight trade-off as the table will by default be on a different page (in the read-only data segment) to the code, so when the code is 'cold' it might take slightly longer for the first operation if the data needs to be paged in as well as the code. You can probably get around this using #pragma const_seg.

Just reading the disassembly again I've noticed that I have a bug! The default 'char' for the Microsoft compiler is *signed*, and is sign-extended to 32-bits before the shift (as char is promoted to int for arithmetic operations). Hence the offset ends up being *negative* if the top bit is set (i.e. for anything interesting, code units 128 - 255). To fix, cast to unsigned char before shifting, as rey did.

That'll teach me to post before testing properly.

UTF-16	1st Byte	2nd Byte	3rd Byte	4th Byte
00000000 0xxxxxxx	0xxxxxxx	-	-	-
00000yyy yyxxxxxx	110yyyyy	10xxxxxx	-	-
zzzzyyyy yyxxxxxx	1110zzzz	10yyyyyy	10xxxxxx	-
110110ww wwzzzzyy 110111yy yyxxxxxx	11110uuu	10uuzzzz	10yyyyyy	10xxxxxx