Getting exactly ONE Unicode code point out of UTF-8
by Michael S. Kaplan, published on 2005/05/21 05:45 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/21/420708.aspx
Now this is a question that I would make into an interview question, if only there were some way to do all the setup work in time.
Unfortunately, unless the candidate is very knowledgable about internationalization on the way in, there is no way to get them all of the information to solve the problem in such a short time.
I figure that if it can't make a good interview question, it might at least make a good blog post! :-)
Anyway, the question is simple -- how can you get a single UTF-8 code point out of a stream of UTF-8 data?
There are many times you may want to do this -- like if you are looking for a UTF-8 BOM. But the real question is how can you do it....
(If you are going to test yourself then read no further and try to work out a solution, then come back to look at the rest of the post!)
Now legal (valid) UTF-8 will be between one and four bytes per code point. Thus if converting four bytes of UTF-8 to UTF-16, you could end up with between one and four UTF-16 code points. Using the distribution I pointed out yesterday:
now right away this may point out one method you can use -- just take four bytes and convert them into a four-WCHAR scratchpad with the MultiByteToWideChar function. Then ignore it all except for the first WCHAR, like this:
WCHAR Scratch[4] = {0};
int cch = MultiByteToWideChar(CP_UTF8, 0, lpsz, 4, Scratch, sizeof(Scratch) / sizeof(WCHAR));
if(cch > 0) {
// Use Scratch[0] to do whatever you wanted to do here
}
Now if this were an interview I'd expect that the candidate (who would have impressed me if she thought of this answer) would take care of the obvious issues like making sure that the UTF-8 buffer represented by lpsz was at least four bytes in size.
And I would then point out that a call to lstrlenA is not the best answer since that would walk the whole string when you only needed to walk at most four bytes of it. The easiest solution then would be to just write a mini lstrlenN-esque function or just walk a few bytes right there.
And then (if this were an interview) I would ask him to perhaps build this approach into a function that would handle the generic case of a function whose job is to "nibble" that first code point out.
But first I would ask her about how she would conditionally increment the pointer past that one character.
Hmmm.... almost a brain teaser!
That cch value will be somewhere from 1 to 4, which gives some hints:
- If it is 4, you only have to increment by 1 byte;
- If it is 1, you have to increment by 4 bytes;
- If it is 2, you have to increment by 1, 2, or 3 bytes (and then the other code point would be 3, 2, or 1 bytes, respectively);
- If it is 3, you have to increment by 1 or 2 bytes (and then the other two code points would also be 1 or 2 bytes each).
There might be an attempt at a blind alley as he tries to figure out how to determine how to determine the answer with one or two more calls to MultiByteToWideChar. I'd quickly hint him away from that idea, and point him back at that range table:
U+0000 - U+007f 1 byte
U+0080 - U+07ff 2 bytes
U+0800 - U+ffff 3 bytes
U+10000 - U+10ffff 4 bytes (U+d800 U+dc00 - U+dbff U+dfff)
The candidate could then look at this table and figure out the easiest code method to figure out the answer based on it and that one code point.
Or, if she was going to really impress me and she asked (or even better if she already knew!) the way that the UTF-8 bits are laid out:
UTF-16 |
1st Byte |
2nd Byte |
3rd Byte |
4th Byte |
00000000 0xxxxxxx |
0xxxxxxx |
- |
- |
- |
00000yyy yyxxxxxx |
110yyyyy |
10xxxxxx |
- |
- |
zzzzyyyy yyxxxxxx |
1110zzzz |
10yyyyyy |
10xxxxxx |
- |
110110ww wwzzzzyy 110111yy yyxxxxxx |
11110uuu |
10uuzzzz |
10yyyyyy |
10xxxxxx |
She could do a tiny little bit of "bit nibbling" on the very first byte as a way to quickly know exactly how many bytes the first code point would need.
Now while I do consider both of these methods to qualify as both smart and clever as far as solutions go, this second method obviously has many advantages over the first in that you do not need to look at as much of the string -- you do not have to walk past the first byte. And no guessing games are needed with the return value of the function call, in fact you get to skip the function call entirely. Obviously a much cleaner solution, all the way around.
In fact, it is fun to think about the quickest code you would write to do that nibble, with the fewest number of assembly operations once the code was compiled. Anyone want to take a stab at this last part of how to make the faster solution as speedy as possible? :-)
This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, a.k.a. the BOM, of course!)
# CornedBee on 21 May 2005 6:55 AM:
To make it return 1, you could add:
bgt done
bis $0 1 $0
done:
# CornedBee on 21 May 2005 6:56 AM:
On an Alpha processor (or a M68k if I remember correctly), I could use the bitcount instructions. Assuming $16 (that's the first function argument) contains the starting address of the string:
// Load first byte into temporary
ldb $1 0($16)
// Shift to left end of register
sll $1 56 $1
// Flip bits
not $1 $1
// Count leading zeros
ctlz $0 $1
ret
This is not perfect, because it returns 0 if the code point is 1 byte long. But it's fast.
# rey on 21 May 2005 8:08 AM:
But with a lookup table you could be fast and portable, and it won't break if you encounter an invalid character. Though I admit it's pretty boring :)
const int utf8_length_tab[] = [
1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1,
0, 0, 0, 0, 0, 0, 0, 0,
2, 2, 2, 2, 3, 3, 4, 0 ];
return utf8_length_tab[*(unsigned char *)utf8_str >> 3];
# Mike Dimmick on 21 May 2005 9:28 AM:
I can't beat bit-counting instructions on a processor that supports it, but let's assume that we also want portability! Another criterion is of course readability. CornedBee also doesn't cover the case where we start on a trail byte - you're only incrementing by 1 rather than finding the beginning of the next code point. Nor do you check for the null terminator.
If we're just considering the first nibble (were you hinting at this, Michael?) there are only 16 possibilities, much of which can be done in a lookup table or a switch statement.
My first cut did a switch on ( psz[0] & 0xF0 ), but the compiler didn't optimise that too well - 163 bytes, 74 instructions. My second try instead used ( psz[0] >> 4 ) which it actually did a rather better job of - 91 bytes, 40 instructions. These used a simple unrolled loop to find the next lead byte/single byte if we start on a trail byte.
I thought I'd try a different tactic - a table lookup. This came out even better at 63 bytes, 26 instructions, so I'm including it here:
const char* MoveToNextUTF8_3( const char* psz )
{
static const ptrdiff_t nextOffset[] =
{
/* 0000 */ 1,
/* 0001 */ 1,
/* 0010 */ 1,
/* 0011 */ 1,
/* 0100 */ 1,
/* 0101 */ 1,
/* 0110 */ 1,
/* 0111 */ 1,
/* 1000 */ -1,
/* 1001 */ -1,
/* 1010 */ -1,
/* 1011 */ -1,
/* 1100 */ 2,
/* 1101 */ 2,
/* 1110 */ 3,
/* 1111 */ 4,
};
// Special case - if NULL return same position
if ( *psz == '\0' )
return psz;
ptrdiff_t offset = nextOffset[ psz[0] >> 4 ];
if ( offset < 0 )
{
if ( ( psz[1] & 0xC0 ) != 0x80 )
return psz + 1;
if ( ( psz[2] & 0xC0 ) != 0x80 )
return psz + 2;
//if ( ( psz[3] & 0xC0 ) != 0x80 )
return psz + 3;
}
else
{
return psz + offset;
}
}
Compiled with VS.NET 2003, /Oxs, extracted using dumpbin:
mov ecx,dword ptr [esp+4]
mov al,byte ptr [ecx]
test al,al
jne 0000010B
mov eax,ecx
ret
0000010B:
movsx eax,al
sar eax,4
mov eax,dword ptr [eax*4]
test eax,eax
jge 0000013A
lea eax,[ecx+1]
mov dl,byte ptr [eax]
and dl,0C0h
cmp dl,80h
jne 0000013C
lea eax,[ecx+2]
mov dl,byte ptr [eax]
and dl,0C0h
cmp dl,80h
jne 0000013C
lea eax,[ecx+3]
ret
0000013A:
add eax,ecx
0000013C:
ret
It would probably be smaller if I hadn't unrolled the loop. Yep, goes down to 42 bytes, 18 instructions, and incidentally copes a little better with malformed UTF-8 (i.e. where too many trail bytes follow a lead byte). The change:
if ( offset < 0 )
{
do
{
++psz;
}
while ( ( psz[0] & 0xC0 ) == 0x80 );
return psz;
}
A shifting solution uses 26 instructions and 52 bytes and will cope with lead bytes that have more than 4 bits set. I made a number of errors while writing it, though... Final code:
const char* MoveToNextUTF8_5( const char* psz )
{
// Special case - if NULL return same position
if ( *psz == '\0' )
return psz;
if ( ( psz[0] & 0xC0 ) == 0x80 )
{
++psz;
while ( ( psz[0] & 0xC0 ) == 0x80 )
{
++psz;
}
return psz;
}
// while top bit is 1, increment offset
if ( ( psz[0] & 0x80 ) == 0 )
return psz + 1;
ptrdiff_t offset = 0;
char ch = psz[0];
while ( ( ch & 0x80 ) == 0x80 )
{
++offset;
ch <<= 1;
}
return psz + offset;
}
# Mike Dimmick on 21 May 2005 9:34 AM:
After reading rey's solution (which has the same problems as CornedBee's, ahem, and won't work because you're only shifting by three when you need four, you'll go off the end of your table) I realised that I didn't need to use -1 for the 10xx cases, I could use 0, then the == operator in the test rather than <. It makes no difference, it simply changes one instruction from jge to jne.
# Michael S. Kaplan on 21 May 2005 9:53 AM:
Mike is definitely on the track that my 'thought experiment' was on here, though he was able to get further than I did since was doing the actual compiles.
So far its looking like he's aced this 'interview' :-)
# ronab49 on 21 May 2005 9:44 PM:
Here is my modification to Mike's code. It basically rearranges the test cases to make sure the cases that happen most of the time, use less CPU cycles than the cases than do not happen often. Moreover, I believe \0 is a valid one-octet codepoint Unicode, so I took out Mike's first line.
const char* MoveToNextUTF8(const char * psz ) {
const char * opsz = psz;
++psz; // at least one octet
if (*opsz & 0x80) {
++psz; // at least two octets
if (*opsz & 0x40) {
if (*opsz & 0x20) {
++psz; // at least three octets
if (*opsz & 0x10) {
++psz; // four octets
}
// only top 3 msb's are 1, three octets
}
// only top 2 msb's are 1, two octets
}
else { // only top 1 msb's is 1, scan for the next valid start
while ((*(++opsz) & 0xc0) == 0x80);
return opsz;
}
}
return psz; // msb is 0, one octet
}
Here is the listing generated by VC++ 2005
00000 8b 54 24 04 mov edx, DWORD PTR _psz$[esp-4]
00004 8b c2 mov eax, edx
; Line 9
00006 8a 08 mov cl, BYTE PTR [eax]
00008 42 inc edx
00009 84 c9 test cl, cl
0000b 79 12 jns SHORT $LN3@MoveToNext
; Line 10
0000d 42 inc edx
; Line 11
0000e f6 c1 40 test cl, 64 ; 00000040H
00011 74 0f je SHORT $LL2@MoveToNext
; Line 12
00013 f6 c1 20 test cl, 32 ; 00000020H
00016 74 07 je SHORT $LN3@MoveToNext
; Line 13
00018 42 inc edx
; Line 14
00019 f6 c1 10 test cl, 16 ; 00000010H
0001c 74 01 je SHORT $LN3@MoveToNext
; Line 15
0001e 42 inc edx
$LN3@MoveToNext:
; Line 26
0001f 8b c2 mov eax, edx
; Line 27
00021 c3 ret 0
$LL2@MoveToNext:
; Line 22
00022 40 inc eax
00023 8a 08 mov cl, BYTE PTR [eax]
00025 80 e1 c0 and cl, 192 ; 000000c0H
00028 80 f9 80 cmp cl, 128 ; 00000080H
0002b 74 f5 je SHORT $LL2@MoveToNext
; Line 27
0002d c3 ret 0
# Mike Dimmick on 23 May 2005 8:12 AM:
Ron, you can't omit the test for the null terminator! If you do this you run off the end of the string. Yes, 0 is a valid UTF-8 value, but it decodes to U+0000, the null terminator.
I wasn't sure what the termination condition should be, but CharNext returns the current position if the current position is pointing to the null terminator. For consistency we'll go with that.
I miscounted - my table-based solution, with the loop to find the next character, only uses 17 instructions. It's 18 lines in the disassembly because one of the instructions wraps in dumpbin's output. There's a slight trade-off as the table will by default be on a different page (in the read-only data segment) to the code, so when the code is 'cold' it might take slightly longer for the first operation if the data needs to be paged in as well as the code. You can probably get around this using #pragma const_seg.
Just reading the disassembly again I've noticed that I have a bug! The default 'char' for the Microsoft compiler is *signed*, and is sign-extended to 32-bits before the shift (as char is promoted to int for arithmetic operations). Hence the offset ends up being *negative* if the top bit is set (i.e. for anything interesting, code units 128 - 255). To fix, cast to unsigned char before shifting, as rey did.
That'll teach me to post before testing properly.
# Anthony J. Mills on 23 May 2005 12:03 PM:
Quickest is not always the fewest number of operations, but it helps these days.
char* MoveToNextUTF8(char* psz)
{
if (*psz != '\0') while (*(++psz) & 0xC0 == 0x80) ;
return psz;
}
I mean, make it a __fastcall (I think) where the input/output are in EAX and you get:
push edx
mov dl, [eax]
jz @Exit
@Loop:
inc eax
mov dl, [eax]
and dl, 0C0h
cmp dl, 080h
je @Loop
@Exit:
pop edx
ret
10 instructions. Copes very very well with malformed UTF-8. Without ample justification I wouldn't make this code any longer or harder to read than this.
Christopher Yeleigton on 24 May 2009 2:49 PM:
All in all, the answer is that Windows API does not support this use case and you have to roll your own. Pathetic.
Michael S. Kaplan on 25 May 2009 10:54 PM:
????
The Windows API, which is not UTF-8 based, does not have UTF-8 helper functions, and that is what you find pathetic? Okay, if you say so. I assume you design your operating systems with as many superfluous functions as possible?
Yuhong Bao on 11 Feb 2010 8:33 PM:
"Assuming $16 (that's the first function argument) contains the starting address of the string: "
While it didn't exist in 2005, recent AMD processors has the LZCNT instruction which would be perfect for this. Unfortunately, there is no 8-bit operand version, which means you have to shift.
Please consider a
donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.
referenced by
go to newer or older post, or back to index or month or day