UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16

by Michael S. Kaplan, published on 2010/11/24 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/24/10095816.aspx

Pretend that the title has given you no clues as to what is about to happen, please.

The nature of the strings and the purpose of the cache aren't relevant to the topic of this blog; I'll just say that the strings aren't file paths but are potentially much longer than that.

Anyway, the cache itself had certain limitations which amount to a maximum size per string, and the strings themselves can be visible to the user.

If you are a regular long-time reader here then at this point, your first thought may be the same as mine was -- the lessons from my whole UCS-2 to UTF-16 series, in particular the blogs within it dealing with truncation and not changing the meaning/appearance of the strings in unexpected ways.

Under ordinary circumstances, that would handle it -- the truth lies somewhere between implementing nothing in that series and (to cover all of the locale-specific and linguistic issues) implementing all 110% of it. Sadly enough, in most cases the final decision ends up closer to the 0% than the 110%, but finding that sad is the occupational hazard of being me, something I wouldn't recommend if you can avoid it!

In fact, they were looking for help since their attempted solution code was using IsDbcsLeadByteEx and CharNextExA, neither of which seemed to support UTF-8.

The problem here points to a different, smaller series, the "You may want to rethink your choice of UTF" one.

In this particular case, the tripping point is in Part #3 -- by implementing a solution using UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16.

But if one is using either native or managed code coming from Microsoft, then all of the rest of the goodies aren't available to you, since all of that is in UTF-16.

Maybe it's time to talk to someone about that implementation decision to use UTF-8 here?

Okay, most of the time in these situations if someone is talking to me at the point where they are asking the "Does CharPrevExA support UTF-8?" question, then my knowledge isn't the issue, and neither is my ability to make implementation suggestions.

At this point everything is already written and potentially already shipped in some other, lesser manner that is only now being looked at in order to try and fix this problem.

If the string is too long, one has to walk back a certain number of bytes, and that number can only be known after one knows a lot more about the characters....

Though thinking about the Microsoft developer interview question, how would you attack the problem?

And don't suggest including ICU, we don't currently do that, as I said yesterday!

OK, I'll give it a shot.

I would take advantage of the fact that every byte in a UTF-8 encoding can be put into one of six categories (depending on its value).

1. Values: 00-7F - Single byte character (SB)

2. Values: 80-BF - Trailing byte (TB)

3. Values: C0-DF - Leading byte of two byte character (LB2)

4. Values: E0-EF - Leading byte of three byte character (LB3)

5. Values: F0-F7 - Leading byte of four byte character (LB4)

6. Values: F8-FF - Illegal (ILL)
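The six categories above map directly onto a chain of range checks. Here is a minimal sketch in C; the enum and the function name `utf8_classify` are hypothetical names chosen for illustration, not part of any Windows API. Note that this is a purely structural classification: the bytes C0, C1 and F5-F7 fall into the LB2 and LB4 ranges here but never occur in well-formed UTF-8, so a strict validator would reject them as well.

```c
#include <stdint.h>

/* Hypothetical names; the categories follow the list above. */
typedef enum {
    UTF8_SB,   /* 00-7F: single byte character */
    UTF8_TB,   /* 80-BF: trailing byte */
    UTF8_LB2,  /* C0-DF: leading byte of two byte character */
    UTF8_LB3,  /* E0-EF: leading byte of three byte character */
    UTF8_LB4,  /* F0-F7: leading byte of four byte character */
    UTF8_ILL   /* F8-FF: illegal */
} Utf8ByteClass;

/* Classify one byte by value alone, without looking at its neighbors. */
Utf8ByteClass utf8_classify(uint8_t b)
{
    if (b <= 0x7F) return UTF8_SB;
    if (b <= 0xBF) return UTF8_TB;
    if (b <= 0xDF) return UTF8_LB2;
    if (b <= 0xEF) return UTF8_LB3;
    if (b <= 0xF7) return UTF8_LB4;
    return UTF8_ILL;
}
```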

And every Unicode code point is encoded in UTF-8 in one of four forms.

1. SB

2. LB2 TB

3. LB3 TB TB

4. LB4 TB TB TB

Then, if I wanted a solution similar to CharNextExA, I would write a function something like this:

LPSTR UTF8CharNext(LPCSTR lpCurrentChar, UINT32 *pCodePoint);
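One possible body for such a function is sketched below. Several things here are assumptions rather than anything the prototype specifies: the name `utf8_char_next` and the standard C types stand in for `LPSTR`/`LPCSTR`/`UINT32` so the sketch compiles without `windows.h`; malformed input is reported as U+FFFD and skipped; and, like CharNext, the function stays put when it is already at the terminating null. It checks only the four structural forms listed above, so it does not reject overlong encodings, surrogates, or values above U+10FFFF.

```c
#include <stdint.h>

/* Decode the UTF-8 sequence starting at p (a NUL-terminated string),
 * store its code point in *code_point, and return a pointer to the
 * start of the next sequence. Hypothetical helper, not a Windows API. */
const char *utf8_char_next(const char *p, uint32_t *code_point)
{
    const unsigned char *s = (const unsigned char *)p;
    uint32_t cp;
    int trailing;  /* number of TB bytes expected after the lead byte */
    int i;

    if (s[0] == 0) {               /* at the terminator: do not advance */
        *code_point = 0;
        return p;
    }
    if (s[0] <= 0x7F)                      { cp = s[0];        trailing = 0; }
    else if (s[0] >= 0xC0 && s[0] <= 0xDF) { cp = s[0] & 0x1F; trailing = 1; }
    else if (s[0] >= 0xE0 && s[0] <= 0xEF) { cp = s[0] & 0x0F; trailing = 2; }
    else if (s[0] >= 0xF0 && s[0] <= 0xF7) { cp = s[0] & 0x07; trailing = 3; }
    else {                         /* stray TB or ILL byte */
        *code_point = 0xFFFD;
        return p + 1;
    }

    for (i = 1; i <= trailing; i++) {
        if ((s[i] & 0xC0) != 0x80) {   /* not a TB: sequence is cut short */
            *code_point = 0xFFFD;
            return p + i;              /* resume at the offending byte */
        }
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    *code_point = cp;
    return p + trailing + 1;
}
```

Because the loop stops at the first byte that is not a trailing byte, a sequence that is cut off by the terminating null never causes a read past the end of the string.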

My first thought would be to ask "Are you sure you need to know where the characters are? As long as you don't *edit* the string, you don't need to." In this case there's truncation, so that counts as editing, but it's surprising how many people are convinced they need to do very complex things when in fact they just want to know where the string ends, and don't actually need to handle individual characters. "Stop caring" is surprisingly often the correct answer to "where are the characters".

My second would be to say "Do you realize that even if CharPrevExA, or the like, supported UTF-8, that would still *not* be enough to do things properly in a large number of cases?" The first obvious case is combining characters, but actually, if you go deep inside things, clean truncation can only be done once you know in which language, or according to which locale rules, it should be done. Only positions where the language rules say hyphenation can occur are definitively places where it's guaranteed nothing inappropriate at all can happen if you truncate.

So you need to define what is "good enough" for your needs. Once you've done that, I'd say most of the time the most appropriate approach is to convert the whole string to UTF-16 using MultiByteToWideChar and then have the whole Win32 i18n API available. Once you're done truncating, convert back to UTF-8 with WideCharToMultiByte.

But if you decide that truncating at a code point boundary is good enough for you, then maybe you don't need to do that much. Any UTF-8 byte is unambiguously either a one-byte sequence, the start of a longer sequence, or in the middle of a sequence. In fact, if the value is in the range 80-BF, it's in the middle of a sequence; otherwise it's the start of one (of maybe only one byte). So you just need to look at the first byte that would be cut off and, while it is in the range 80-BF, go back one byte; as soon as you find a byte outside that range, you can truncate just before it. Note that UTF-8 is the *only* common multibyte encoding that's that convenient and easy WRT finding where the code point boundaries are.
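The walk-back can be sketched in a few lines of C. The function name `utf8_truncated_length` and its parameters are hypothetical; the sketch treats a byte as the middle of a sequence when it lies in 80-BF (that is, when its top two bits are `10`) and as a boundary otherwise. It assumes the input is already well-formed UTF-8, and it only protects code point boundaries, so a base character can still be separated from a combining character that follows it.

```c
#include <stddef.h>

/* Return the largest length n <= max_bytes such that cutting s (len bytes
 * of UTF-8) down to its first n bytes does not split a multi-byte sequence.
 * Hypothetical helper for illustration. */
size_t utf8_truncated_length(const char *s, size_t len, size_t max_bytes)
{
    size_t n;

    if (len <= max_bytes)
        return len;                /* already fits, nothing to cut */

    n = max_bytes;                 /* s[n] is the first byte that would be cut off */

    /* If s[n] is a trailing byte (80-BF), cutting here would leave a partial
     * sequence at the end, so step back until s[n] starts a sequence. */
    while (n > 0 && ((unsigned char)s[n] & 0xC0) == 0x80)
        n--;

    return n;
}
```

For example, with the five-byte string "a", U+20AC (E2 82 AC), "z" and a limit of 3 bytes, the function returns 1: the euro sign does not fit whole, so it is dropped entirely rather than cut in the middle.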