UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16

by Michael S. Kaplan, published on 2010/11/24 07:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/24/10095816.aspx


I had some developers running an interesting scenario by me the other day.

They had some strings that were being kept in a cache.

Pretend that the title has given you no clues as to what is about to happen, please.

The nature of the strings and the purpose of the cache aren't relevant to the topic of this blog; I'll just say that the strings aren't file paths but are potentially much longer than that.

Anyway, the cache itself had certain limitations which amount to a maximum size per string, and the strings themselves can be visible to the user.

If you are a regular long-time reader here then at this point, your first thought may be the same as mine was -- the lessons from my whole UCS-2 to UTF-16 series, in particular the blogs within it dealing with truncation and not changing the meaning/appearance of the strings in unexpected ways.

Under ordinary circumstances, that would handle it -- the truth lies somewhere between implementing nothing in that series and (to cover all of the locale-specific and linguistic issues) implementing all 110% of it (sadly enough, in most cases the final decision ends up closer to the 0% than the 110%, but finding that sad is the occupational hazard of being me, something I wouldn't recommend if you can avoid it!).

However, in this case there was an additional complication.

The string cache I mentioned? The strings were being stored in UTF-8.

In fact, they were looking for help since their attempted solution code was using IsDBCSLeadByteEx and CharNextExA, neither of which seemed to support UTF-8.

Very true, neither function does (I previously discussed this in "Is CharNextExA broken?").

And now we have a ball game.

The problem here points to a different, smaller series, the "You may want to rethink your choice of UTF" one.

In this particular case, the tripping point is in Part #3 -- by implementing a solution using UTF-8 on a platform whose support is overwhelmingly, almost oppressively, UTF-16.

Certainly one can crack the byte semantics of UTF-8 (you can use the information in "Getting exactly ONE Unicode code point out of UTF-8" as a roadmap), and figure out code point boundaries.

But if one is using either native or managed code coming from Microsoft, then all of the rest of the goodies aren't available to you, since all of that is in UTF-16.

Maybe it's time to talk to someone about that implementation decision to use UTF-8 here?

Okay, most of the time in these situations if someone is talking to me at the point where they are asking the "Does CharPrevExA support UTF-8?" question, then my knowledge isn't the issue, and neither is my ability to make implementation suggestions.

At this point everything is already written and potentially already shipped in some other, lesser manner that is only now being looked at in order to try and fix this problem.

I don't take it personally. :-)

Things are now complicated though.

If the string is too long, one has to walk back a certain number of bytes, and that number can only be known after one knows a lot more about the characters....

There is no easy answer here.

Though thinking about the Microsoft developer interview question, how would you attack the problem?

And don't suggest including ICU, we don't currently do that, as I said yesterday!


Stuart on 24 Nov 2010 9:20 AM:

OK, I'll give it a shot.

I would take advantage of the fact that every byte in a UTF-8 encoding can be put into one of six categories (depending on its value).

1. Values: 00-7F - Single byte character (SB)

2. Values: 80-BF - Trailing byte (TB)

3. Values: C0-DF - Leading byte of two byte character (LB2)

4. Values: E0-EF - Leading byte of three byte character (LB3)

5. Values: F0-F7 - Leading byte of four byte character (LB4)

6. Values: F8-FF - Illegal (ILL)

And every Unicode code point is encoded in UTF-8 in one of four forms.

1. SB

2. LB2 TB

3. LB3 TB TB

4. LB4 TB TB TB

Then, if I wanted a solution similar to CharNextExA, I would write a function something like this:

LPSTR UTF8CharNext(LPCSTR lpCurrentChar, UINT32 *pCodePoint);
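
One way the body of such a function might look is sketched below. This is only an illustration, not the commenter's code: the name utf8_char_next and its use of plain C types in place of LPSTR, LPCSTR and UINT32 are assumptions made so the sketch stays portable, and returning NULL on malformed input is just one possible error convention. It follows the six byte categories above, with one refinement: lead bytes C0, C1 and F5-F7 fit the bit patterns in the table but never occur in well-formed UTF-8, so the sketch rejects the overlong or out-of-range values they would produce.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical portable take on the UTF8CharNext idea.
   Decodes the code point starting at p (a NUL-terminated string) and
   returns a pointer to the byte after it, or NULL if the bytes at p are
   not well-formed UTF-8. At the terminating NUL it stores 0 and returns p. */
const char *utf8_char_next(const char *p, uint32_t *code_point)
{
    const unsigned char *s = (const unsigned char *)p;
    uint32_t cp, min;
    int trailing, i;

    assert(p != NULL && code_point != NULL);

    if (s[0] == 0x00) { *code_point = 0; return p; }   /* end of string */

    if (s[0] < 0x80)      { cp = s[0];        trailing = 0; min = 0x0;     } /* SB  */
    else if (s[0] < 0xC0) { return NULL; }             /* TB without a lead byte */
    else if (s[0] < 0xE0) { cp = s[0] & 0x1F; trailing = 1; min = 0x80;    } /* LB2 */
    else if (s[0] < 0xF0) { cp = s[0] & 0x0F; trailing = 2; min = 0x800;   } /* LB3 */
    else if (s[0] < 0xF8) { cp = s[0] & 0x07; trailing = 3; min = 0x10000; } /* LB4 */
    else                  { return NULL; }             /* ILL */

    for (i = 1; i <= trailing; i++) {
        /* Each following byte must be 80-BF; this also stops at an early NUL. */
        if ((s[i] & 0xC0) != 0x80) return NULL;
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    /* Reject overlong forms, values above U+10FFFF, and surrogates. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return NULL;

    *code_point = cp;
    return p + trailing + 1;
}
```

Calling it in a loop until it stores 0 or returns NULL walks a string one code point at a time. Note that it works on code points only; it knows nothing about combining sequences or other user-perceived character boundaries.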

Joshua on 24 Nov 2010 2:36 PM:

They tell me "Don't truncate strings."

Therefore I conclude that if it's too long, bail on the insert.

jmdesp on 25 Nov 2010 3:30 PM:

My first thought would be to ask "Are you sure you need to know where the characters are? As long as you don't *edit* the string you don't need to". In this case there's truncation, so that counts as editing, but it's surprising how many people are convinced they need to do very complex things when in fact they just want to know where the string ends, and don't actually need to handle individual characters. "Stop caring" is surprisingly often the correct answer to "where are the characters".

My second would be to say "Do you realize that even if CharPrevExA, or the like, supported UTF-8, that would still *not* be enough to do things properly in a large number of cases?". The first obvious case is combining characters, but actually if you go deep inside things, clean truncation can only be done once you know in which language, or according to which locale rules, it should be done. Only positions where the language rules say hyphenation can occur are definitely places where it's guaranteed nothing inappropriate at all can happen if you truncate.

So you need to define what is "good enough" for your need. Once you've done that, I'd say most of the time the most appropriate approach is to convert the whole string to UTF16 using MultiByteToWideChar and then have the whole Win32 i18n API available. Once you're done truncating, convert back to UTF8 with WideCharToMultiByte.

But if you define that truncating at a code point boundary is good enough for you, then maybe you don't need to do that much. Any UTF8 byte is unambiguously either a one-byte sequence, the start of a longer sequence, or in the middle of a sequence. In fact, if the value is in the range 80-BF, it's in the middle of a sequence; otherwise it's the start of one (of maybe only one byte). So you just need to go back from the point where you want to cut, checking each byte against that range, and can truncate just before the first one you find that's outside it. Note that UTF8 is the *only* multibyte encoding that's that convenient and easy WRT finding where the code point boundaries are.
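
A minimal sketch of that walk-back follows, assuming the input is already well-formed UTF-8 and that the limit is expressed in bytes. The function name utf8_truncated_length and its parameters are hypothetical. The byte it inspects is the first one that would be dropped: while that byte is a trailing byte (80-BF), the cut falls inside a sequence, so the cut moves back one byte.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the largest length <= max_bytes at which
   the UTF-8 string s (len bytes, assumed well-formed) can be cut without
   splitting a code point. */
size_t utf8_truncated_length(const char *s, size_t len, size_t max_bytes)
{
    size_t cut;

    assert(s != NULL);

    if (len <= max_bytes)
        return len;              /* already fits, nothing to cut */

    cut = max_bytes;             /* s[cut] is the first byte to be dropped */

    /* A trailing byte has the bit pattern 10xxxxxx. If the first dropped
       byte is one, its lead byte is still being kept, so step back. */
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;

    return cut;
}
```

Since a sequence is at most four bytes long, the loop steps back at most three times on well-formed input. This only guarantees a cut on a code point boundary; it can still separate a base letter from a following combining mark, which is the "good enough" decision discussed above.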

Yuhong Bao on 26 Nov 2010 7:49 PM:

Yeah, IsDBCSLeadByteEx does not even make sense for character sets that use more than two bytes per char.

