If working above U+FFFF is a problem in your program, then so is the basic stuff, too

by Michael S. Kaplan, published on 2007/10/23 10:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/23/5617352.aspx


So yesterday, the question asked on a Win32 programming alias inside Microsoft was:

I am looking for some document on Windows support for Unicode characters above 0xFFFF. How well does Windows support these code points?

Since each character can no longer be represented by one wchar_t, does the programmer need to do something special?

Does WideCharToMultiByte with CP_UTF8 work with these code points?  (Are we going to see UTF-8 encoded character of more than 3 bytes?)

I really get nervous when questions like this get asked. :-)

Mainly because generally speaking, everything should work just fine -- any app that can support U+006f U+0308 (ö -- o with combining umlaut) properly can support U+d840 U+dc00 (𠀀, the first Extension B ideograph).
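To make that comparison concrete, here is a minimal sketch (in Python, purely for illustration; the helper name `utf16_units` is made up for this example and is not a Windows API). It shows that both cases are two UTF-16 code units that the user sees as one character, and it also answers the UTF-8 part of the question: a supplementary character takes four bytes in UTF-8.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

combining = "o\u0308"          # o followed by COMBINING DIAERESIS
supplementary = "\U00020000"   # the first Extension B ideograph

print([hex(u) for u in utf16_units(combining)])      # ['0x6f', '0x308']
print([hex(u) for u in utf16_units(supplementary)])  # ['0xd840', '0xdc00']
print(len(supplementary.encode("utf-8")))            # 4
```

Either way, code that counts WCHARs for buffer sizes sees a length of 2, and code that wants "one character" has to look at more than one code unit.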

Of course there are applications that are screwing up that simple Latin script case, though such applications have bigger problems than supplementary characters!

Most developers don't need to care, since the level they work at is buffer allocations and size parameters -- which deal with literal WCHAR values, not what the user thinks of as a character....

But one of the most important points about supplementary characters, which I did not mention in The basics of supplementary, is that the high surrogates and low surrogates each sit in their own unique range, overlapping neither each other nor any other character. That makes them much easier to work with than all of the old lead byte/trail byte stuff in MBCS, where some trail bytes were also lead bytes, etc....
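A rough sketch of what those non-overlapping ranges buy you, again in Python and with hypothetical helper names. High surrogates are 0xD800 to 0xDBFF and low surrogates are 0xDC00 to 0xDFFF, so a single range check classifies any code unit, and a code unit such as 0x002F ('/') can never be half of a pair.

```python
def is_high_surrogate(unit):
    return 0xD800 <= unit <= 0xDBFF

def is_low_surrogate(unit):
    return 0xDC00 <= unit <= 0xDFFF

def code_points(units):
    """Combine surrogate pairs into code points.
    Unpaired surrogates are passed through unchanged."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if (is_high_surrogate(u) and i + 1 < len(units)
                and is_low_surrogate(units[i + 1])):
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out

def split_on_slash(units):
    """Split on the code unit 0x002F. This is safe without any pair
    tracking because '/' lies outside both surrogate ranges."""
    parts, current = [], []
    for u in units:
        if u == 0x002F:
            parts.append(current)
            current = []
        else:
            current.append(u)
    parts.append(current)
    return parts

# 'a', '/', U+20000 as a surrogate pair, '/', 'b'
units = [0x0061, 0x002F, 0xD840, 0xDC00, 0x002F, 0x0062]
print([hex(cp) for cp in code_points(units)])
print(split_on_slash(units))
```

Compare that with MBCS, where deciding whether a byte is a trail byte can require scanning back to the start of the string.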

Dave then added another bit to the question:

What if the application does simple parsing for ‘ANSI’ type characters?  Examples: ‘/’, ‘.’, etc.  Ones commonly used in paths or command lines.  Unaware WCHAR parsing could accidentally pick up those characters in the 2nd half of a 4 byte character.  Is there an easy way to avoid this problem?

It seems that this could be a problem if UTF-8 characters can be in a filename. Or anything that can be interpreted as a path.

This would be avoided if UTF-8 is designed to exclude the “zero-extended ANSI characters” (for lack of a shorter name) from the 2nd half of a 4 byte character.

If this were true it would be more of a worry, but then it is hard to know why UTF-8 has to be included if you are on Windows where all the file paths are in UTF-16, not UTF-8...

In fact, given the issues I raised in The torrents of U+fffd (aka When security and conformance trump compatibility and reality), it is fair to say that UTF-8 is entirely inappropriate as an encoding to store file paths in, since there are a great many sequences that are legal in file paths but are illegal in Unicode; they will be replaced with � (U+fffd, REPLACEMENT CHARACTER) and thus will be unreachable by any technology that uses the NLS or .NET code page/encoding conversion mechanisms.
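Here is a small sketch of how that loss happens. It uses Python's own UTF-16 decoder with `errors="replace"` as a stand-in for a conversion that substitutes U+fffd; that it behaves like the NLS or .NET converters in every detail is an assumption, and the helper name is hypothetical. Two different names, each legal as a sequence of 16-bit values, collapse to the same UTF-8 bytes, so neither original can be recovered.

```python
import struct

def to_utf8_with_replacement(units):
    """Hypothetical helper: treat a list of 16-bit values as UTF-16 and
    re-encode as UTF-8, substituting U+FFFD for anything ill-formed."""
    raw = struct.pack("<%dH" % len(units), *units)
    text = raw.decode("utf-16-le", errors="replace")
    return text.encode("utf-8")

name_a = [0x0061, 0xD800, 0x0062]  # 'a', unpaired high surrogate, 'b'
name_b = [0x0061, 0xD801, 0x0062]  # same, with a different unpaired surrogate

converted_a = to_utf8_with_replacement(name_a)
converted_b = to_utf8_with_replacement(name_b)

print(converted_a)                 # contains EF BF BD, the UTF-8 form of U+FFFD
print(converted_a == converted_b)  # True: the two names are now indistinguishable
```

Once the names have been through the conversion there is no way back to the 16-bit values that are actually on disk.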

Though if we ignore that for a moment and go to Dave's concern more directly, the byte layout for UTF-8 is very predictable (I discuss it a bit in posts like Don't want to convert all at once? Well, maybe you could just nibble? and Getting exactly ONE Unicode code point out of UTF-8). Things would actually be quite safe in UTF-8, were it not for the fact that so many valid file paths could never make it there in the first place....
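As a sketch of that predictability (Python again, hypothetical helper name): in well-formed UTF-8 every byte falls into exactly one of three classes, and bytes at or below 0x7F only ever stand for themselves. So a byte search for '/' cannot land inside a multi-byte sequence.

```python
def classify_utf8_byte(b):
    """Classify one byte of well-formed UTF-8 by its value alone."""
    if b <= 0x7F:
        return "ascii"          # a complete one-byte character
    if 0x80 <= b <= 0xBF:
        return "continuation"   # second, third, or fourth byte of a sequence
    if 0xC2 <= b <= 0xF4:
        return "lead"           # first byte of a 2-, 3-, or 4-byte sequence
    return "invalid"            # 0xC0, 0xC1, 0xF5..0xFF never appear

ideograph = "\U00020000".encode("utf-8")
print([classify_utf8_byte(b) for b in ideograph])

path = "dir/\U00020000/file.txt".encode("utf-8")
slashes = [i for i, b in enumerate(path) if b == 0x2F]
print(slashes)  # [3, 8]
```

The four bytes of the ideograph are one lead byte followed by three continuation bytes, all above 0x7F, which is why the only matches for 0x2F are the real path separators.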

So use "W" Win32 API functions and keep it in UTF-16 if you want things to work right here. Every attempt to convert out of it to UTF-8 (this may be a common operation in .NET? Need to check on this) can be disastrous for real world data. :-(

 

This post brought to you by 𠀀 (U+20000, the first CJK EXTENSION B ideograph)


# jmdesp on 23 Oct 2007 1:28 PM:

UTF-16 does not protect you from invalid sequences. If you have garbage inside your file path, you might just as well encounter an isolated surrogate, which is invalid in UTF-16 just as it is in UTF-8. It's just that, the probability being lower, it takes more characters until you meet one.

# Michael S. Kaplan on 23 Oct 2007 1:53 PM:

I agree 100% that it was always invalid -- but it looked just fine on disk and it worked (it was only after the conversion that its invalidity was exposed).

# Andrew West on 23 Oct 2007 5:39 PM:

"any app that can support U+006f U+0308 (ö -- o with combining umlaut) properly can support U+d840 U+dc00 (𠀀, the first Extension B ideograph)."

Any app except Google Reader apparently, where your post ends abruptly with:

Mainly because generally speaking, everything should work just fine -- any app that can support U+006f U+0308 (ö -- o with combining umlaut) properly can support U+d840 U+dc00 (

# Mihai on 23 Oct 2007 8:41 PM:

Not to mention that looking for ASCII (not ANSI) in UTF-8 is not a problem. If you find it, then it is really what you were looking for (due to the structure of UTF-8, which overlaps 100% with plain ASCII). No problems with / \ | . ? etc.

This is (probably) what pleases the advocates of UTF-8: they never had to handle anything beyond ASCII, or simple buffer manipulations.

Once you try going beyond that, they will have the same problems that UTF-16 users have with combining character or surrogates (not many).

But thing is: if you do that kind of advanced processing, UTF-16 users will have problems with surrogates and combining characters, while UTF-8 users will have problems with everything above 127.

Now, you tell me who has the advantage?

:-)

# mlippert on 24 Oct 2007 4:00 PM:

Michael,

I looked at the various links, but I'm still unclear as to what "characters" are legal in a file path, but aren't legal Unicode?

The only things I can think of would be surrogates out of order or alone, or perhaps undefined codepoints? But I can't imagine quite how they used to get created.

The bug that was linked to seemed to mention á ('a' with an acute accent) but I'm not sure why that would be an illegal character.

Mike L.

# Michael S. Kaplan on 24 Oct 2007 5:19 PM:

Unpaired surrogates, permanently reserved characters, all of these things are stripped in conversion and replaced with U+fffd.

# mlippert on 25 Oct 2007 4:24 PM:

Thanks.

So what you're saying is that those things (unpaired surrogates etc) are legal to NTFS because it accepts *any* value as valid, not treating the given path/filename string as a sequence of unicode characters but instead as a sequence of 16-bit words?

Well that would explain the issue you've been discussing.

# Michael S. Kaplan on 25 Oct 2007 4:38 PM:

Exactly -- a clear engineer's solution to the problem! :-)

# Nick Lamb on 27 Oct 2007 12:22 PM:

Here's a fun thought Michael, how many Windows programs do you think realise that valid filenames presented to them as an array of wchar_t are not actually guaranteed to be Unicode strings ?

As with any other failed expectation this has security implications. Do the relevant APIs warn about that ?

# Michael S. Kaplan on 27 Oct 2007 12:29 PM:

Not sure that has to do with THIS post, given that it is talking about valid input. Did you mean for this comment to go to a different post (like the one on UTF-8, maybe)?

