Thank Bob that there are no time machines!

by Michael S. Kaplan, published on 2007/01/24 03:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2007/01/24/1520365.aspx


In response to Raymond's What('s) a character!, Miral commented:

This whole discussion is why I heartily wish that *all* WinAPIs, without exception, exclusively used a count of bytes and not characters or storage characters or whatever.

I know, I know, no time machines.  Doesn't stop me grumbling about it though :)

Well, let me state for the record that I am glad that Miral has no time machine!

I mean, let's consider what this would mean for applications that may or may not support Unicode (like that one I built just recently that we are shipping soon!). How on earth would the case of a byte count that is randomly doubled, depending on which version of a function is called, be handled here?
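Just to make the grumble concrete, here is a minimal C sketch (with GetWindowText standing in for any of the hundreds of A/W pairs) of what that doubling would look like:

```c
#include <windows.h>

/* A minimal sketch of the problem: with character counts, the same
   literal is correct for both the "A" and the "W" flavor; with byte
   counts it would not be. */
void Sketch(HWND hwnd)
{
    char  bufA[260];
    WCHAR bufW[260];

    /* Both calls take a count of characters today, so 260 is right twice. */
    GetWindowTextA(hwnd, bufA, 260);
    GetWindowTextW(hwnd, bufW, 260);

    /* With byte counts, the same source line would need 260 in one call
       and 520 in the other -- unless every caller remembers sizeof(). */
}
```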

Thoughts about Microsoft shipping out that mythical "MS Time" product to folks like Miral? Don't try to rock me to sleep with bedtime stories like that -- I can't believe people wonder why I am up posting at 3am so many nights! :-)

 

This post brought to you by 2 (U+0032, a.k.a. DIGIT TWO)


# Bob (no relation) on Wednesday, January 24, 2007 11:42 AM:

It seems to me that the difference in mindset between the "bytes dammit" and "logical characters" camps comes down to expected market. For people who actually ship multilingual applications, or work in multibyte code page areas, the character concept makes sense and is the appropriate abstraction.

For those (primarily Americans) who are targeting in-house deployment of things like 3D graphics renderers or networking apps--something where Latin-1 character sets are overwhelmingly dominant and text is meaningless--bytes are still the natural (if limited) viewpoint. All THEIR words have a one-to-one mapping between bytes and characters, so why bother? Just pass the buffer size so you know you're not going to scribble on something accidentally.

For the record, I'm in the second class of programmers. (I prefer "character" counts though.) I do try to stay in Unicode where possible (though using TCHAR or Qt's QString rather than wchar_t), but the code I write is either UI-agnostic (things like display hacks) or exclusive to two dozen English speakers who don't put customer names into it. So when I fall off the wagon and use one of our many ANSI-based graphics libraries, I don't really feel that bad...

# Nick Lamb on Wednesday, January 24, 2007 12:48 PM:

Bob, you may be missing the point here.

This isn't about "bytes dammit" vs "logical characters", it's about "bytes dammit" vs "arbitrary 16-bit code units dammit".  The latter only makes sense if you happen to be using UTF-16 or, worse, UCS2. If you don't know about Unicode encodings, don't care about them, or wish you didn't care about them, the "arbitrary 16-bit code units" camp has nothing to offer you.
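To illustrate, a small C sketch, assuming a Windows toolchain where wchar_t is a 16-bit UTF-16 code unit:

```c
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* U+1D11E MUSICAL SYMBOL G CLEF: one code point, one visible
       character, but two UTF-16 code units (a surrogate pair). */
    const wchar_t *clef = L"\U0001D11E";

    /* On a 16-bit wchar_t platform this prints 2 -- the "length" is a
       count of code units, not anything a user would call a character. */
    wprintf(L"code units: %u\n", (unsigned)wcslen(clef));
    return 0;
}
```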

I saw a saying from Turkey or maybe Eastern Europe somewhere today that sums this up, if I recall correctly it goes "It doesn't matter how far you've come down the wrong road, the only thing to do is turn back".

# Michael S. Kaplan on Wednesday, January 24, 2007 12:57 PM:

Hi Nick,

Um, HUH? UTF-32 has the same problem with strings made up of base letters plus combining characters. It has nothing to do with UTF-16 being a bad idea or a wrong one (a point on which I believe you and I disagree). Even if there were a bunch of UTF-32 functions in the Win32 API, this would still be about people who wanted a byte count to be passed, and that count would still disagree with logical characters. So it's a bug in all cases.
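A quick C11 sketch of the combining character case, just to show the count is still in code units even with UTF-32:

```c
#include <stdio.h>
#include <uchar.h>

int main(void)
{
    /* U+0065 LATIN SMALL LETTER E followed by U+0301 COMBINING ACUTE
       ACCENT: what the user sees as one character ("é") is two UTF-32
       code points. */
    const char32_t e_acute[] = U"e\u0301";

    /* The array includes the terminating null, so subtract one. */
    size_t units = sizeof(e_acute) / sizeof(e_acute[0]) - 1;
    printf("UTF-32 code units: %u\n", (unsigned)units); /* prints 2 */
    return 0;
}
```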

Perhaps UTF-8 is somehow "purer" for having only a byte count, but it is actually much harder to use for many operations, which is why it does not tend to be the internal processing format for most products.

# Nick Lamb on Wednesday, January 24, 2007 2:53 PM:

What's so difficult Michael? Are you disputing the original contention that the Win32 APIs are unnecessarily confusing and arbitrary here? The thread this came from would have been a better place to do that.

My point (as was intended) is that very little of a typical program needs to care about this artefact of internal representation. Yet the WCHAR nonsense makes you worry about it because of allocation. You can't actually use a WCHAR to do anything meaningful, because it's not wide enough to put anything in it. So knowing that the OS vendor thinks strings come in units of two bytes doesn't help you much at all, it's just more trivia to remember.

We don't have to do thought experiments about alternatives because they really exist, and we observe that when we use bytes to track string length the trouble seen with "logical characters" in Win32 goes away, and programmers can treat strings as opaque structures of so-and-such many bytes for most of the program. Of course it would also be nice if they were able to store the opaque structure in a file or send it over a network.

Now it so happens that the standard way to do this AND retain trivial source compatibility with code written for legacy encodings like iso-8859-1 or windows-1252 is to use UTF-8, but that isn't a surprise, it's why UTF-8 was created after all.
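A tiny C sketch of that compatibility point, with the UTF-8 spelled out as escaped bytes so nothing depends on the source file encoding:

```c
#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "naïve" in UTF-8: the ï is the two bytes 0xC3 0xAF, so byte-oriented
       code written for iso-8859-1 or windows-1252 still works unchanged;
       it just sees six bytes instead of five characters. */
    const char *word = "na\xC3\xAFve";

    char copy[16];
    strcpy(copy, word);                            /* legacy byte API, no change */
    printf("bytes: %u\n", (unsigned)strlen(copy)); /* prints 6 */
    return 0;
}
```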

# Rosyna on Wednesday, January 24, 2007 8:27 PM:

This is why I love abstract string types. It doesn't matter what the encoding of the string is, the string itself is an object. If you need to encode it specifically, you convert it to a specific encoding given an external byte buffer.

CFStrings, ftw! http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html

It gets rid of all that tchar, wchar, et cetera cruft that often doesn't matter until you're converting it.

# Michael S. Kaplan on Wednesday, January 24, 2007 8:28 PM:

As I said, plenty of people have found that using UTF-8 for string operations is fraught with complications, and life is much easier using UTF-16 or UTF-32. And since almost every issue that applies to UTF-16 also applies to UTF-32, taking all of this into account the only "wrong road" is to continually assume that anyone who went with UTF-16 has gone down the "wrong road."

Given the number of companies like Oracle and Sybase that used UTF-8 as a crutch to get to UTF-16 for many processing operations, it seems like other companies have come to the same conclusion.

# Rosyna on Wednesday, January 24, 2007 8:33 PM:

"I mean, how on earth would the case of a byte count that is randomly doubled depending on which version of a function is going to be handled here?"

Oh, having an abstract string type would also fix that. Since you cannot get at the data directly, you'd always have to go through the string object calls, which would automatically handle changing the size. So it just wouldn't matter.
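A C sketch of the shape I mean -- HSTR and the Str* functions here are hypothetical names, not real Win32 calls:

```c
#include <stddef.h>

/* Hypothetical opaque string handle, in the style of HWND or HMENU:
   callers never see the underlying bytes, so the A-vs-W byte-count
   question never comes up. */
typedef struct HSTR__ *HSTR;

HSTR   StrCreateFromBytes(const void *bytes, size_t cb, int encoding);
size_t StrGetLength(HSTR s);       /* defined by the library, not by sizeof */
int    StrCompare(HSTR a, HSTR b);
void   StrDestroy(HSTR s);
```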

# Michael S. Kaplan on Wednesday, January 24, 2007 8:44 PM:

Um, yes -- at the cost of performance in many cases.

Elegance of code and/or simplicity of it is a feature, to be sure, but there are other valuable features that affect the equation of what is best to use.

# Rosyna on Wednesday, January 24, 2007 11:27 PM:

Yes, performance concerns are a very good reason to choose abstract/opaque strings.

With opaque types, you can optimize heavily for specific (yet common) situations. A great example is comparing equality. If lexical equality isn't needed, you can reject on hash rather than doing character by character comparison (which is much, much slower than checking the hash).
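A C sketch of that optimization, using a hypothetical opaque_str layout with the hash computed once at creation time:

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: the hash is filled in when the string is created,
   so most unequal strings are rejected without touching the bytes. */
struct opaque_str {
    size_t        len;
    unsigned long hash;
    const char   *bytes;
};

static bool opaque_str_equal(const struct opaque_str *a,
                             const struct opaque_str *b)
{
    if (a->hash != b->hash || a->len != b->len)
        return false;                        /* cheap rejection path */
    return memcmp(a->bytes, b->bytes, a->len) == 0;
}
```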

Another example is file strings. If you load strings from a file, it automatically means they're constant and immutable. You can also optimize for this by not loading the entire file into memory, but having a weak reference to the file inside the opaque strings.

GUIDs/UUIDs are another great example. You can use the actual numerical value of the GUID as the hash (very fast) or drop the machine specific part from the hash in an opaque data type.

Dropping the memory usage for a string can also increase performance. For example, making all characters that overlap between ASCII and Unicode occupy only 8 bits, even when the string contains Unicode characters, can nearly halve memory consumption (compared to using double-byte strings).

Too often people think of performance only in the context of doing it "right" in non-opaque/non-abstract schemes and forget about all the things that can be done when you can store a little extra metadata in an opaque type.

Of course, if you're "stuck" in that kind of performance thinking, most opaque string implementations allow you to have an external no copy buffer.

Also, this relates well to my entire "paths are evil" doctrine. With opaque strings, you can store references to files without having to deal with paths, yet still do it inside a string object so everything works as needed (ie, convert it to a real path or some other data when you need to store/display it).

# Michael S. Kaplan on Wednesday, January 24, 2007 11:49 PM:

Ok, back to the real world of needing to support C (which is the least common denominator of most of the Win32 API header files), let's try again here, Rosyna?

# josh on Thursday, January 25, 2007 2:43 AM:

I dunno... I think the two are pretty close to equally nice, with byte count being slightly nicer.

Buffers would become sizeof(buffer) instead of sizeof(buffer)/sizeof(*buffer).  Strings where you have lengths...  how useful is the length in "characters" anyway?  What does it mean?  Actually traversing strings and buffers works nicer with a pair of pointers.  I guess the character count would be better for declaring buffers.

"I mean, how on earth would the case of a byte count that is randomly doubled depending on which version of a function is going to be handled here?"

Multiply by sizeof(*whatever), just like you have to occasionally divide by it now.  Or just never get the unmultiplied length to begin with.
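A small C sketch of that bookkeeping, with GetWindowText standing in for any character-counted API:

```c
#include <windows.h>
#include <tchar.h>

/* Character-counted APIs want the sizeof/sizeof division; a byte-counted
   one would just take sizeof(title). Traversal with a pair of pointers
   reads the same either way. */
void Sketch(HWND hwnd)
{
    TCHAR title[256];
    GetWindowText(hwnd, title, sizeof(title) / sizeof(*title));

    /* Begin/end pointer pair: no length unit to argue about. */
    for (const TCHAR *p = title, *end = p + _tcslen(p); p != end; ++p) {
        /* ... */
    }
}
```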

# Rosyna on Thursday, January 25, 2007 3:16 AM:

What I was talking about was the idea of supporting the time machine idea, so that there would be no need for Win32 strings to be C strings. So if Miral did have a time machine, he could go back and push for an opaque data type, thereby making all of this C string stuff completely moot (much like HWND is an opaque data type).

Unless I am misunderstanding what you mean, an opaque data type works fine in C (see HWND, HMENU, well, all the handle stuff, et cetera). Or did I misunderstand what you meant by "supporting C"?

Again, this is all hypothetical, in the situation that Miral had a time machine, could go back, and could somehow make the bucket-of-bytes style of strings in Win32 disappear while still being capable of supporting Unicode (which would also mean no A and W versions would be needed for a lot of functions).

FWIW, the CFString stuff (a CFStringRef is an opaque pointer, equivalent to the HANDLE datatype on Win32, where Ref==HANDLE in context) makes a horrible assumption that a character is always a single UTF-16 code unit, which causes issues for code points that require surrogate pairs.
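A small C sketch of that assumption, assuming a platform where CoreFoundation is available:

```c
#include <stdio.h>
#include <CoreFoundation/CoreFoundation.h>

int main(void)
{
    /* U+1D11E MUSICAL SYMBOL G CLEF, spelled out as UTF-8 bytes:
       one code point, one user-visible character. */
    CFStringRef clef = CFStringCreateWithCString(kCFAllocatorDefault,
                                                 "\xF0\x9D\x84\x9E",
                                                 kCFStringEncodingUTF8);

    /* CFStringGetLength counts UTF-16 code units, so this prints 2. */
    printf("length: %ld\n", (long)CFStringGetLength(clef));

    CFRelease(clef);
    return 0;
}
```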

# Michael S. Kaplan on Thursday, January 25, 2007 3:25 AM:

Ah, one would need more than a time machine to convince people to abandon what they had already and start over from scratch (since such an effort is doomed to failure anyway as the system wouldn't be compatible with anything, and compat was the biggest selling point of Windows!).

# Rosyna on Thursday, January 25, 2007 11:36 PM:

I'm not sure I understand how this would affect backwards compatibility. After all, unicows.dll only works on Windows 95 and later, does it not? It was released in 2001, correct? And there were no W versions of functions on Windows 95, right?

So just imagine instead of ever implementing unicows.dll as it was, or ever creating W versions of functions, all the effort was put into making the functions take opaque strings. Then the "A" versions of functions could just be shims that convert the strings into opaque strings, much like they're just wrappers for the W versions in Windows XP.

The W versions were already "starting from scratch" for all intents and purposes in this case.

FWIW, Apple did exactly this. They made new APIs take a CFString object and deprecated the old functions that took a bucket of bytes. It's worked quite well. They started the transition back in Mac OS 8.1, back when the CFString versions of the non-CF functions were just wrappers around the bucket-of-bytes versions (in Carbon). In Mac OS X, the situation was reversed.

I do think it'd be possible to convince people to use an opaque object just using the HWND and HMENU examples if I had a time machine.

# Michael S. Kaplan on Friday, January 26, 2007 12:02 AM:

Well, I look forward to seeing you attempt to do this, I guess?

There were "W" functions on Wiin95, even, and internally a bunch of Win95 was using Unicode. Sorry!

# Dean Harding on Saturday, January 27, 2007 4:03 AM:

Rosyna: They (Microsoft, that is) have that already, it's called .NET.

It works pretty well, actually, at hiding the implementation details of strings. The abstraction really only leaks when you're interfacing with "other" (non-.NET) things (like files, p/invoke, sockets, etc).

Of course, as Michael says, it comes at the cost of performance - there is a fairly strong reliance on StringBuilders in .NET. At some point, the lowest level needs to use a "bucket of bytes." It's just where you choose to put the "lowest level" - Windows puts it at the Win32 API level, Mac OS puts it one level lower.

