Sometimes a WCHAR really *is* just a character....

by Michael S. Kaplan, published on 2007/01/24 08:01 +00:00, original URI: http://blogs.msdn.com/michkap/archive/2007/01/24/1520227.aspx


Yesterday in response to When is a character not a character?, reader Bart commented:

Maybe you should write a post about how the concept of a character in the sense of wchar should be deprecated for uses other then datastorage or maybe a codepoint.

And maybe explain how to handle the kind of characters this article is about and what sets them aside from 'normal' strings. (and maybe how to recognize them so that you can still do things like ReplaceStr)

Funny me, I thought the issues behind the first part of what he is talking about were kind of what I had been doing in all these posts like that one and this one and this one and a whole bunch of others. Hell, Raymond even covered this recently. :-)

Though I am not sure I agree that the best answer is to deprecate whole definitions here. After all, definitions are context sensitive. For most purposes programmers need to care about the old definition, since it controls important aspects like memory allocation, maximum buffer sizes, and so on. Telling them that their thinking is all wrong is just not such a good idea, since for the most part their thinking is spot on.
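To make the "old definition" concrete, here is a minimal sketch of counting WCHARs (UTF-16 code units) for the purpose of sizing a buffer. The helper names `wchar_count` and `buffer_bytes` are made up for illustration, and the sketch assumes a 2-byte WCHAR and a single null terminator, as on Windows.

```python
def wchar_count(text: str) -> int:
    """Number of UTF-16 code units (WCHARs) needed to store text."""
    # Encoding without a BOM gives exactly 2 bytes per code unit.
    return len(text.encode("utf-16-le")) // 2


def buffer_bytes(text: str) -> int:
    """Bytes to allocate for text plus one terminating null WCHAR."""
    return (wchar_count(text) + 1) * 2


if __name__ == "__main__":
    # U+1D11E is outside the BMP, so it needs a surrogate pair.
    for sample in ["A", "abc", "\U0001D11E"]:
        print(repr(sample), wchar_count(sample), buffer_bytes(sample))
```

For allocation, the count of code units is the number that matters, whatever the user would call a "character".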

The fact remains that developers need to understand that taking one letter and adding a whole buttload of diacritics still gives you what is basically one character in some sense. One simply ought to be aware of these two definitions so that foolish things like doing one's own cursor movements in a control, rather than letting Uniscribe and its data do the work, can be avoided. Because sometimes ligatures ARE supposed to be thought of as 'single characters' and other times they are not. So it is best to let the system help with these types of decisions rather than rolling your own, since the OS has a lot more data to work from.
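As a rough illustration of the two definitions side by side, the sketch below glues nonspacing and enclosing marks onto the preceding base letter and compares that count with the number of code points. The function name `naive_clusters` is hypothetical, and the approach is deliberately naive: it knows nothing about ligatures, joiners, Hangul jamo, or script-specific rules, which is exactly the kind of data the system has and a hand-rolled loop does not. It is not how Uniscribe works.

```python
import unicodedata


def naive_clusters(text: str) -> list[str]:
    """Group each base code point with the marks that follow it.

    Assumption: only general categories Mn (nonspacing mark) and
    Me (enclosing mark) are treated as attaching to the previous
    code point. Real text segmentation needs far more data.
    """
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch) in ("Mn", "Me"):
            clusters[-1] += ch  # attach the mark to its base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters


if __name__ == "__main__":
    # 'e' + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
    stacked = "e\u0301\u0323"
    # BENGALI LETTER DDA + BENGALI SIGN NUKTA (U+09BC)
    bengali = "\u09a1\u09bc"
    for sample in (stacked, bengali):
        print(len(sample), "code points,", len(naive_clusters(sample)), "cluster(s)")
```

Both samples occupy several WCHARs in storage, yet a caret should step over each of them as one unit.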

I'll talk about the find/replace issue another day. :-)

 

This post brought to you by ় (U+09BC, a.k.a. BENGALI SIGN NUKTA)



referenced by

2008/10/06 UCS-2 to UTF-16, Part 4: Talking about the ask
