How long is that non-Unicode string?
by Michael S. Kaplan, published on 2006/07/15 13:05 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/07/15/666717.aspx
The question that came to me by email from a woman named Denise was:
What is the easiest way to get the length of an MBCS string in characters, by which I mean not in bytes but having an ideograph be treated as length 1? Assuming you know the code page the string is in, and it is in one of the regular Windows cod pages.
This is actually kind of like an interesting interview question, and gives rise to many possible approaches....
So, in the spirit of interview questions I have asked in the past, I'd like to put this question out there for everyone!
Now of course since it is an "interview" question, the best answer is the one that answers the question while properly balancing concerns like
- performance
- memory usage
- maintainability
- understandability
- while of course solving the actual problem -- the length in characters, not bytes, of a CJK string in a given code page
In keeping with Denise's original formulation of the question, let's limit the code page range to the ANSI/OEM code pages (thus 932, 936, 949, and 950).
This time I am going to try not restricting the comments (other then the usual restricting of anonymous comments), to see what happens....
Ready, set, go!
# Sebastian Redl on 15 Jul 2006 3:11 PM:
Well, there's StringCchLength. However, while MSDN explicitely says that it returns the character count (as opposed to StringCbLength), it does not say what code page is used; thus, it is either the OEM page or the current ANSI page. If it's the latter, as I would assume, you could set the current ANSI page to your desired page, call the function, then set it back. Neither does the general information page about strings offer any information about what code page is used. (While we're talking about encodings, ironically the online MSDN page in question, "About Strings", reports itself to be windows-1252, but gives the example umlaut-u (ü) for CharLower in UTF-8, resulting in mangled display.)
There seems to be no function that takes an LCID or similar identifier that could be used to select a specific code page.
# Michael S. Kaplan on 15 Jul 2006 3:37 PM:
StringCchLength will not return the right results here....
Plus you have knowledge of a code page and need to know how to get the right answer no matter which of the four code pages it is in.
On top of everything else, since a reboot is required for default system locale changes, it would never be the fastest function you could run. :-)
# Adam on 15 Jul 2006 8:20 PM:
Are we allowed to use an external library to help? If so, I choose iconv[0]. It's part of POSIX and available on most Unices and MacOS X, but GNU iconv[1] is also available for win32[2][3].
It does supports the required codepages; 932, 949 and 950 are all available as "CP932", "CP949" and "CP950", while 936 is available under the name GBK. So...
int stringCharLen(char const * src, int codepage)
{
char const * pagename;
switch (codepage) {
case 932: pagename = "CP932";
case 949: pagename = "CP949";
case 950: pagename = "CP950";
case 936: pagename = "GBK";
default: return -1;
}
size_t srclen = strlen(src); /* length of src _in bytes_ */
if (srclen == 0) {
return 0;
}
int chars = 0;
iconv_t cd = iconv_open("UTF-32", pagename);
while (1) {
char dest[5];
size_t destlen = 5;
size_t result = iconv(cd, &src, &srclen, &dest, &destlen);
if (result != (size_t) -1) {
/* no error occurred. Must have just recoded last character. */
++chars;
break;
}
else if (errno == E2BIG) {
/* just recoded one character that wasn't the last one. */
++chars;
}
else {
/* oops! */
chars = -1;
break;
}
}
iconv_close(cd);
return chars;
}
There's not much error checking, and I don't really like the very-simple switch to get charset names from codepages, but...
a) It ought to work (why is that _last_ on your list of requirements? :-)
b) It doesn't require you to fiddle with the current codepage.
c) There's not much to maintain & that ought to help with its clarity.
d) Should have low mem usage.
Downsides are:
a) Performance won't be great. I could have used a longer "dest" buffer and counted the number of characters put into it instead of doing one at a time, but that would have been a bit more complicated. Besides which, optimization is what you do last, if and when you've figured out where the host spot is.
b) Requires a third-party dll. Is that a big deal? *shrug*
[0]
http://www.opengroup.org/pubs/online/7908799/xsh/iconv.h.html
[1]
http://www.gnu.org/software/libiconv/
[2]
http://gettext.sourceforge.net/
[3]
http://ftp.gnu.org/pub/gnu/libiconv/libiconv-1.9.1.bin.woe32.zip
# Michael S. Kaplan on 15 Jul 2006 8:29 PM:
Ok, not too bad -- though that is a lot of separate calls to do the conversions, which might be a bit slower than one might like.
You could do something similar with Win32 calls, too -- which would remove the external libraries issue completely.
But what about a way to avoid the multiple calls to the conversion libraries? That call may be a bit too heavy here....
(as to why I put it last on the list, it is kind of implied and is just a reminder -- the optimizations would be in the other points, of course. <grin>)
# Adam on 15 Jul 2006 9:01 PM:
You can't grab a buffer 4 times the size of the source string and do it all at once if you've no idea how big the source string may be. If it's a multi-megabyte text file that someone's decided to run through your code, you don't want it taking up about 5 times the amount of memory it should (worst case).
So, to reduce the number of calls to iconv() you need to do the recoding in chunks (say, with a 4k dest buffer instead of a 4 byte dest buffer), which is more complex. For proof-of-concept code like the above, I'll stick with the brute-force approach. Especially when I've not even tried to compile it :)
# Vlad on 15 Jul 2006 9:16 PM:
Here is a solution with no libraries at all. It should be reasonably fast for correctly encoded CJK string
inline bool isLeadByte(unsigned char c, int cp){
if ((c < 0x81)||(cp == 932 && c > 0xA0 && c < 0xE0 ))
return false;
else
return true;
}
int countChars(const char *str, int cp){
int count = 0;
unsigned const char *ptr = (unsigned const char *)str;
while(*ptr != 0){
ptr += isLeadByte(*ptr,cp)? 2 : 1;
count++;
}
return count;
}
# Michael Dunn_ on 15 Jul 2006 10:58 PM:
// Written in the comment area's little text box, totally untested ;)
size_t GetCharCount ( LPCSTR p )
{
size_t cch = 0;
while ( 0 != *p )
{
if ( IsDBCSLeadByte(*p) && 0 != p[1] )
{ cch++; p++ }
else
cch++;
p++;
}
return cch;
}
There are two cases, *p is a DBCS lead byte or it isn't. If it's not, then increment the character count. If it is, and the subsequent byte is non-zero (meaning it's a valid trail byte), then increment the count and increment p over the lead byte.
The test of p[1] != 0 handles the case of a lead byte being at the end of the string, with no trail byte.
This uses just two locals, which can probably be put in registers. Almost everything is an increment or compare against zero, which are fast instructions.
The only concern I'd have for maintainability is the p[1] check, but a couple lines of explanatory comments would take care of it.
# Adam on 16 Jul 2006 4:54 AM:
# Vlad on 16 Jul 2006 5:52 AM:
Next try , after rediscovering CharNextExA.
- This version is maintainable, but would require a function call for each DBCS character, so it is slow
#include <windows.h>
int countChars(const char *str, int codepage){
int count = 0;
const char *p;
for(*p = str; *p ; p = CharNextExA(codepage,p,0))
count++;
return count;
}
# KJK::Hyperion on 16 Jul 2006 7:17 AM:
Why, of course you use the same, efficient, flawed method cmd.exe uses to parse batch files under a multibyte codepage: you call GetCPInfo on the desired codepage (in the case of cmd.exe, the current console output codepage, which typically defaults to the current OEM codepage), and this fills you a nice CPINFO structure you can cache until the codepage changes
Counter-question: how, exactly, is this flawed? (answer: it breaks with any multibyte codepage that doesn't use the lead byte/trail byte model, most notably UTF-8. Try chcp 65001 in a cmd session, see what happens when you try to run a batch file after that)
If you don't mind linking to user32.dll or killing Windows NT 3.x support (call me old fashioned), you could always use CharNextExA, altough the documentation is extremely vague as to what constitutes "next", and I have no idea whether it would work with UTF-8 or not
# Michael S. Kaplan on 16 Jul 2006 7:47 AM:
Hello Mike,
Your solution with IsDBCSLeadByte is intriguing, though as Adam mentioned, IsDBCSLeadByteEx would be better since it takes a code page parameter.... :-)
# Michael S. Kaplan on 16 Jul 2006 7:49 AM:
Hello Vlad and KJK:Hyperion,
CharNextExA would work here, and does meet the requirements (and thankfully is a lighter call that the conversion call). The UTF-8 thing is not an issue since that is not in the current question. :-)
Though of course to do its work, CharNextExA uses IsDBCSLeadByteEx calls, so using it directly may be more efficient....
# Adam on 16 Jul 2006 11:46 AM:
KJK: One advantage of my approach that I didn't mention - by simply appending to the switch at the top to get the encoding name from the codepage, adding support for other MBCSs (including UTF-8) and also SBCSs (like 1252) is trivial. It's also possible to add support for dealing with UTF-16 surrogate pairs to count unicode chars outside the BMP properly, as MS has assigned code pages 1200 and 1201 to UTF-16 LE and BE.
# Mihai on 16 Jul 2006 9:19 PM:
"CharNextExA would work here"
Unless you run it on Win XP, where is broken :-)
I can see two ways:
- count chars skiping lead bytes (IsDBCSLeadByteEx)
while( *str ) {
if( !IsDBCSLeadByteEx( cp, *str ) )
++count;
++str;
}
- "pretend" to convert to Unicode using MultiByteToWideChar (NULL output buffer and 0 output len). This will return the number of WCHARs needed (which is the answer)
BUT!
1. the original question was about MBCS, not DBCS. True Michal reduced the problem to the ANSI code pages. But if you have to also support GB-18030 (and you must, if you want to sell in China :-), then the first solution is bad.
And the second should really do the conversion and count the surrogate pairs as one character
2. Japanese is trickier. If the string contains narrow katakana, then some of them will need two elements. So sometimes you need 4 bytes for a narrow katakana. This error is there even after converting to Unicode.
Example: CA DF (SJIS) converts to <U+FF8A,U+FF9F>, Halfwidth Katakana Letter Ha + Halfwidth Katakana semi-voiced sound mark.
But for the user this is Katakana Leter Pa (U+30D1), one character.
The solution is to use LCMapStringEx with LCMAP_FULLWIDTH first.
3. If the real request is "count what the user percieves as character", it might be even more problematic. What is a "character" for a native Korean? The bundle of 2-4 sylables that are encoded in the Hangul Syllables block, is counted as one character, or each syllable should be counted? Is 걱 (U+AC71) one or 3 characters? I don't know, I should ask a native.
But if the answer is the second, then we need decomposition info.
And we still cannot blindly decompose, because this will also decompose the Japanese Kana that are present in the Korean code page! Probably decompose all, follwed by LCMapStringEx with LCMAP_FULLWIDTH.
But what about Russian characters (like cp949<AC D7>, mapping to U+0451, which also gets decomposed and is not fixed by LCMapStringEx.
No easy solutions here :-)
Ouch, my head hurts, so I will stop :-)
# Michael S. Kaplan on 16 Jul 2006 10:15 PM:
CharNextExA is not broken on XP, as far as I know -- only CharNextW is.... or is this some other issue?
I never make people solve new problems unless I expand the question later. :-)
# Adam on 17 Jul 2006 5:53 AM:
Michael: But the question is always expanded later. :)
I've had enough bosses who've asked for "x" and later asked for "x + y", because "y is kind of like x - how hard can it be?" to be very wary of phrases like "let's limit the code page range to ...". There's very often an unspoken "... for now", and, being human, there is the occasional genuine mistake - "Sorry, we forgot 20936. Can you add that? It's /just one more/ code page. It's not really a /new/ problem, it's just an extension of the old one."
# Mihai on 17 Jul 2006 12:23 PM:
"CharNextExA is not broken on XP, as far as I know -- only CharNextW"
Sorry, did not test. Just (wrongly) asumed.
In my mind was "CharNextEx is broken," without realizing that my mind migrated to Unicode a while ago :-)
# Michael S. Kaplan on 17 Jul 2006 12:50 PM:
Hi Adam --
Since this is not an actual interview, I never intended to expand the scope of the problem -- though this is a technique I have used in the past with interview questions, so I don't mind people exploring the idea here. :-)
Hi Mihai --
The 'W' and 'A' versions of these functions are trying to do two very different operations, with no reasonable overlasp between them. :-)
# Michael S. Kaplan on 17 Jul 2006 1:35 PM:
On the other hand, if an interview question is for ABC, then you may ask a clarifying question about the possible future need to support ABCDEFG (you may even leave your design open enough to support it in the future), but simply doing it rather than answering the original question suggests a problem in understanding project scope. :-)
# Mihai on 17 Jul 2006 4:08 PM:
On the other hand, if an interview question is for ABC, then you may ask a clarifying question about the possible future need to support ABCDEFG
If this is about my post, then:
- 2 and 3 are true for the original problem
- for 1 I make clear that I understand the exclusion.
And in an interview I would ask (it takes 10 sec, unlike here, where it can take hours :-)
And, in general, here one can take some liberties that are not ok in an interview (ie mispelling Michael as "Michal", sorry :-)
# Michael S. Kaplan on 17 Jul 2006 4:58 PM:
It was more about Adam's, but the points are still valid. :-)
Please consider a
donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.
referenced by
go to newer or older post, or back to index or month or day