by Michael S. Kaplan, published on 2007/05/03 02:59 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/05/02/2388803.aspx
bcbryant asks:
DBCS trailing bytes always seem to be greater than or equal to 0x40, although GB18030 mentions some use of 0x30+ in four-byte characters. Do you know of any official source for a restriction that would allow me to depend on the trailing bytes being at least 0x30? This restriction would make it convenient to trim trailing ASCII whitespace from strings in multibyte encodings.
There is no specific rule or restriction on the minimum for the trail byte values (though generally staying out of the very low ASCII ranges seems to have been a pretty consistent theme).
Note that these code pages were all based on various national standards, so as principles go this one may have just been inherited! :-)
There are no plans by Microsoft to add more code pages or modify existing ones ever, so relying on the ones that are there now should be good enough....
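To make the convenience concrete, here is a minimal sketch of the kind of trim the question has in mind. It assumes, as the question hopes, that no trail byte in the code pages being handled falls below 0x30; under that assumption a trailing byte equal to space, tab, CR, or LF can only be a real single-byte character. The names is_ascii_ws and trimmed_length are invented for this illustration and are not part of any Windows or CRT API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true for the ASCII whitespace bytes we want to trim.
// All four values are below 0x30.
inline bool is_ascii_ws(unsigned char b)
{
    return b == 0x20 || b == 0x09 || b == 0x0A || b == 0x0D;
}

// Hypothetical helper: returns the length of the buffer once trailing
// ASCII whitespace bytes are dropped.
// ASSUMPTION: in the encoding of `p`, no trail byte is below 0x30, so a
// whitespace-valued byte is never the second half of a multibyte character
// and no backward scan for lead bytes is needed.
inline std::size_t trimmed_length(const char* p, std::size_t n)
{
    assert(p != nullptr || n == 0);
    while (n > 0 && is_ascii_ws(static_cast<unsigned char>(p[n - 1])))
        --n;
    return n;
}
```

For the Shift-JIS bytes 83 40 20, this returns 2: the space is dropped, and the trail byte 0x40 is left alone because it is not one of the whitespace values. If a code page did use a whitespace value as a trail byte, this shortcut would cut a character in half, which is exactly why the guarantee matters.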
This post brought to you by U+0020, a.k.a. SPACE
Ben Bryant on 3 May 2007 7:46 AM:
Thanks!
Michael Dunn_ on 3 May 2007 1:03 PM:
You could use _IsDBCSTrailByte from WTL:
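bool _IsDBCSTrailByte(LPCTSTR lpstr, int nChar)
{
#ifndef _UNICODE
    // Walk backwards over the run of lead-byte candidates just before nChar.
    int i = nChar;
    for( ; i > 0; i--)
    {
        if(!::IsDBCSLeadByte(lpstr[i - 1]))
            break;
    }
    // An odd-length run means the byte at nChar completes a pair.
    return ((nChar > 0) && (((nChar - i) & 1) != 0));
#else // _UNICODE
    lpstr; nChar;   // unused in Unicode builds
    return false;
#endif // _UNICODE
}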
This returns true if lpstr[nChar] is a trail byte.
Mihai on 3 May 2007 3:02 PM:
The logic in the WTL _IsDBCSTrailByte is "if the previous character is lead byte, then this one is trail byte."
It does not work for GB 18030, which can have up to 4 bytes per character.
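A rough sketch of the GB 18030 problem, under stated assumptions: is_lead_candidate is a hypothetical stand-in for ::IsDBCSLeadByte that accepts 0x81..0xFE (the range GB 18030 uses for the first and third bytes of a four-byte character), and pairwise_is_trail is a portable rendition of the WTL loop, not the WTL code itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for ::IsDBCSLeadByte.
// ASSUMPTION: any byte in 0x81..0xFE is treated as a lead-byte candidate.
inline bool is_lead_candidate(unsigned char b)
{
    return b >= 0x81 && b <= 0xFE;
}

// Portable rendition of the WTL scan: count the consecutive lead-byte
// candidates immediately before `pos`; an odd count means "trail byte".
inline bool pairwise_is_trail(const unsigned char* s, std::size_t pos)
{
    assert(s != nullptr);
    std::size_t i = pos;
    while (i > 0 && is_lead_candidate(s[i - 1]))
        --i;
    return ((pos - i) & 1) != 0;
}
```

Take the bytes 41 81 30 81 30, that is, 'A' followed by one four-byte character. The scan reports indexes 2 and 4 as trail bytes, but it reports index 3 (the third byte of the character) as not a trail byte, so a caller would treat the middle of the character as a safe place to split.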
Michael Dunn_ on 3 May 2007 8:37 PM:
The code is definitely written with the assumption that no code page uses more than 2 bytes for a "character". Isn't that true for all the multi-byte code pages that Windows supports?
Michael S. Kaplan on 3 May 2007 9:20 PM:
Nope, there are several algorithmic and DLL based ones that do not follow these rules (though the lead byte function does not work for those code pages, either!).
Ben Bryant on 4 May 2007 9:41 AM:
Mike, I think you are right that there are no Windows locale "ANSI" code pages that have more than 2 bytes. But even assuming DBCS, I think your code does not work because leading byte codes can overlap trailing byte codes. So IsDBCSLeadByte could return true for the previous byte even if it is actually a trailing byte.
Ben Bryant on 4 May 2007 11:32 PM:
Oh -- I realized that:
1. _IsDBCSTrailByte is not Mike's; it is WTL's function.
2. the logic is not as Mihai described it: "if the previous character is lead byte, then this one is trail byte" which for example would be a very bad assumption when dealing with any single-byte Shift-JIS character after 9e 9e (U+68CD).
Actually, the logic of _IsDBCSTrailByte is:
"go back to the previous non-lead byte and return true if it is an odd number of bytes before the one in question"
The worst-case scenario is extremely inefficient, which is another reason I would like to know if there are any trail bytes under 0x30. I guess I will have to do the survey myself....
Ben Bryant on 5 May 2007 1:23 AM:
make that: "go back to the previous non-lead byte and return true if it is an *even* number of bytes before the one in question"
which makes sense because that is the only way of definitively finding a lead byte -- it has to be after a character that cannot be a lead byte. Once you have a definite lead byte, you've got an anchor from which to determine the byte in question. Since all characters back to that anchor are candidate lead bytes, the characters must all be byte pairs. Hence the test: the previous non-lead byte must be an even number of bytes before the byte in question.
The worst case scenario is you have a very long string, with *all* 2 byte pairs in which all the trail bytes happen to be valid lead bytes too.
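Here is a small sketch contrasting the two readings of the logic, using the 9e 9e example from above. It is only an illustration under assumptions: is_sjis_lead is a hypothetical stand-in for ::IsDBCSLeadByte under code page 932 (lead bytes 0x81..0x9F and 0xE0..0xFC), the buffer is assumed to start on a character boundary, and both function names are made up.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for ::IsDBCSLeadByte under code page 932.
inline bool is_sjis_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// The rule as Mihai described it: previous byte looks like a lead byte,
// so this one is a trail byte. Wrong when that previous byte is itself
// a trail byte whose value overlaps the lead-byte range.
inline bool naive_is_trail(const unsigned char* s, std::size_t pos)
{
    assert(s != nullptr);
    return pos > 0 && is_sjis_lead(s[pos - 1]);
}

// The parity scan: walk back to the previous byte that cannot be a lead
// byte, then check whether the run of candidates before `pos` has odd
// length. Cost is proportional to the length of that run.
inline bool scan_is_trail(const unsigned char* s, std::size_t pos)
{
    assert(s != nullptr);
    std::size_t i = pos;
    while (i > 0 && is_sjis_lead(s[i - 1]))
        --i;
    return ((pos - i) & 1) != 0;
}
```

With the bytes 78 9E 9E 41 ('x', the two-byte character 9e 9e, then 'A'), the naive rule calls the 'A' at index 3 a trail byte because the byte before it is in the lead range. The parity scan walks back two candidates to 'x', sees an even run, and correctly reports that 'A' stands alone. A buffer made entirely of pairs like 9e 9e forces the scan to walk all the way back to the start for every query, which is the worst case described above.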
Yuhong Bao on 26 Nov 2010 11:09 PM:
"There is no specific rule or restriction on the minimum for the trail byte values (though generally staying out of the very low ASCII ranges seems to have been a pretty consistent theme)."
Yea, the fact that some DBCS encodings, including all four of the Windows DBCS codepages, use part of the ASCII range for trail bytes causes a lot of pain. For example: