Why I don't like the IsTextUnicode API

by Michael S. Kaplan, published on 2005/01/30 02:38 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/01/30/363308.aspx


The IsTextUnicode API has been around since NT 3.5, according to the Platform SDK histories. According to the PSDK, its purpose is as follows:

The IsTextUnicode function determines whether a buffer is likely to contain a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

It then goes on to describe the many different tests that it can do when the appropriate flags are passed:

IS_TEXT_UNICODE_ASCII16
   The text is Unicode, and contains only zero-extended ASCII values/characters.

IS_TEXT_UNICODE_REVERSE_ASCII16
   Same as the preceding, except that the Unicode text is byte-reversed.

IS_TEXT_UNICODE_STATISTICS
   The text is probably Unicode, with the determination made by applying statistical analysis. Absolute certainty is not guaranteed. See the following Remarks section.

IS_TEXT_UNICODE_REVERSE_STATISTICS
   Same as the preceding, except that the probably-Unicode text is byte-reversed.

IS_TEXT_UNICODE_CONTROLS
   The text contains Unicode representations of one or more of these nonprinting characters: RETURN, LINEFEED, SPACE, CJK_SPACE, TAB.

IS_TEXT_UNICODE_REVERSE_CONTROLS
   Same as the preceding, except that the Unicode characters are byte-reversed.

IS_TEXT_UNICODE_BUFFER_TOO_SMALL
   There are too few characters in the buffer for meaningful analysis (fewer than two bytes).

IS_TEXT_UNICODE_SIGNATURE
   The text contains the Unicode byte-order mark (BOM) 0xFEFF as its first character.

IS_TEXT_UNICODE_REVERSE_SIGNATURE
   The text contains the Unicode byte-reversed byte-order mark (Reverse BOM) 0xFFFE as its first character.

IS_TEXT_UNICODE_ILLEGAL_CHARS
   The text contains one of these Unicode-illegal characters: embedded Reverse BOM, UNICODE_NUL, CRLF (packed into one WORD), or 0xFFFF.

IS_TEXT_UNICODE_ODD_LENGTH
   The number of characters in the string is odd. A string of odd length cannot (by definition) be Unicode text.

IS_TEXT_UNICODE_NULL_BYTES
   The text contains null bytes, which indicate non-ASCII text.

IS_TEXT_UNICODE_UNICODE_MASK
   This flag constant is a combination of IS_TEXT_UNICODE_ASCII16, IS_TEXT_UNICODE_STATISTICS, IS_TEXT_UNICODE_CONTROLS, IS_TEXT_UNICODE_SIGNATURE.

IS_TEXT_UNICODE_REVERSE_MASK
   This flag constant is a combination of IS_TEXT_UNICODE_REVERSE_ASCII16, IS_TEXT_UNICODE_REVERSE_STATISTICS, IS_TEXT_UNICODE_REVERSE_CONTROLS, IS_TEXT_UNICODE_REVERSE_SIGNATURE.

IS_TEXT_UNICODE_NOT_UNICODE_MASK
   This flag constant is a combination of IS_TEXT_UNICODE_ILLEGAL_CHARS, IS_TEXT_UNICODE_ODD_LENGTH, and two currently unused bit flags.

IS_TEXT_UNICODE_NOT_ASCII_MASK
   This flag constant is a combination of IS_TEXT_UNICODE_NULL_BYTES and three currently unused bit flags.

Sound impressive and interesting enough yet?
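For a feel of what the simpler, deterministic tests amount to, here is a minimal portable C sketch. It is not the API's implementation: the SK_* bits and function names are hypothetical, the bit values are not the real IS_TEXT_UNICODE_* values, and the exact rules (for example, that a BOM counts as non-ASCII here) are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result bits for this sketch only; these are NOT the
   real IS_TEXT_UNICODE_* values from the Windows headers. */
enum {
    SK_SIGNATURE         = 1u << 0, /* first code unit is 0xFEFF        */
    SK_REVERSE_SIGNATURE = 1u << 1, /* first code unit is 0xFFFE        */
    SK_ASCII16           = 1u << 2, /* every code unit is 0x0001-0x007F */
    SK_ODD_LENGTH        = 1u << 3, /* byte count is odd                */
    SK_BUFFER_TOO_SMALL  = 1u << 4  /* fewer than two bytes             */
};

/* Read one 16-bit code unit as little-endian, whatever the host order. */
static uint16_t sk_read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Run a few deterministic checks over cb bytes and return SK_* bits. */
unsigned sk_deterministic_tests(const unsigned char *buf, size_t cb)
{
    unsigned result = 0;
    size_t i, units = cb / 2;
    int all_ascii = 1;
    uint16_t first;

    assert(buf != NULL || cb == 0);
    if (cb < 2)
        return SK_BUFFER_TOO_SMALL;
    if (cb % 2 != 0)
        result |= SK_ODD_LENGTH;

    first = sk_read_le16(buf);
    if (first == 0xFEFF) result |= SK_SIGNATURE;
    if (first == 0xFFFE) result |= SK_REVERSE_SIGNATURE;

    /* Assumption of this sketch: a BOM counts as a non-ASCII code unit. */
    for (i = 0; i < units; i++) {
        uint16_t u = sk_read_le16(buf + 2 * i);
        if (u == 0 || u > 0x7F) { all_ascii = 0; break; }
    }
    if (all_ascii) result |= SK_ASCII16;
    return result;
}
```

The statistical tests are a different matter, and nothing in this sketch attempts them.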

A bit of trivia -- the code for a flag that used to be documented (IS_TEXT_UNICODE_DBCS_LEADBYTE) is still there (and it is still in the header file, obviously -- the PSDK never breaks people like that). But the flag does not work well, so it is probably just as well that it is not documented any more. I highly recommend not passing it, and ignoring it when it is returned. The flag is not dangerous or anything; it's just not terribly useful for its intended purpose (detecting text that is actually DBCS).

As I mentioned, the API has been around since NT 3.5. It was written by someone else, outside of the NLS team (such as it was in those days). That is fairly cool since there was not as much Unicode awareness/acceptance back then as there is now....

In those heady days when to most developers Unicode was little more than a foreign word that translated to "twice the memory and space required for strings", this function was mostly used as a way to know when to call WideCharToMultiByte to convert strings out of Unicode [1], and there were very few callers even for that not-so-noble purpose. NT 4.0 did not see much of a usage explosion, although Windows 2000 did, where the number of callers throughout the entire Windows source tree just about tripled (to 65 or so callers). Not much movement on the caller side in XP or Server 2003, either. I don't mind this fact much, given why it mostly seemed to be used.

Some time between XP and Server 2003, I did add it to MSLU, as a nice gesture to developers who were frustrated by NT-only APIs [2].

Nevertheless, as the title of this post indicates, I don't like the IsTextUnicode API.

You may think you know why -- go ahead, I'll give you three guesses.

Guess #1: Because I do not own it?

Sorry, that's not it -- but your opinion about my ego is noted. :-)  Strike one!

I'll give you a hint.

Hint #1: Look at the Platform SDK description (I'll add emphasis to enhance the hint):

The IsTextUnicode function determines whether a buffer is likely to contain a form of Unicode text. The function uses various statistical and deterministic methods to make its determination, under the control of flags passed via lpi. When the function returns, the results of such tests are reported via lpi.

Guess #2: Excuse me, I meant because the NLS team does not own it?

Hmm, sorry. I figured that was what you meant the first time. Strike Two!

I'll give you another hint.

Hint #2: There has only been one substantive change made to this API from the time of its creation until Server 2003 shipped -- a const was added to the lpBuffer parameter.

Got it now? Think carefully now, this is your last guess.

Guess #3: Because it considers "CRLF (packed into one WORD)" to be illegal, even though U+0d0a is MALAYALAM LETTER UU?

Ooh, good one -- that looks like a bug in the IS_TEXT_UNICODE_ILLEGAL_CHARS flag detection. Even cooler that you properly figured out the byte reversal issue. Or maybe you did not notice that part, since both the ASCII CRLF packed into a WORD and the character would reverse on little-endian systems to look like 0x0a0d in memory, and if you did not allow for byte reversal you would have been right anyway.
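The byte-order arithmetic here is easy to lose track of, so here is a minimal portable C sketch of it. The helper names are hypothetical and nothing is taken from the API source; it only shows how the same two bytes read as a 16-bit value under each byte order, and what swapping the bytes of a code unit does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interpret two consecutive bytes as one little-endian 16-bit code unit. */
uint16_t read_le16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Interpret the same two bytes as one big-endian 16-bit code unit. */
uint16_t read_be16(const unsigned char *p)
{
    assert(p != NULL);
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Swap the two bytes of a code unit, as any "reverse" test has to. */
uint16_t swap16(uint16_t u)
{
    return (uint16_t)((u << 8) | (u >> 8));
}
```

With the ASCII bytes 0x0D 0x0A (CR then LF) as input, read_le16 yields 0x0A0D and read_be16 yields 0x0D0A; swap16 converts one into the other, just as it turns 0xFEFF into 0xFFFE.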

Given the support for Malayalam described previously in the post Lions and tigers and bearsELKs, Oh my!, this is kind of embarrassing. Or maybe given the fact that the code point has been allocated since Unicode 1.1 (according to DerivedAge.txt) which was released in June of 1993 (according to enumeratedversions.html), this is particularly embarrassing. Though that does make the comment over its use in the API source pretty amusing:

            //  The following is not currently a Unicode character
            //  but is expected to show up accidentally when reading
            //  in ASCII files which use CRLF on a little endian machine.

If you think about it, most UTF-16 big endian files would be from other operating systems and have just a CR or just an LF for their line breaks, even if they were just ASCII. I guess we know why there is no big-endian check for illegal characters? :-)  Makes the whole IS_TEXT_UNICODE_ILLEGAL_CHARS check weird even if it were not totally busted anyway.

For MSLU fans, yes I ported this bug there as well, though not on purpose. Sorry about that, I am not used to reading code points as reversed bytes....

Of course, since I did not know about this problem before, it can't be why I started this post not liking the API. Hell, if not for this imaginary conversation I put together, I still wouldn't know about it. Lucky for everyone that I have displayed this psychological dysfunction in public and thus cannot be further embarrassed by reporting the bug on it, right? Strike 3!

Or we could call it a foul tip, since you found a decade-old bug and all. Ok, it is still Strike 2. :-)

One more hint:

Hint #3: There has been no change to this API's underlying mechanics since at least NT 3.51 (and probably since the original NT 3.5 release).

Any more guesses?

Guess #4: Because it only seems to test the first 256 bytes, no matter how big of a string I pass?

Well, no. I never cared too much for that one, even before I came to Microsoft. But I never really found a file where it made a difference. It would be nice if someone were to change this, but I wouldn't lose any sleep over it -- so it's definitely not a reason to dislike an API. Strike 3!

Ok, I'll just tell you now. Because as an API intended to verify whether a string is following a standard, it wins an award for its obtusitality. Why on earth would the following not have been added, over the years if not in the initial release?

IS_TEXT_UNICODE_UNPAIRED_SURROGATES
   Since it is invalid to have a high surrogate without a low surrogate following it, or a low surrogate not preceded by a high surrogate, why not detect such non-conformant cases?

IS_TEXT_REVERSE_UNICODE_ILLEGAL_CHARS
   It seems only fair to round out the checks for UTF-16BE by including the reverse version of this flag, doesn't it?

IS_TEXT_UNICODE_INVALID_FOR_4_00
   Obviously new flags could be added for each major version -- what better way to check for what is invalid than to check against an official "valid" list?

IS_TEXT_UNICODE_INVALID_SCRIPT_USAGE
   There are all kinds of sequences that would indicate bad usage, from combining marks from one script used in an unrelated script to illegal sequences to text with invalid ordering per the canonical combining classes, and so on.

IS_TEXT_UNICODE_VALID_UTF8_PER_RFC2279
   The initial description of UTF-8 in RFC 2279, which I think is the method used by Notepad [3].

IS_TEXT_UNICODE_VALID_UTF8_PER_UNICODE
   The more strict definition of UTF-8, which disallows surrogate code sequences and other non-shortest forms.

IS_TEXT_UNICODE_VALID_UTF32 / IS_TEXT_UNICODE_VALID_REVERSE_UTF32
   These flags could be combined with some of the older signature detection flags if a UTF-32 LE or BE signature is found.

IS_TEXT_UNICODE_UCS2_32 / IS_TEXT_UNICODE_REVERSE_UCS2_32
   Analogous to the IS_TEXT_UNICODE_ASCII16/IS_TEXT_UNICODE_REVERSE_ASCII16 flags, they would detect UTF-32 that looks like it could all be represented as UTF-16 without needing surrogate pairs.
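To show how little code the first of these wished-for checks would take, here is a minimal portable C sketch of unpaired surrogate detection. The function name is hypothetical, and this is an illustration of the rule rather than anything the API does today.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n UTF-16 code units in s contain a high surrogate
   that is not followed by a low surrogate, or a low surrogate that is
   not preceded by a high surrogate. Returns 0 otherwise. */
int has_unpaired_surrogate(const uint16_t *s, size_t n)
{
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        uint16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            /* High surrogate: the next unit must be a low surrogate. */
            if (i + 1 >= n || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
                return 1;
            i += 2;            /* skip the well-formed pair */
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return 1;          /* low surrogate with no high before it */
        } else {
            i++;               /* ordinary BMP code unit */
        }
    }
    return 0;
}
```

The sketch assumes the code units are already in host byte order; a byte-reversed buffer would need each unit swapped first.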

You get the idea -- Unicode is a dynamic standard, getting more interesting and more complicated all the time, not just for its own sake but in how the platform uses it. How can an API which was written a decade ago and never updated, whose job is to ask "is this flipping buffer full of Unicode text?" ever hope to keep up with such a standard?

 

1 - Notepad being a noteworthy exception to this rule, since it used the API to try to detect when a text file was Unicode without a BOM.

2 - Similar to why BeginUpdateResource, UpdateResource, and EndUpdateResource were added, though I must admit that for the *UpdateResource APIs it was mainly due to the fact that former MSFTie Matt Curland did all the work to make the functions Win9x-friendly. :-)

3 - These are the rules that have been used by MultiByteToWideChar in later years. Ironically, the MultiByteToWideChar API is used by Notepad to convert files that it detected as UTF-8 by using RFC 2279 rules, meaning that any illegal sequences will be dropped without so much as a warning. Better keep those CESU-8 files away from recent enough versions of Notepad!
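To make the stricter UTF-8 definition concrete, here is a minimal portable C sketch of a validator that rejects non-shortest forms, encoded surrogates (which is what CESU-8 data looks like), and anything above U+10FFFF. The function name is hypothetical, and this is not how Notepad or MultiByteToWideChar is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if buf[0..n) is well-formed UTF-8 under the strict rules:
   shortest forms only, no surrogates, nothing above U+10FFFF. */
int utf8_is_strict(const unsigned char *buf, size_t n)
{
    size_t i = 0, k;

    assert(buf != NULL || n == 0);
    while (i < n) {
        unsigned char b = buf[i];
        size_t need;            /* continuation bytes expected */
        unsigned long cp, min;  /* decoded value, smallest legal value */

        if (b < 0x80) { i++; continue; }
        else if ((b & 0xE0) == 0xC0) { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte, or a 5/6-byte lead
                           of the kind RFC 2279 still allowed */

        if (n - i - 1 < need) return 0;            /* truncated sequence */
        for (k = 1; k <= need; k++) {
            unsigned char c = buf[i + k];
            if ((c & 0xC0) != 0x80) return 0;      /* not a continuation */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min) return 0;                     /* non-shortest form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* encoded surrogate */
        if (cp > 0x10FFFF) return 0;                /* beyond Unicode */
        i += need + 1;
    }
    return 1;
}
```

A validator like this reports a problem rather than silently dropping the offending bytes, which is the behavior the footnote above complains about.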

 

This post sponsored by our much maligned little brother "ഊ" (U+0d0a, a.k.a. MALAYALAM LETTER UU)
Who, like the rest of the Malayalam script, felt very supported by XPSP2, only to find out that the IsTextUnicode API did not share that opinion....


# Ken Smith on 30 Jan 2005 6:39 PM:

Any chance of a fix for downlevel platforms? :-)

# Michael Kaplan on 30 Jan 2005 7:12 PM:

I assume you mean the Malayalam thing, right?

I'm not sure -- it has been broken since the API was written and nobody ever noticed it before. That might make it a harder sell, especially given the overall limitations in the API....

# Joel Cairney on 11 Sep 2005 4:19 AM:

Yesterday, Buck Hodges was talking about how TFS Version Control determines a file's encoding: ...
