f y cn rd ths, thn cd tht strps yr vwls my nt bther y s mch....

by Michael S. Kaplan, published on 2011/05/31 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/05/31/10169732.aspx

Relatively common practices can often be dead wrong.

Take the following code snippet for example (names munged to protect the guilty):

BOOL CSillyBackwardsCollection::IsValidAlias(__in LPCWSTR pszAlias)
    while ((*pszAlias != chNullTerminator) && (MAX_NAME_LENGTH > index))
        if ( !iswalnum(*pszAlias) && (*pszAlias != L'_') && (*pszAlias != L'-') )
            // not a valid identifier
            return FALSE;

Okay, now it is obvious that there are few cases where every single code point in Unicode should be treated as valid for a name, identifier, or alias.

And it is great that this code is using Unicode - it has been a long time coming, and it is good to see more and more people doing this by default and automatically.

But iswalnum?

This Microsoft CRT function does not (in any version whatsoever) follow the latest best practices that Unicode suggests in either Unicode Technical Standard #18: Unicode Regular Expressions or Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax. That means this code will reject all kinds of perfectly valid Unicode characters that many of the languages and scripts covered by Unicode need.
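One way to keep a validator like this from being tied to the CRT's idea of "alphanumeric" is to pass the character test in as a parameter. What follows is a minimal C++17 sketch, not the original code. The names `CodePointPredicate` and `IsAsciiAlnum` are hypothetical, as is this version of `IsValidAlias`. The alias is assumed to be already decoded to UTF-32, and rejecting the empty string is an assumption that the excerpt above does not show. A predicate backed by real Unicode data (for example ICU's `u_hasBinaryProperty(c, UCHAR_XID_CONTINUE)`, assuming ICU is available) could be plugged in where the ASCII one is used.

```cpp
#include <functional>
#include <string_view>

// Hypothetical predicate type: decides whether one code point counts as a
// "letter or digit" for alias purposes.
using CodePointPredicate = std::function<bool(char32_t)>;

// Deliberately narrow predicate: ASCII letters and digits only.
// It stands in for any classifier that knows nothing about other scripts.
bool IsAsciiAlnum(char32_t c)
{
    return (c >= U'0' && c <= U'9') ||
           (c >= U'A' && c <= U'Z') ||
           (c >= U'a' && c <= U'z');
}

// Sketch of an alias check whose character rules are supplied by the caller.
// Like the excerpt, it also accepts '_' and '-' anywhere in the alias.
// Assumption: an empty alias is rejected.
bool IsValidAlias(std::u32string_view alias, const CodePointPredicate& isAlnum)
{
    if (alias.empty())
        return false;

    for (char32_t c : alias)
    {
        if (!isAlnum(c) && c != U'_' && c != U'-')
            return false;   // not a valid identifier character
    }
    return true;
}
```

With `IsAsciiAlnum`, an alias such as `U"\u3042\u3044"` (two Hiragana letters) is rejected, while a predicate that follows UAX #31 would accept it. The validation loop stays the same either way; only the classifier changes.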

See the title of this blog? There are languages and scripts that will be impacted the same way in their letters. Don't forget the lessons of blogs like Is Kana 'alphabetic' ? Depends on who you ask....; there are languages that are completely taken out of the running here!

Now a part of me wants to blame this code snippet.

But just a small part.

Because a much bigger part of me sees that the biggest problem is the need to overhaul all of the Microsoft Visual C Runtime CTYPE (character type) functions to be in line with both UTS #18 and UAX #31.

For example, see this table for a much better way to classify Unicode characters. The distance between this and what the CRT uses is huge!

The CRT needs to grow up and embrace Unicode in all of its uses rather than just using the schemes cobbled together for POSIX compatibility back before Unicode was anywhere significant....

Note to folks who own the CRT -- my schedule is up to date if you want to discuss the needs of UTS #18 and UAX #31 further! :-)

Chris Becke on 31 May 2011 7:35 AM:

So, what is a Windows developer actually meant to do? Grokking the entire Unicode BMP is out. But we still have user stories being pushed on us that include in the acceptance criteria that only "valid" usernames / email addresses / etc. are accepted.

It just doesn't seem possible to *know* (as a native English speaker) what unicode codepoints outside of the Latin1 set might not - or should not - be valid in usernames or other short strings that we normally don't expect whitespace, punctuation, or control codes in.

Joshua on 1 Jun 2011 10:06 AM:

It turns out that validating email addresses is woefully hard.

However if you can ignore non-internet email addresses, do what I do:

* there is exactly one at sign

* there are not two consecutive dots on the right side of the at sign
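The two checks above can be sketched in a few lines of C++17. This is only an illustration of the rules as listed, not a full address validator. The function name `PassesBasicEmailCheck` is hypothetical, and the address is assumed to be a narrow string.

```cpp
#include <cstddef>
#include <string_view>

// Returns true when `address` passes the two checks listed above:
// exactly one '@', and no ".." in the part to the right of it.
bool PassesBasicEmailCheck(std::string_view address)
{
    const std::size_t at = address.find('@');
    if (at == std::string_view::npos)
        return false;   // no at sign at all

    if (address.find('@', at + 1) != std::string_view::npos)
        return false;   // more than one at sign

    // Everything after the single '@' is treated as the right side.
    const std::string_view rightSide = address.substr(at + 1);
    return rightSide.find("..") == std::string_view::npos;
}
```

These rules are intentionally loose: an address like `first..last@example.com` passes, because the dot rule only applies to the right of the at sign, and so does an address with nothing after the at sign.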

Michael S. Kaplan on 1 Jun 2011 12:26 PM:

Validating email addresses is easy -- SEND MAIL.

If their mail client can't send/receive the mail then it probably isn't their email address anyway. :-)
