f y cn rd ths, thn cd tht strps yr vwls my nt bther y s mch....

by Michael S. Kaplan, published on 2011/05/31 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/05/31/10169732.aspx

Take the following code snippet for example (names munged to protect the guilty):

BOOL CSillyBackwardsCollection::IsValidAlias(__in LPCWSTR pszAlias)
{
    while ((*pszAlias != chNullTerminator) && (MAX_NAME_LENGTH > index))
    {
        if ( !iswalnum(*pszAlias) && (*pszAlias != L'_') && (*pszAlias != L'-') )
        {
            // not a valid identifier
            return FALSE;

Okay, now it is obvious that there are few cases where every single code point in Unicode should be treated as valid for a name, identifier, or alias.

And it is great that this code is using Unicode - it has been a long time coming and it is good to see more and more people doing this, by default and automatically.

See the title of this blog? There are languages and scripts that will be impacted the same way in their letters. Don't forget the lessons of blogs like Is Kana 'alphabetic' ? Depends on who you ask....; there are languages that are completely taken out of the running here!

Because a much bigger part of me sees that the biggest problem is the need to overhaul all of the Microsoft Visual C Runtime CTYPE (character type) functions to be in line with both UTS #18 and UAX #31.

For example, see this table for a much better way to classify Unicode characters. The distance between this and what the CRT uses is huge!
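To make the title's point concrete, here is a minimal sketch, not anything from the CRT or from the code quoted above. UAX #31 counts combining marks (general categories Mn and Mc) as valid identifier continuation characters, while a check that accepts only letters and digits throws them out. The names ToyClass, ClassifyToy, IsValidLettersOnly and IsValidWithMarks are all hypothetical, and the hand-written table covers only ASCII and a small slice of Devanagari; real code would look the properties up in the Unicode Character Database.

```cpp
#include <string>

// Toy classification: ASCII plus a small slice of Devanagari only.
// This is an illustrative assumption, not a real Unicode property lookup.
enum class ToyClass { Letter, Digit, Mark, Other };

ToyClass ClassifyToy(char32_t cp)
{
    if ((cp >= U'a' && cp <= U'z') || (cp >= U'A' && cp <= U'Z'))
        return ToyClass::Letter;
    if (cp >= U'0' && cp <= U'9')
        return ToyClass::Digit;
    if (cp >= 0x0905 && cp <= 0x0939)   // Devanagari letters (Lo)
        return ToyClass::Letter;
    if (cp >= 0x093E && cp <= 0x094D)   // vowel signs and virama (Mc/Mn)
        return ToyClass::Mark;
    return ToyClass::Other;
}

// Accepts letters and digits only, the way an alphanumeric-only check would.
bool IsValidLettersOnly(const std::u32string& name)
{
    for (char32_t cp : name)
    {
        const ToyClass c = ClassifyToy(cp);
        if (c != ToyClass::Letter && c != ToyClass::Digit)
            return false;
    }
    return true;
}

// Also accepts combining marks, in the spirit of UAX #31 identifier rules.
bool IsValidWithMarks(const std::u32string& name)
{
    for (char32_t cp : name)
    {
        if (ClassifyToy(cp) == ToyClass::Other)
            return false;
    }
    return true;
}
```

With this toy table, the Hindi word for "Hindi" (U+0939 U+093F U+0928 U+094D U+0926 U+0940) fails IsValidLettersOnly because of its two vowel signs and its virama, but passes IsValidWithMarks. Those are exactly the vowels in the title.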

The CRT needs to grow up and embrace Unicode in all of its uses rather than just using the schemes cobbled together for POSIX compatibility back before Unicode was anywhere significant....

Note to folks who own the CRT -- my schedule is up to date if you want to discuss the needs of UTS #18 and UAX #31 further! :-)

So, what is a Windows developer actually meant to do? Grokking the entire Unicode BMP is out. But we still have user stories being pushed on us that include in the acceptance criteria that only "valid" usernames / email addresses / etc. are accepted.

It just doesn't seem possible to *know* (as a native English speaker) which Unicode code points outside of the Latin-1 set might not - or should not - be valid in usernames or other short strings where we normally don't expect whitespace, punctuation, or control codes.

It turns out that validating email addresses is woefully hard.

However, if you can ignore non-internet email addresses, do what I do:

* there is exactly one at sign

* there are not two consecutive dots on the right side of the at sign
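The two rules above can be sketched in a few lines. This is only one reading of them; the function name LooksLikeInternetEmail is hypothetical, and nothing beyond the two listed checks is attempted.

```cpp
#include <string>

// Hypothetical helper that applies only the two rules listed above.
bool LooksLikeInternetEmail(const std::string& address)
{
    // Rule 1: there is exactly one at sign.
    const std::string::size_type at = address.find('@');
    if (at == std::string::npos)
        return false;                       // no at sign at all
    if (address.find('@', at + 1) != std::string::npos)
        return false;                       // more than one at sign

    // Rule 2: no two consecutive dots on the right side of the at sign.
    // The search starts after the at sign, so the left side is not checked.
    return address.find("..", at + 1) == std::string::npos;
}
```

This deliberately accepts a great deal: "a..b@example.com" passes because the doubled dot is on the left side, while "user@example..com" and "user@@example.com" are rejected.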

Validating email addresses is easy -- SEND MAIL.

If their mail client can't send/receive the mail then it probably isn't their email address anyway. :-)
