by Michael S. Kaplan, published on 2007/10/23 10:16 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/10/23/5623838.aspx
This may have happened to you before.
Sometimes I am trying to have a conversation with someone.
And then when I start to say something, they interrupt me because they hear the first thing I said and decide to react to that thing before I have finished the thought.
There was a clear and unmistakable indication that I was not done, but this person who is so sure that he knows better just interrupts anyway.
This person is an insensitive jerk, right?
Well, in this particular case, the person is not, in fact, a person. It is a computer program. :-)
Maybe it is a managed WinForms application, responding to a TextBox control's TextChanged Event that indicates the contents of the control have changed.
Or maybe it is an unmanaged Unicode Win32 program, responding to each WM_CHAR Notification that indicates a character has been added to a control.
Perhaps these programs are trying really hard to validate the content of the control to keep the "bad characters" (whatever they may be) from being entered. Or some equally reasonable process may be taking place. But it honestly does not matter -- because these programs, like the utter boob of an antisocial little weasel I talked about earlier in the post, are being insensitive jerks....
Because if the character one sees is a high surrogate code point (U+d800 to U+dbff), then responding to the event or the notification at that moment is premature -- since a low surrogate code point (U+dc00 to U+dfff) is expected to follow shortly, as the very next event fired/notification received.
Any validation one does at that moment sees only half of a surrogate pair and is liable to fail. Take the program in a bug I looked at recently (one they decided not to fix): some Advanced Settings for Services dialog that was throwing an exception, whose message mistakenly (i.e. dumbly) refers to surrogate code points as surrogate characters!
System.ArgumentException occurred
Message="The surrogate pair (0xD840, 0xD840) is invalid. A high surrogate character (0xD800 - 0xDBFF) must always be paired with a low surrogate character (0xDC00 - 0xDFFF)."
Source="System.Xml"
StackTrace:
at System.Xml.XmlTextEncoder.Write(String text)
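For what it is worth, letting the user finish does not take much. Below is a minimal sketch in Python (not the dialog's actual code, which I have not seen) that models each WM_CHAR or TextChanged arrival as one integer UTF-16 code unit. The `CodeUnitBuffer` class and its `feed` method are hypothetical names made up for this illustration; the idea is only that a high surrogate is held back until its low surrogate shows up.

```python
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF
LOW_FIRST, LOW_LAST = 0xDC00, 0xDFFF


def is_high_surrogate(unit):
    return HIGH_FIRST <= unit <= HIGH_LAST


def is_low_surrogate(unit):
    return LOW_FIRST <= unit <= LOW_LAST


class CodeUnitBuffer:
    """Hypothetical helper: takes UTF-16 code units one per event and
    hands back a code point only once it is complete."""

    def __init__(self):
        # High surrogate still waiting for its partner, if any.
        self.pending_high = None

    def feed(self, unit):
        """Return a complete code point, or None if more input is expected.

        Raises ValueError for a surrogate that cannot be paired.
        """
        if self.pending_high is not None:
            high, self.pending_high = self.pending_high, None
            if not is_low_surrogate(unit):
                raise ValueError(
                    "high surrogate 0x%04X not followed by a low surrogate" % high
                )
            # Standard UTF-16 decoding of a surrogate pair.
            return 0x10000 + ((high - HIGH_FIRST) << 10) + (unit - LOW_FIRST)
        if is_high_surrogate(unit):
            # The user is not done talking yet: remember it, validate nothing.
            self.pending_high = unit
            return None
        if is_low_surrogate(unit):
            raise ValueError(
                "low surrogate 0x%04X without a preceding high surrogate" % unit
            )
        return unit
```

Feeding 0xD840 returns `None` (nothing to validate yet), and feeding 0xDC00 right after returns 0x20000, a complete supplementary code point that validation can now safely look at. Only a second 0xD840 in a row, as in the exception message above, is a real error.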
Now at least with the person one can try to lecture him a bit, and tell him "Look, you wanker! If you'd let me finish what I was saying you'd see you were about to get that low surrogate. Why are you so freaking impatient? Are you like this with your wife? Maybe that's why she's off shagging the pool-boy ya freakin' idiot!"
But with a program throwing an exception, what can one say? All one can do is just console oneself with the knowledge that a program can (in essence) be as much of a wanker as a person, if you give it a chance....
And what the hell kind of program is so insecure about people entering invalid data that it has to be validating as it goes?
This post brought to you by ⧜ (U+29dc, a.k.a. INCOMPLETE INFINITY)
# Serge Wautier on 23 Oct 2007 12:37 PM:
Isn't it some kind of design bug when events are sent for each half code point ?
# Michael S. Kaplan on 23 Oct 2007 12:38 PM:
Nope -- because the documentation is clearly suggesting UTF-16, not Unicode Scalar Values (which might imply UTF-32).
# Centaur on 23 Oct 2007 2:25 PM:
A simpler example of such premature validation is when an application expects an integer and tries to parse the entered string as an integer at every change. Then, the user starts to enter a negative integer, presses '-', gets a validation error “'-' is not a valid integer”. Stupid program, I’ve just started.
Another case is two edits for a range of integers. User enters the low end, starts entering the high end, gets a validation error, “the high end must be greater than the low end”.
Validation should almost always be done only in response to an OK or Apply button, not on every change. Validating on focus loss is not much better.
# Michael S. Kaplan on 23 Oct 2007 2:47 PM:
Excellent point -- of course that one everybody would agree with, while many developers argue against the surrogate pair case (and they did decide not to fix that bug in their upcoming release!).
# Serge Wautier on 23 Oct 2007 5:25 PM:
> because the documentation is clearly suggesting UTF-16
You mean documenting a bug makes it a feature? ;-)
# Jeffrey L. Whitledge on 23 Oct 2007 5:54 PM:
I once had to deal with a masked edit date control that validated on every date-part. So to get from 2008-10-31 to 2008-02-01 you had to change the day of month before you could change the month, since Feb 31 is invalid. It was a stupid pain every time.
Masked edit controls are evil, and I am glad that nobody uses them anymore!
# Mihai on 23 Oct 2007 8:50 PM:
I can only see two acceptable options (nothing to do with i18n, just good user experience):
- in the Ok/Apply (especially if you have to correlate between several fields, for instance if the country is U.S., then state becomes mandatory)
- in the change notification but non-blocking (i.e. make the field background red)
But in general I am against any validation, if you cannot make it smart enough. Things like "zip code should be digits only" (not if I am outside US), or "you can only use letters and digits in the street name" (not if my street name has accents), etc.
# Mike Dimmick on 24 Oct 2007 6:06 AM:
So you're basically saying that WM_CHAR can send you surrogates in the case of typing characters in the supplementary planes, and therefore the naive implementation of appending the typed character to your current string, then updating the display with that string, is flawed? It depends rather on what GDI (or other drawing API) will do with a string containing just a high surrogate followed by an ordinary character.
Should we all be handling WM_UNICHAR instead/as well?
What the documentation does not appear to say is when you get a WM_UNICHAR message. Well, it says that the message is posted to the window when TranslateMessage processes a WM_KEYDOWN message. Spy++ doesn't show this but then the version I have (8.00.50727, came with VS2005) seems a little faulty - maybe it is itself hooking using an ANSI function? - as when I type ỳ (U+1EF3, not on Windows-1252) on my UK extended keyboard layout it's coming through to Spy++ as 0x3F = '?'.
# Mike Dimmick on 24 Oct 2007 6:27 AM:
In the 'excess validation' case I have continuing problems where a business or government service is using Royal Mail's Postcode Address File database (http://www.royalmail.com/portal/rm/jump2?catId=400084&mediaId=400085) to validate addresses and rejecting anything not found in that database. Here's the joke - my valid address is not in PAF. I've asked Royal Mail, and they won't change the record.
I live in a house that was divided into two flats, but the two flats share a single front door. The local council, for tax purposes, considers the two flats as separate properties (and therefore are taxed separately), but Royal Mail consider it one address. The house number was 17, my landlady who did the conversion calls my flat 17A, and the council calls it First Floor Flat, 17. When I moved my bank accounts and credit cards I wasn't aware of the council's designation so used '17A'.
Here comes the fun. I paid for a TV licence (here in the UK, the public service broadcaster, the BBC, is funded mainly through compulsory licences to operate a TV set) using 17A as the address. However, I didn't notice that they 'validated' the address by removing the A. When you buy TV equipment, the retailer is obligated to inform the TV Licensing Authority of the address to ensure it's licensed. Because I bought a new TV online, I had the delivery address the same as the credit card address. TV Licensing did a lookup and found no licence for 17A. I try to modify the licence address online, I can't change it because it 'validates' it back to 17 again. I went through a lot of unanswered emails and phone calls before they finally responded and have accepted that this is a separate address.
Royal Mail won't change the address on PAF because of the shared mailbox (it's just a flap on the shared front door).
I think there is some part of the documentation for PAF which tells users that it shouldn't be considered authoritative, but when did developers read the documentation? <g>
# Michiel on 25 Oct 2007 7:16 AM:
Validation of a string in such cases has three results: illegal, legal string and legal prefix. The specific case is a good example of a legal prefix. It's possible to add a low surrogate to make it a legal string.
With this concept you can show an error message when an invalid string is entered and show the apply button if a legal string is validated. In the legal prefix state you can provide more input.
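The three-state idea can be sketched for the surrogate case as follows. This is only an illustration under assumptions: the text typed so far is modeled as a list of integer UTF-16 code units, and the `classify` function and the three state names are hypothetical, not part of any Windows or .NET API.

```python
ILLEGAL, LEGAL, LEGAL_PREFIX = "illegal", "legal", "legal prefix"


def classify(units):
    """Classify the UTF-16 code units (ints) entered so far."""
    pending_high = False
    for unit in units:
        is_high = 0xD800 <= unit <= 0xDBFF
        is_low = 0xDC00 <= unit <= 0xDFFF
        if pending_high:
            if not is_low:
                # e.g. the (0xD840, 0xD840) pair from the exception message
                return ILLEGAL
            pending_high = False
        elif is_low:
            # Low surrogate with no high surrogate before it.
            return ILLEGAL
        elif is_high:
            pending_high = True
    # A trailing high surrogate can still be completed by more input.
    return LEGAL_PREFIX if pending_high else LEGAL
```

A caller would show an error only for `ILLEGAL`, enable the Apply button only for `LEGAL`, and simply keep waiting for input on `LEGAL_PREFIX`.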
# Maurits [MSFT] on 26 Oct 2007 5:47 PM:
I remember a post on thedailywtf.com a long time ago where someone wrote a slightly insecure user authentication system:
This was the web page:
Username [ _______ ]
Password [ _______ ]
After the user typed in the username, and while the user was typing in the password, the site would use AJAX to see if the password *so far* was correct (that is, whether it was a prefix of the correct password.)
When the first wrong character was typed, the page would display an "Incorrect password!" message.
How convenient.
The code reviewer responded by calling the developer over, entering the developer's username, and beginning the incredibly easy task of figuring out the developer's password, one character at a time...
# Jan Kučera on 12 Nov 2007 7:02 AM:
A fledgling question here.. I just wonder.. how could one enter these surrogate characters? I did not find mentioned range in the Character Map...