by Michael S. Kaplan, published on 2010/11/01 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2010/11/01/10083846.aspx
It was just days ago that Weijiang asked one of those questions that come along every now and again and make the way things work on Microsoft platforms (known to some as The Way Things Work™) seem a little off.
Weijiang's question was:
Hi. I met an unexpected problem when using string.IndexOf. The following code demonstrates the problem:
string r = "\ufffd\ufffd\ufffd\ufffd";
string tar = "a";
Console.WriteLine(tar.IndexOf(r));
Can you guess the output? The output is 0, which is very weird for me. Can someone explain why? Because this has broken my program, which assumes that if a.IndexOf(b) >= 0 then a.Length >= b.Length.
Now this question had built into it the opportunity to both correct the question and shame the technology at the same time. Usually I would never turn such an opportunity down!
But before I really had an opportunity to craft a response, Pavel beat me to the punch with a very well-thought-out reply:
First of all, s.IndexOf(“”) always returns 0 for a non-empty s, for obvious reasons.
Going from there, your assumption about relation between IndexOf and Length is generally incorrect, even for characters other than U+FFFD. For example, U+2060 (word joiner) and U+00AD (soft hyphen) are also treated as “empty”, and thus string “\u2060\u00ad” would be treated the same as “” for the purposes of culture-sensitive comparisons, which is where your result comes from. This is also the case for Equals, CompareTo and other String methods so long as you request culture-sensitive comparison. It just so happens that IndexOf does so by default, while e.g. Equals and Contains do not.
Generally speaking, this is consistent with user expectations, since those characters are not part of the “semantic load” of the string as far as user is concerned. E.g. consider user copy-pasting a string with a soft hyphen (which, unless the line break occurs, is not observable to him) into the search dialog. He’d be quite surprised if your app says that it couldn’t find it, while he can clearly see it in the text!
On the other hand, String.Length is a simple counter of 16-bit code units (i.e. chars), without any special treatment to some over others. If you want to match that with IndexOf, use the overload which takes StringComparison.Ordinal to explicitly request code unit comparison.
The difference in defaults is certainly quite error-provoking, though. So much so that many .NET coding styles require always requesting either culture-sensitive or ordinal comparison explicitly (by using StringComparison or CultureInfo) for methods which permit both, even when the requested mode is the default for that particular method.
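To make Pavel's advice concrete, here is a minimal sketch of what "always ask explicitly" might look like for the strings in the question. The class name and the FindOrdinal helper are made up for illustration and are not part of the original thread. The ordinal results are fixed by definition; the result of the default overload depends on the runtime's collation data, so it is printed without a claimed value.

```csharp
using System;

static class OrdinalSearchDemo
{
    // Hypothetical helper (not from the original thread): always requests
    // ordinal, code-unit-by-code-unit matching instead of relying on the default.
    public static int FindOrdinal(string source, string value)
    {
        return source.IndexOf(value, StringComparison.Ordinal);
    }

    static void Main()
    {
        string r = "\ufffd\ufffd\ufffd\ufffd";
        string tar = "a";

        // Ordinal search: four U+FFFD code units cannot occur inside "a".
        Console.WriteLine(FindOrdinal(tar, r));            // -1

        // Equals and Contains are ordinal by default, as Pavel notes.
        Console.WriteLine("\u2060\u00ad".Equals(""));      // False
        Console.WriteLine("ab".Contains("\u00ad"));        // False

        // The default overload is culture-sensitive. The question reports 0
        // here; other runtimes or collation data may give a different answer.
        Console.WriteLine(tar.IndexOf(r));
    }
}
```

With an ordinal search, a non-negative result does imply a.Length >= b.Length, which is the invariant the original program was counting on.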
Now in addition to his excellent examples, there are other things I would point out.
Like the fact that the original assumption, "if a.IndexOf(b) >= 0 then a.Length >= b.Length", was already broken. All one needs is an example like "\u00e5".IndexOf("\u0061\u030a") to disprove it (that is å, an a with ring above, being searched for an a plus a combining ring above, for those who don't speak Unicode code points).
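For anyone who wants to try that one at home, a small sketch follows. The class and variable names are mine, and the choice of StringComparison.InvariantCulture for the linguistic search is an assumption made for the demo. The lengths and the ordinal result are certain; the culture-sensitive result is expected to be a match because the two strings are canonically equivalent, but it is left unannotated since it depends on the platform's collation support.

```csharp
using System;

static class CombiningRingDemo
{
    static void Main()
    {
        string precomposed = "\u00e5";       // LATIN SMALL LETTER A WITH RING ABOVE
        string decomposed = "\u0061\u030a";  // a followed by COMBINING RING ABOVE

        // Length counts UTF-16 code units, nothing more.
        Console.WriteLine(precomposed.Length);  // 1
        Console.WriteLine(decomposed.Length);   // 2

        // Ordinal search compares code units, so the two-unit string
        // cannot be found inside the one-unit string.
        Console.WriteLine(
            precomposed.IndexOf(decomposed, StringComparison.Ordinal));  // -1

        // Culture-sensitive search treats canonical equivalents as equal,
        // so a match is expected here even though the search string is
        // longer than the string being searched.
        Console.WriteLine(
            precomposed.IndexOf(decomposed, StringComparison.InvariantCulture));
    }
}
```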
In fact, that is why all the extra work went into FindNLSString: it provides not only a return value like the one IndexOf returns but also the length of the found string -- since one cannot make assumptions about the length of the found string based on the length of the string one is trying to find. That extra support in FindNLSString points to a real hole in several of the scenarios where one might want to use culturally sensitive comparisons -- a limitation that still exists in .Net even in the latest version.
Given the lack of this support in earlier languages like Java, it isn't that surprising that .Net hasn't considered it too important to add.
After all, me alone clamoring for something is generally not a good enough reason to do anything, since I clamor for so many things. :-)
But I digress....
Afterwards, I sent Pavel some mail complimenting him on his response, and he replied that he did feel like his answer was perhaps a little incomplete:
I dodged the original question somewhat, since I didn’t explain why U+FFFD specifically is treated as a “noncharacter” (newspeak seems highly appropriate here somehow). And that’s because I don’t know, and don’t have any good guess as to why. Logically speaking, it’s “something we didn’t know how to handle”, so whether it was meaningful to the end user or not, we do not know. It would seem that, by considering it unimportant, we’re making a wild guess there.
A dodge? He may have been harder on himself than he had to be.
If a person is on trial for a crime and the prosecutor's evidence relies on an illegal search, then it may be a technicality to get the case thrown out (and therefore ignore the fact that the crime may have been committed), but I wouldn't consider it a dodge.
I find it kind of cool that Pavel didn't have a good guess as to why the behavior is what it is, since when I originally did the work in FindNLSString my goal was based on my own (naive) notions of intuitive behavior, many of which were different than the behavior eventually supported by the function -- on the basis of the need to match the .Net functionality (the function was added to support synthetic locales in .Net, so that behavior matching was considered pretty crucial from a scenario perspective).
It turns out that most of the actual behavior was chosen to match expectations based on behavior in Java, since it was assumed that lots of the .Net developers may have once been Java developers. A lot like the way DOS behavior was so often CP/M based (something I found helpful when I moved from CP/M on an Osborne 1 to DOS on a PC all those years ago).
For me the chain of evidence runs dry, for two reasons:
The original question, focusing on U+FFFD (REPLACEMENT CHARACTER), hits issues I have often discussed in the past in other blogs:
And it is interesting how all those various connections occurred to me after the original question -- how it all ties together in a bunch of different, not entirely intuitive design decisions that now have far-ranging consequences for function/method behavior and security....
Dom on 2 Nov 2010 12:01 AM:
Java and to some degree .Net are the main choices because they have been consistently pegged as the “safe” choice to go with for mid-level project managers in the corporate world. No one was ever fired for choosing Java or Microsoft.
However, there are many large distributed applications these days that run primarily with technologies like Python, PHP, et al. Even companies like Google and Yahoo are heavily invested in these technologies. Java may be the main choice for enterprise development now, but its days are numbered as the only stalwart option to go with.
Let’s face it, many of these so called “enterprise applications” could easily have been written much faster and with less overhead using technologies like Python, PHP, et al.
j on 2 Nov 2010 7:14 AM:
The main issue is .Net inconsistently and unexpectedly forcing cultural/linguistic notions into places developers don't expect them.
Some framework designers clearly made the decision to include surprising behavior (culture-sensitive comparison, for example) in places that it is known that developers would expect to be ordinal comparison.
Something like IndexOf is not something most developers think of as linguistic. Any more than getting the character at index i is linguistic.
There is clearly a political/religious aspect to this kind of behavior in the .Net Framework, as in teaching developers to be more internationalizable in their programming. But it leads directly to correctness issues (like the ones discussed in this article).
This isn't due to dumb developers not understanding their tools, it's due to tools designers making decisions inconsistently. The String decisions violated the Principle of Least Surprise and are an antipattern IMO.
Only UI-related and user input code should have these cultural notions forced on it.
Michael S. Kaplan on 2 Nov 2010 7:03 PM:
Hey j, note that there is a built-in difference between Equals and Compare that people hate because they figure the results should be the same. Intuitive doesn't always make sense, and it isn't always correct. Many see these kinds of compromises as Solomon suggesting the baby be cut in half, if you know what I mean.
Pavel Minaev on 4 Nov 2010 2:30 PM:
Personally, I think that the choice of going with a mix of culture-sensitive-by-default and culture-insensitive-by-default operations on System.String was a design mistake, but it is what we have today. If you kept an eye on .NET 4 development, there was actually an attempt to get closer to a saner model where the default at least is the same, but it was shot down because it broke too much code, and often in a quiet way.
What I personally take away from this is that, if I were asked to design a new String class for some framework, I'd make all methods explicit about culture-sensitivity: caller has to say what he wants and take responsibility for that choice. No defaults.
Maybe in a few decades we'll even get there. ;)
Michael S. Kaplan on 4 Nov 2010 3:38 PM:
They sort of do that now -- via FxCop, every use without an explicit culture etc. is flagged.