More on the fabled EqualString
by Michael S. Kaplan, published on 2005/11/22 01:01 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2005/11/22/495033.aspx
If you go back all the way to April and look at some of the comments to the What the %#$* is wrong with German sorting? post you will see one of those times that the fact that we have to equate sorting and comparison can confuse people about the results.
And about a week ago in the Hungarian is even more complicated than I thought post, I explained why the lack of a companion function to CompareString that tested for equality kept a particular (uncommon?) feature in Hungarian collation from being supported.
But the thing that has generated the most email was neither the issues related to German nor the issues related to Hungarian; it was my contention that the difference between CompareString and a theoretical EqualString function was orders of magnitiude greater than the difference between RtlCompareUnicodeString and an RtlEqualUnicodeString.
A bunch of people wanted to understand better why the performance difference is so... well... different.
And a bunch of other people want to know why, if the Rtl* difference is so near to inconsequential yet they get the two functions why the NLS side of the world would not get two functions if the performance difference is so bleeding consequential.
Two different, excellent questions in there (and I did retain the most amusing of the wordings of the two questions). To answer the first question, I will explain further the difference between the two scenarios (the binary/ordinal vs. the linguistic).
- RtlCompareUnicodeString has a simple job -- take two strings and return 0 if they are equal, < 0 if string1 is less than string2, and > 0 if it is greater, in a binary/ordinal sense. Of course this is an ordering that will be unsatisfying to any typical user, and I would suspect most of the atypical ones.
- RtlEqualUnicodeString has an even simpler job -- take two strings and return TRUE if they are equal and FALSE if they are not, in a binary/ordinal sense.
Now asking a candidate to write the ideal implementations for each of these functions and then with both of them on the board going through the expected differences in performance between them could make for a mildly interesting Microsoft interview question. But beyond that, the difference is not all that significant (in that post last week I hinted at this and the fastest underlying implementations when I mentioned that "the difference of two numbers not being zero" and "two numbers not being equal" was really not going to be all that significant).
The key issue here is that these functions are looking through the two strings. As soon as they note any difference of any kind, they can return the result. Period.
The world of the less lexicographic and more linguistic CompareString and mythical EqualString functions is a very different one. Because in this world those linguistic weight tables are used.
This is hardest on CompareString where with those weights there are so many different levels of difference, with some levels trumping other levels. Therefore, any difference between the two strings is an interesting bit of trivia unless no greater difference is found later on. Those lower levels include all of the potentially ignorable distinctions in e.g. case, diacritic, width, or kana. Unless a Unicode weight difference is found (which lets CompareString return more quickly), it will have to enumerate the full strings, storing up those lesser differences in case it needs them.
And this is where the fictional EqualString function would have it easier -- because like its binary/ordinal cousins it would be able to return after any difference is found, of any weight. This is (potentially) hugely faster for so many of the potential strings that CompareString has to test for.
This should answer the first question (why the performance difference is so... well... different).
The second question (if the Rtl* difference is so near to inconsequential yet they get the two functions why the NLS side of the world would not get two functions if the performance difference is so bleeding consequential) is a bit more complicated.
There would be essentially two good reasons to consider adding such a function to a future version of Windows:
- The need to have separate comparison/sorting operations to provide appropriate linguistic support;
- The need to have such an absolute linguistic equality operation done frequently enough that the perfomance difference would be significant in implementations.
The linguistic support argument is a tough one for two reasons:
- Existing implementations will by now be depending on the fact that we combine the operations of comparison and sorting, so we could not remove one functionality without breaking callers.
- In most cases, there really is no difference that would confuse anybody -- in fact to date I do not know of any complaint beyond the ones that I directly inspired by pointing out the differences to native speakers of particular languages either to directly answer the question or in this blog! This suggests that the difference is more theoretical in the real world of what people would be sorting/comparing on computers.
The performace question is harder to discount, although it significant to note that good implementations that are building indexes would be using the LCMapString function with LCMAP_SORTKEY flag to build sort keys. Obviously such sort keys are numbers already and allow both comparisons and sorts to be done even more quickly than the imaginary EqualString function could do its work. And it is hard to argue that "Microsoft must add something fast" if the person asking is actually not using something faster already....
In fact, sort keys bring the linguistic comparison down to the level of the binary/ordinal world where RtlCompareUnicodeString and RtlEqualUnicodeString do their work. And in using sort keys the difference between the two different oprerations of comparison and sorting is on the same order of magnitude as between the binary/ordinal functions!
The other problem is that in the small number of scenarios where sort keys are not a practical solution, a linguistic comparison is not appriopriate (e.g. filenames and other symbolic identifiers). That would mean that converting the unreal EqualString into a real NLS API would actually encourage bad usage and bad development practice!
(I will talk more about this issue with symbolic identifiers another day, and also it will cone up in an upcoming presentation at the Internationalization & Unicode Conference on March 6-8 entitled 'Tales of Incorrect String Comparisons'.)
It may be of interest to some that the new in Vista FindNLSString function probably has more in common with the fairy-tale EqualString than it does with CompareString since it too pays no attention to non-ignored differences other than to keep looking for the substring. However, the fact that FindNLSString is also looking for the location within the substring causes it to be slower to answer a different question; the mythological EqualString function would still be faster.
(I will talk more about this issue with the comparison and contrast of FindNLSString, CompareString, and LCMapString with LCMAP_SORTKEY flag another day!)
But irregardless of whether it had been able to justify its future inclusion, the arguments for and against the invented1 EqualString are certainly worthy of a blog post!
1 - Special thanks to the Microsoft Word 2003 Thesaurus for its support here in finding appropriate words to use for the unreal EqualString function!
This post brought to you by "ß" (U+00df, a.k.a. LATIN SMALL LETTER SHARP S)
(A letter that was overheard in the character locker room after reviewing this post muttering "Curses! Folied again...")
# David on Tuesday, November 22, 2005 8:34 AM:
Just had to tweak me, didn't you :P. I'm pondering whether use of "irregardless" qualifies for the 6th circle (Heretics) or the 8th circle (Sowers of Discord/Scandal/Schism).
# Stuart Ballard on Tuesday, November 22, 2005 10:24 AM:
This actually seems like a good place to post something I've wanted to ask your advice about for a while.
See, reading your blog has convinced me that perhaps collation is something I ought to care about and make sure I get right (until recently I basically ignored the issue).
Background: I work on a C# web application using SQL server through an ORM tool I wrote myself. I care enough to do some once-off work to get this right but not to have to put in extra effort at every DB call or string-equality check, and the other programmers at my company will probably not care at all - issues that apply to Turkish are hard to sell when we barely need to localize for "outside the tristate area". So I'd like to do something at the ORM or SQL server layer to arrange that the right collation is just used automatically and C# programmers don't need to think about it.
As far as I can tell SQL server uses case-insensitive English collation rules by default; C# uses ordinal equality. This difference by itself is enough to have introduced bugs on occasion (it's terribly tempting to assume when you get an object "where somefield = @x", obj.SomeField == x will evaluate to true afterwards) but it gets worse.
It seems to me that there are three different operations which need different collation rules. Equality testing wants to be ordinal to match what a C# programmer expects (a C# programmer is already used to explicitly uppercasing or lowercasing if they want case-insensitive equality). Sorting ("order by") is probably for presentation to the user and should use the current culture (English) and probably be case-insensitive, or at least, words that differ by case should sort together. And searching ("foo like '%' + @str + '%'") should definitely use the current culture but case-sensitivity is an open question; it's probably okay to expect the user to do some work like uppercasing or lowercasing the string so the case-sensitivity is explicit, but being case-insensitive might be okay too.
(See, this *is* relevant to your post! Equals versus Compare versus Find...)
The problem is that it's perfectly reasonable to expect that all these things might happen in the same query! "select from whatever where category_identifier = @cat and name like '%' + @search '%' order by name"
Is it possible in any way whatsoever to persuade SQL server that equality comparisons should use ordinal equality but like and orderby should be culture-based?
# orcmid on Tuesday, November 22, 2005 12:09 PM:
Now you got me thinking and that's always a bad idea.
It strikes me that for mythEqualString to work, one can always consider that there are canonical strings (since we're looking at equivalence classes). And context is important, so wanting to know mythCanonicalString is locale-, usage-, and immediate-context- and whatever-else-dependent. Framing the question as one of arriving at canonical forms for equal strings seems like a good way to illuminate the problem and characterize cases.
One could *specify* mythEqualString in terms of CanonicalString (and of course really do speed-ups and short-circuits in the [cached] implementations).
I suppose mythCanonicalString *might* also be suitable for specification mythCompareString too , but maybe not. (Because mythCanonicalString is idempotent, it is specifiable as a test of whether a string is already canonical too.)
But most of all, this suggests to me a very good way to characterize all of the difficulties of multiple forms for character encoding, of the impact of context, and the problems of treating or passing around material either decontextualized or from clashing contexts.
My view is that in general, context is always externally and situationally determined, and the difficulty of handling texts procedurally is thereby exposed more clearly too by looking at canonicalization as context-specific.
So, your mythical comparisons strike me as extremely valuable in narrating the difficulties of text-string comparisons, and illustrating the kinds of differences that arise as one takes on the mythical challenge.
Sounds too easy. Has this already been done for Unicode cases? Does it break down? I'm curious (but not enough to pay the freight to the next Unicode conference).
# Maurits on Tuesday, November 22, 2005 5:55 PM:
OK, why is there no EqualString function?
# Luke Amery on Tuesday, November 22, 2005 6:54 PM:
Stuart, (sorry to butt in Michael), I have been vaguely experimenting with what you are discussing. I have come across casting nvarchar / varchar strings in SQL Server to varbinary to achieve a binary/ordinal equality operation. The results have been what I have expected so far. Can anyone point out the error of my ways?
# Michael S. Kaplan on Wednesday, November 23, 2005 10:24 AM:
Hi David -- it is in the dictionary, so it is a word. Not even a prescriptive grammarian can do more than say that I am just guilty of informal usage, but they could just read what I write and come to that conclusion anyway....
Hi Stuart -- I have posted several times about issues where SQL Server and the CLR do different things and those differences lead to problems, so clearly I agree with you. This would just be more examples of the same. Though I do not agree with all of the changes you are uggesting in regards to equality comparisons needing to be ordinal always -- that still ought to be an explicit decision, for both SQL and the CLR.
Hi Orcmid -- the problem is deciding when Unicode canonical equivalence should enter into equality determinations. Perhaps sometimes, but when? Maybe the answer is to check with the AD folks and similar customers and see if they would ever need that sort of thing (though usually they would just use sort keys anyway).
Hi Maurits -- not sure how I convinced you, since a lot of the article explains why it may be a bad idea? :-)
Hi Luke -- you can also use *_BIN or *_BIN2 collations rather than casting to do that sort of comparison. The questions about SQL vs. CLR and their behavior differences are still there....
# Maurits on Wednesday, November 23, 2005 11:14 AM:
I found your "arguments in favor of" more convincing than your "arguments against".
If a developer wants an EqualString, it's trivial to make one:
boolean EqualString(string a, string b, ...)
{
return CompareString(a, b, ...) == 0;
}
so you're not protecting anybody by leaving it out.
# Stuart Ballard on Wednesday, November 23, 2005 11:36 AM:
Michael, thanks for the responses.
"Though I do not agree with all of the changes you are [s]uggesting in regards to equality comparisons needing to be ordinal always -- that still ought to be an explicit decision, for both SQL and the CLR."
I'm not so much suggesting they should be ordinal always. What I'm doing rather is recognizing the reality that 99% of the time people - including me, and even more so the other developers in my company - aren't going to even *think* about what collation should be used, and there's pretty much nothing I can do about that. Okay, I could make every single get function take a required "collation" parameter for every existing string parameter, but I think the other developers might rebel against that idea :)
I'm not trying to take away the ability to make an explicit decision if you want to, I'm just trying to make the case where you didn't think about it *at all* do what you expected. C# does that - when I do "==" on strings without thinking about it, it just works.
"you can also use *_BIN or *_BIN2 collations rather than casting to do that sort of comparison. The questions about SQL vs. CLR and their behavior differences are still there...."
How would you use the collations to get this query to do what I want it to do? "select from whatever where category_identifier = @cat and name like '%' + @search '%' order by name"
That is, the equality check should be ordinal but the like and the order by should be cultural. Is there a way to specify different collations for different clauses of the expression? And doesn't that impact whether relevant indexes will get hit or not?
As far as EqualsString, I agree with others who think there should be one. As far as I can see your performance argument relies on this:
"And it is hard to argue that "Microsoft must add something fast" if the person asking is actually not using something faster already...."
But creating sort keys for both strings and comparing them *isn't* necessarily faster, for all workloads. It only works if the same set of strings is going to be reused for many comparisons; the cost of creating a sort key by itself is greater than the cost of CompareString!
As far as I'm concerned, though, the compelling reason to have an EqualsString method is more to do with API-friendliness. Which chunk of code is easier to read and conveys the intention better?
if (EqualsString(str1, str2) &&
!EqualsString(str2, str3)) {...}
or
if (CompareString(str1, str2) == 0 &&
CompareString(str1, str2) != 0) {
}
(obviously I'm not a C++ programmer and I'm sure those functions actually take more parameters, even the mythical ones, but the point still holds).
I'm a strong believer in the idea that code should read like what it does. When what it does is test equality, that's what it should say.
# Michael S. Kaplan on Wednesday, November 23, 2005 11:46 AM:
Hi Maurits!
"return CompareString(a, b, ...) == 0;"
Hopefully you meant:
return CompareString(a, b, ...) == CSTR_EQUAL;
there? :-)
The protection comes from not encouraging an unsustainable design pattern....
I am not 100% convinced it is bad, just looking for scenarios when it would be correct to convinced that it belongs on a list of things to do in the future.
# Michael S. Kaplan on Wednesday, November 23, 2005 11:51 AM:
Tyhe real problem, Stuart, is understanding when an equality test is important to do -- and when it should acually be an ordinal/binary operation rather than a linguistic one.
My main concern is that you cannot just add a function to the NLS API set just because it would be cool -- there must be a reason, a valid scenario. Just about all of the cases I have seen, they should have been doing binary comparisons, not lingistic ones. I am still looking for the compelling scenario....
# Maurits on Wednesday, November 23, 2005 12:29 PM:
> return CompareString(a, b, ...) == CSTR_EQUAL;
You've got to be kidding.
OK, my bad for not reading the documentation. (This is Mom told me to always check my code...)
(/me reads
http://tinyurl.com/t4k4 )
Hmm, the return values are missing. :O
(/me reads
http://tinyurl.com/8qljr )
OK, now the return values are there, but as literal 1, 2, and 3, no CSTR_EQUAL. :&
(/me searches for CSTR_EQUAL)
Ah, here we go:
http://tinyurl.com/djoev
Yes, I guess I did mean == CSTR_EQUAL there. Although I might want to check for 0 first to call GetLastError() somehow.
# Stuart Ballard on Wednesday, November 23, 2005 12:44 PM:
I think I'm missing something in your line of argument.
On the one hand "Just about all of the cases I have seen, they should have been doing binary comparisons, not lingistic ones."; on the other "I do not agree with all of the changes you are [s]uggesting in regards to equality comparisons needing to be ordinal always -- that still ought to be an explicit decision".
Are you saying that the answer is almost always ordinal but people should still *think* about the question of what collation to use before doing any string comparison or equality test?
If that is what your saying, I agree that in an ideal world it would be great, but I can't see it happening in practice most of the time :(
# Michael S. Kaplan on Wednesday, November 23, 2005 1:04 PM:
Hi Stuart,
I guess I'm saying that adding the function would make it too easy for linguistic comparisons to be done when they should not be -- so maybe adding it would be a bad idea.
But the behavior of comparisons/equality checking cnnot be changed once it is implemented as that wozuld break all existing clints.
Perhaps an EqualStringOrdinal function is needed here. :-)
# Maurits on Wednesday, November 23, 2005 3:36 PM:
# Michael S. Kaplan on Wednesday, November 23, 2005 3:58 PM:
# Dean Harding on Wednesday, November 23, 2005 7:07 PM:
@Stuart
I believe Michael has described something like this before, but I can't find the original post. Anyway, you can use computed columns to use different collation rules for the same data. And because you can index computed columns (at least, you can index deterministic ones) then you can still do it in such a way that indexes are used correctly. The only thing would be the extra storage required for two indexes over the same column (though you couldn't really expect anything else).
Here's some example T-SQL to show you what I mean:
CREATE TABLE [coll_test]
(
[text_column] NVARCHAR(100) COLLATE Latin1_General_CI_AS,
[text_column_BIN] AS [text_column] COLLATE Latin1_General_BIN
)
CREATE INDEX IX_text_column ON [coll_test] ([text_column])
CREATE INDEX IX_text_column_BIN ON [coll_test] ([text_column_BIN])
Then you could write your query as:
SELECT * FROM [coll_test] WHERE [text_column_BIN] = 'whatever' OR [text_column] LIKE '%search%' ORDER BY [text_column]
Of course, it's a bit more work to get right, but it could be contained completely within your ORM code. Also, as Michael has pointed out before, SQL Server locale-sensitive collations are still different to CLR locale-sensitive collations, so there may be more issues still.
# Michael S. Kaplan on Wednesday, November 23, 2005 7:22 PM:
Please consider a
donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.
referenced by
go to newer or older post, or back to index or month or day