On Compare ≠ Equals, and the Relevance of both

by Michael S. Kaplan, published on 2011/05/10 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/05/10/10162158.aspx

Including many that reminded me that not every reader has been around since the end of 2004 when I started writing here.

And that of that huge set of people who have been here for a time less than those nearly six years, very few of them have gone back and read deeply prior blogs on subjects that they might find interesting.

So the vast majority of people here have no idea what I was taking about in that blog!

So today is some quick review to get people back on that page that I was mentally assuming they were in already....

The whole issue centers around the fundamental difference between two questions:

Now, at a fundamental level, you should NEVER ignore case, diacritics, width, kana, or any other difference in question #1. EVER.

You should always have a deterministic ordering, even in items you would nominally consider to be equal in some way.

Because the truth is they will still be right next to each other and the only difference is that the order will indeed be unchanging and deterministic. \

And then even if the sorting methid you use isn't transitive, the order of the list won't "change" if you re-sort it.

But in #2, you should ignore anything you consider it irrelevant to the question you are asking.

Ignoring increases the size of the list of things you are looking for, and therefore casts a wider net.

Thus (for example) if you are searching for a list of names in a contact list, you might want to ignore a bunch of stuff so that you will get more contacts in your search results.

These two fundamental questions are the basis of the shorthand about saying that Compare ≠ Equals, since Compare relates to the first question while Equals relates to the second.

And so the fact that an "identity" question has different results than an "ordering" one is by design, and despite the fact that it is only intuitive to a few of us doesn't mean it's not true....

Now there is a third word I'm going to add here, one that is very topical since I threw out an example involving search,

Now Exchange and Outlook give an interesting example how to misapply these concepts.

When you search in the Global Address List (GAL), the results come back based on a "starting at the beginning" substring Equals result, and displayed in an order using a Compare operation.

And it starts you at the beginning of the list, which is as good as they can get when it comes to Relevance.

The Equals part is okay, and as long as results use the client side language setting then the Compare side is okay too.

Imagine if Google and Bing brought back the their lists of results alphabetically!

Now Bing has made some waves attacking the Relevance issues -- like figuring out when a search is about buying stuff or seeing shows -- and Google worked to try to catch up because they knew this would be a big thing to miss for users. Both of them now do a good job pointing out directions as an option when it makes sense for that same reason.

Now the GAL problem I mentioned earlier is an interesting one -- what would you do to try to make the list more relevant?

Note that file search has the same problem in a lot of places -- in Windows Explorer, in Visual Studio's "Find in Files" and so on. Trying to figure ways to solve the relevance problems here could actually help people a lot.

When you consider the central importance that Google and Microsoft place on search, doing better at all three operations would do us all some good.

What are ways you imagine any product or company could be doing better than they are now? There is no shortage of potential areas of improvement!

But it is important to alwys remember that they should never be confused with each other....

"Now, at a fundamental level, you should NEVER ignore case, diacritics, width, kana, or any other difference in question #1. EVER.

You should always have a deterministic ordering, even in items you would nominally consider to be equal in some way. "

Yeah, but unless your records are bare strings, it's perfectly reasonable to want to consider column B before you consider the case/diacritics/etc of column A.

And it's possible to use a stable sorting algorithm (what you call 'transitive', though I think the word you were looking for was 'idempotent') deliberately - it makes it easier to interactively sort by multiple columns by clicking on headers, for example.

Sort keys are built such that you are always considering the primary weight before the secondary weights. You do *not* need to ignore anything to get this behavior.

You're missing the point. The problem is when you want to consider the primary weight, then _something unconnected to the string_ [maybe it's the primary weight of some other string, maybe it's a numeric or date field], and then maybe the other weights. So either you have to ignore the other weights, or you have to crack open the sort key.

And you can argue that you should always _eventually_ come back to the other weights, but if one of the other fields you're looking at is a unique key [and 'original index' counts if you have a stable sorting agorithm], there's no need to go beyond that.

In such cases, a better sort key should be built up. Otherwise you are building in your inefficiencies, which makes them no more efficient!

Please consider a donation to keep this archive running, maintained and free of advertising.
Donate €20 or more to receive an offline copy of the whole archive including all images.