Another interview question

by Michael S. Kaplan, published on 2006/02/02 07:14 -08:00, original URI: http://blogs.msdn.com/michkap/archive/2006/02/02/523189.aspx


Earlier today, in response to my post What is equal to some may not be equal to others, Christian Kaiser asked me:

Do you see a faster way to do a string comparison than to use CompareString(), just like NOTEPAD's programmer's idea, but correct?

It is a very interesting question, especially in light of the fact that this issue is being addressed in Windows Vista. So I thought I would open it up as one of those "interview questions" I ask from time to time.

I will temporarily moderate all responses so no one gives it a way too soon. I can't offer the best response a job, but we are still hiring so if someone wanted an interview I could offer to try to start someone through the process of contacting HR, etc.... :-)

Ok, on your mark, get set, go!


# Jeff Parker on Thursday, February 02, 2006 10:46 AM:

Hmm, the only thing that comes to mind right away is creating some dynamic regular expression to do the comparison. You can get really creative on regular expresions.

# Michael S. Kaplan on Thursday, February 02, 2006 11:03 AM:

Ok, sounds like a fair high level description -- though it is unclear how such a thing would be constructed..... more info needed, Jeff!

# Maurits on Thursday, February 02, 2006 11:28 AM:

A faster way than CompareString()?? In general, no, or else CompareString would just *use* the faster way. It sure would be nice if we could see the source to CompareString, to see how what we're trying to beat works.

But if you're going for speed...

Hmmm... given strings s1 and s2....

First, compare the pointers (s1 == s2 or ReferenceEquals) to see if they point to the same string. This is a cheap check, and if you get lucky, that's all you have to do. Some string storage libraries actually go out of their way to share memory for strings with significant overlap, so this may actually work.

Likewise, memcmp the byte arrays pointed to by s1 and s2. This is also relatively cheap of a check (though not as cheap as s1==s2), and if you get lucky, that's all you have to do. If you have a cheap way to check the lengths of the arrays, and they don't match, you can skip this check. (If the lengths are different, the strings could STILL be equal due to ligature unfolding and zero-width-joiners and that sort of thing.)

Beyond that you have to walk the strings and do equivalence lookups and case mappings and ligature expansions (shudder) which is best handled by CompareString(), no?

Worst-case this is slightly slower than CompareString(), but best-case this is MUCH faster.

# Maurits on Thursday, February 02, 2006 11:34 AM:

Hey what happened to my line breaks? OK, future posts will use "\n" to signify a line break.\n\n
Like that.\n

# Michael S. Kaplan on Thursday, February 02, 2006 11:35 AM:

Heh -- I am not really expecting a faster way that CompareString -- but a way that gives the same linguistic functionality while supporfting the find/replace semantic is the goal, I suppose.... :-)

The trouble is of course knowing how to do the comparison itself.... wihout requiring an impossible number of them, right?

# Maurits on Thursday, February 02, 2006 12:08 PM:

Oh, I see... the question is, given a text string t and a search phrase s, find the first occurence of s in t. But in such a way that Unicode normalization is respected.\n
\n
The naive way of course is to do something like
int firstoccurenceof
for (i = 0; i < s.length; i++)\n
{\n
\tfor (j = 0; j < s.length - i; j++)\n
\t{\n
\t\tif (CompareString(substring(s, i, j), t, ...) == CSTR_EQUAL)\n
\t\t{\n
\t\t\treturn i;\n
\t\t}\n
\t}\n
}\n
\n
but this takes O(nn) CompareString calls, which take on average O(nm) time each... resulting in an unacceptable O(mnnn) overall execution time. YUCK.\n
\n
One solution is probably to normalize both s and t first, then do\n
\n
for (i = 0; i < s.length - t.length + 1; i++)\n
{\n
\tif (CompareString(substring(s, i, t.length), t, ...) == CSTR_EQUAL)\n
\t{\n
\t\treturn i;\n
\t}\n
}\n
\n
This takes a much better O(n) CompareString calls, each of which takes O(mn) time for an overal O(mnn) time.\n
\n
Care would need to be taken to map the "i" returned by this function to where-it-belongs-in-the-original-string.\n
\n
Consider a string like "formulæ" - if someone searches this for "e", it should succeed... but what to do with the success?

# Maurits on Thursday, February 02, 2006 12:09 PM:

Oh, reverse the meanings of "s" and "t" in the code samples. s = "formulæ" (normalized to "formulae") and t = "e"

# Maurits on Thursday, February 02, 2006 12:10 PM:

Oh, and the improved algorithm takes a much better O(n) CompareString calls, each of which takes O(mm) time for an overal O(mmn) time.

# Maurits on Thursday, February 02, 2006 12:28 PM:

Someday I'll learn to count. CompareString() takes O(n + m) time, not O(nm) time. (Technically O(max(n, m)) time, but that's the same as O(n + m))

So the first algorithm with O(mn) calls of O(m + n) each takes O(mmn + mnn) time. For small values of m this is O(nn). For m and n about equal this is O(nnn).

The second algorithm with O(n - m) calls of O(m) each takes O(mn - mm). For small values of m this is O(n). For m and n about equal this is O(n)!

Not every day you can go from an O(nnn) algorithm to an O(n) algorithm.

# Michael S. Kaplan on Thursday, February 02, 2006 1:10 PM:

There is also that issue with compressions, and expansions, neither of which can be normalized out....

Though if neither string has expansions (or if you use FoldString to expand the ligatures), and if the local in question has no compressions, then this should do.

(if only those weren't such big ifs!)

# Maurits on Thursday, February 02, 2006 1:28 PM:

What are compressions and expansions? You mean things like æ => ae, and å (one code point) => å (two code points?)

My intention was to expand these out in both strings, prior to doing the search (as part of "normalization", which is perhaps not the best term to have used!)

The details get a little fuzzy though, because I need to be able to reverse-map found indexes to the original string, and it's not so obvious to me how that should best be done.

So in pseudocode:
sExpanded = expand(s)
tExpanded = expand(t)
iExpanded = find(sExpanded, tExpanded) // tExpanded is at position i in sExpanded
i = some_magic_happens_here(iExpanded, sExpanded, s)

find() is implemented as above...
for some_magic_happens_here, I think I would have to walk both s and sExpanded, keeping in sync, until I had gone i steps along sExpanded. Then i is wherever-I-happen-to-be in s.

This will probably going a character at a time in s, and eating larger and larger bits of sExpanded until they're no longer equal (by CompareString's standards.)

When I get to where I need to be in sExpanded, I return where I am in s, and hey-presto, that's the beginning of the selection.

Returning the end of the selection involves similar magic.

# Maurits on Thursday, February 02, 2006 1:37 PM:

All right, what's with the line breaks? I don't see a pattern.

# Michael S. Kaplan on Thursday, February 02, 2006 1:37 PM:

The expansions are indeed like that -- we have a table of them, and it is exposed through FoldString (though locale specific diferences are not, which is a problem for things like Icelandic -- muh harder to keep track of where you are in the string!).

Compressions are also things like ch in Spanish or dzs in Hungarian -- I have called these "sort elements" as they are treated like a single letter for collation purposes. So it is not just the canonical equivalence cases, which canas you mention be normalized out....

# Michael S. Kaplan on Thursday, February 02, 2006 1:38 PM:

The official issue with line breaks is one for which I have determined the pattern.

I will post about it later today. :-)

# Maurits on Thursday, February 02, 2006 1:55 PM:

Oh, I see... let me see if I understand the user expectation in a reasonable fashion.

So a search for "es" should succed in Æsop (though what the selection would look like is a harder question) because Æ expands to AE.

But a search for "hi" should fail in Chile (in a Spanish locale) because Ch is a unit, and cannot be broken up.

So I need to modify the algorithm to check if I'm in the middle of a sort element while stepping through s, and if I am, skip CompareString calls.

Also I need to make sure t doesn't end in the middle of a sort element (in s.) So a search for "uc" should also fail in "chaucha"

Or maybe that's pushing the bounds of user expectations?

# Maurits on Thursday, February 02, 2006 2:10 PM:

No, I can't go along with that. I think user expectation is that "uc" should succeed in "chaucha", and "es" should succeed in Æsop, and I might even go so far as to suggest that "se" should succeed in grüßen.

Simply ignoring sort elements should Just Work (TM), no?

# Maurits on Thursday, February 02, 2006 2:23 PM:

... by which I mean, as long as I'm careful not to step over the "h" in Chile, it shouldn't matter that "Ch" is a sort element as far as CompareString is concerned... because CompareString will treat "Ch" as a sort element equally in both strings, and when I'm checking the "h" part it won't see the "c".

# Michael S. Kaplan on Thursday, February 02, 2006 3:16 PM:

Some of those things do not work as users MIGHT expect *today*, though -- expectations here are unclear.

cf: http://blogs.msdn.com/michkap/archive/2005/03/11/394359.aspx
cf: http://blogs.msdn.com/michkap/archive/2006/01/23/515987.aspx

It makes expected "find" semantics more difficult to contemplate sometimes. :-)

# Maurits on Thursday, February 02, 2006 4:27 PM:

As an interviewee, then, I would propose making the sort element behavior an option...

Match fractional sort elements ON:
isin(uc, chaucha) = true
isin(z, dzs) = true
isin(ch, chaucha) = true
isin(dzs, dzs) = true

Match fractional sort elements OFF:
isin(uc, chaucha) = false
isin(z, dzs) = false
isin(ch, chaucha) = true
isin(dzs, dzs) = true

... and then proceed to demonstrate the implementation of both options.

# Robert on Thursday, February 02, 2006 7:52 PM:

Some time ago I implemented a FindString function which in most cases takes only O(n) time (around 1.5*n CompareString calls most of which return immediately). In the end I didn't use it because I wasn't sure whether the relevant statement in the CompareString documentation can be relied on in a strict sense: "If the two strings are of different lengths, they are compared up to the length of the shortest one. If they are equal to that point, then the return value will indicate that the longer string is greater." More specifically, the function fails for TCHAR strings that are lexically before any of their substrings (from the beginning). For example, when looking for "á" = {U+00E1}, the function will not find the "á" = {U+0061, U+0301} representation, if it sorts before "a" = {U+0061} in the specified locale. In other words: The function assumes that CompareString(lcid, flags, string, m, string, n never returns CSTR_GREATER_THAN if m <= n and the strings agree in the first m TCHARs. LPTSTR FindString( // Return value: pointer to matched substring of text, or null pointer LCID Locale, // locale identifier for CompareString DWORD dwCmpFlags, // CompareString flags LPCTSTR text, // text to be searched int ltext, // number of TCHARs in text, or negative number if null terminated LPCTSTR str, // string to look for int lstr, // number of TCHARs in string, or negative number if null terminated int *lsubstr // pointer to int that receives the number of TCHARs in the substring ) { int i, j; // start/end index of substring being analyzed i = 0; // for i=0..end, compare text.SubString(i, end) with str do { switch (CompareString(Locale, dwCmpFlags, &text[i], ltext < 0 ? ltext : ltext-i, str, lstr)) { case 0: return 0; case CSTR_LESS_THAN: continue; } j = i; // if greater: for j=i..end, compare text.SubString(i, j) with str do { switch (CompareString(Locale, dwCmpFlags, &text[i], j-i, str, lstr)) { case 0: return 0; case CSTR_LESS_THAN: continue; case CSTR_EQUAL: if (lsubstr) *lsubstr = j-i; return (LPTSTR)&text[i]; } // if greater: break to outer loop break; } while (ltext < 0 ? text[j++] : j++ < ltext); } while (ltext < 0 ? text[i++] : i++ < ltext); SetLastError(ERROR_SUCCESS); return 0; }

# Michael S. Kaplan on Thursday, February 02, 2006 8:05 PM:

The docs about string length are actually wrong, sorry!

Your function, slightly more readable:

LPTSTR FindString( // Return value: pointer to matched substring of text, or null pointer
LCID Locale, // locale identifier for CompareString
DWORD dwCmpFlags, // CompareString flags
LPCTSTR text, // text to be searched
int ltext, // number of TCHARs in text, or negative number if null terminated
LPCTSTR str, // string to look for
int lstr, // number of TCHARs in string, or negative number if null terminated
int *lsubstr // pointer to int that receives the number of TCHARs in the substring
)
{
int i, j; // start/end index of substring being analyzed

i = 0;
// for i=0..end, compare text.SubString(i, end) with str
do {
switch (CompareString(Locale, dwCmpFlags, &text[i],
ltext < 0 ? ltext : ltext-i, str, lstr))
{
case 0:
return 0;
case CSTR_LESS_THAN:
continue;
}
j = i;
// if greater: for j=i..end, compare text.SubString(i, j) with str
do {
switch (CompareString(Locale, dwCmpFlags, &text[i],
j-i, str, lstr))
{
case 0:
return 0;
case CSTR_LESS_THAN:
continue;
case CSTR_EQUAL:
if (lsubstr) *lsubstr = j-i;
return (LPTSTR)&text[i];
}
// if greater: break to outer loop
break;
} while (ltext < 0 ? text[j++] : j++ < ltext);
} while (ltext < 0 ? text[i++] : i++ < ltext);

SetLastError(ERROR_SUCCESS);
return 0;
}

# Ben on Thursday, February 02, 2006 8:10 PM:

Robert: You are right that CompareString makes it hard to tell if something is a prefix. That's why they created FindNLSString() for Vista. The FIND_STARTWITH flag implements prefix matching correctly.

# Michael S. Kaplan on Thursday, February 02, 2006 11:49 PM:

Ben, that is very true -- and this is one of those operations that is much easier on the inside of the function than the outside! :-)

# Eric on Friday, February 03, 2006 4:40 AM:

I have to dissapoint you, Maurits. Ch is no longe a valid letter in the Spanish alphabet. It used to be, but 10 years ago (or something like that, I have really bad memory) they decided to split them apart. So if you are looking for "hi" in "Chile", it should find it (with starting index 1).
Same goes to the rest of our (ex) double consonants ("ll" and "rr").

# Michael S. Kaplan on Friday, February 03, 2006 7:58 AM:

Hi Eric --

It is true that unless you use the 'traditional spansh' LCID that ch and the others will not be compressions. But Windows handles this correctly and it is easier to understand the concept with examples.... :-)

# Petr Kadlec on Friday, February 03, 2006 10:19 AM:

Eric+Maurits: If you are looking for another example, we still have Ch as a single digraph in Czech, and I hope we won't dicard it anytime soon. ;-)

# Maurits on Friday, February 03, 2006 11:10 AM:

Is Cz a compression in Czech too?

# Maurits on Friday, February 03, 2006 12:10 PM:

(posting this again as the first attempt met with an error, and was not visible after re-visiting the thread)

To answer my own question... no, it isn't...

Wikipedia has a list of potential digraphs and trigraphs, from which many Hungarian compressions are conspicuously absent:

http://en.wikipedia.org/wiki/Digraph_%28orthography%29
http://en.wikipedia.org/wiki/Trigraph_%28orthography%29

The "schtschj" octograph (German equivalent for Russian щь) is also mentioned, but perhaps with tongue slightly in cheek :)

# Maurits on Monday, February 06, 2006 4:53 PM:

Alright, I'm beginning to see the "don't match partial compressions" side of the argument.

referenced by

2006/07/15 How long is that non-Unicode string?

go to newer or older post, or back to index or month or day