New in Vista Beta 1: FindNLSString (an 'internationalized' strstr)

by Michael S. Kaplan, published on 2005/07/31 19:10 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/07/31/445819.aspx

This is an example of the kind of feature that we in NLS can add to a product -- not as fancy as transparency and the other cool Vista stuff that gets all of the press coverage. But there is a certain class of people, a class with a big overlap with those who read this blog, who may find it quite interesting. I am not going to leak anything that is not available in the legitimate beta that may be in your hands right now (or could be some time soon!), so don't get too excited. But there are some very cool features going into Windows Vista that may be fun for geeks like me, so consider this the first of many such notices. :-)

The strstr function has been a part of the C Runtime for ages. Its simple job is explained in the docs: "Returns a pointer to the first occurrence of a search string in a string."

But of course that function (or its Unicode cousin, wcsstr) would never do any of the interesting fun things that CompareString is so famous for, from ligature equivalences (U+00e6 æ being equal to the letters ae for most locales) to Unicode canonical equivalences and more.
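To make that limitation concrete, here is a minimal sketch in portable C. The helper name ordinal_contains and the two constants are made up for this illustration; the only real library call is wcsstr. It shows that wcsstr compares code unit by code unit, so the two canonically equivalent spellings of "å" never find each other, and the ligature æ never matches "ae".

```c
#include <assert.h>
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical helper for this example: true when `needle` occurs in
   `haystack` code unit for code unit, which is the only kind of match
   wcsstr knows about. */
static bool ordinal_contains(const wchar_t *haystack, const wchar_t *needle)
{
    assert(haystack != NULL && needle != NULL);
    return wcsstr(haystack, needle) != NULL;
}

/* The same letter spelled two canonically equivalent ways. */
static const wchar_t PRECOMPOSED[] = L"\u00E5";  /* LATIN SMALL LETTER A WITH RING ABOVE */
static const wchar_t DECOMPOSED[]  = L"a\u030A"; /* a + COMBINING RING ABOVE */
```

Searching L"sma\u030A" for PRECOMPOSED with ordinal_contains comes back false, and so does searching L"sm\u00E5" for DECOMPOSED, even though a reader sees the same text in both cases. Worse, a search for a plain L"a" happily matches the base letter inside the decomposed form.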

So for ~~Longhorn~~Vista we have added an NLS version of this long-existing functionality -- the FindNLSString function!

The Vista Beta 1 SDK will be available soon, so consider this a marketing preview of the new function. :-)

If you are a developer who has already picked up Beta 1 of Vista off of the MSDN servers, this function is exported from kernel32.dll and gives you all of the functionality of the managed methods off of CompareInfo (i.e. IsPrefix, IsSuffix, IndexOf, and LastIndexOf).

The new FindNLSString has one extra bit of functionality that neither wcsstr nor those managed methods have ever had before -- an OUT param that will allow the caller to find out the length of the string that was found (which may not be the same size as the search string!). Now if you think about what the FindNLSString function may be used for (a good example is someone using the ReplaceText common dialog to replace one string with another), what better way to mess up an operation than not knowing the length of the string that was actually found? I mean, it is all well and good for the Unicode standard to say that U+00e5 (LATIN SMALL LETTER A WITH RING ABOVE) is canonically equivalent to U+0061 U+030a (LATIN SMALL LETTER A + COMBINING RING ABOVE), but if your replace operation starts replacing the wrong number of characters then it will not be a very effective replace operation, now will it? :-)
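The found-length point is easier to see with a toy. The sketch below is not FindNLSString and does not call it; toy_find and a_ring_len are hypothetical helpers written for this illustration, and they know exactly one equivalence (U+00E5 equals U+0061 U+030A). Real linguistic matching covers vastly more than that. The only thing the toy is meant to show is that the matched span in the source string can be longer or shorter than the search string.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: length in wchar_t units of an a-with-ring at s,
   in either spelling, or 0 if s does not start with one. */
static size_t a_ring_len(const wchar_t *s)
{
    if (s[0] == 0x00E5) return 1;                    /* precomposed */
    if (s[0] == 0x0061 && s[1] == 0x030A) return 2;  /* a + combining ring */
    return 0;
}

/* Hypothetical toy search: finds `needle` in `hay`, treating both
   spellings of a-with-ring as equal. Returns the index of the match or
   -1, and stores the length of the matched span of `hay` in *found_len. */
static long toy_find(const wchar_t *hay, const wchar_t *needle,
                     size_t *found_len)
{
    assert(hay != NULL && needle != NULL && found_len != NULL);
    for (size_t start = 0; ; ++start) {
        size_t h = start, n = 0;
        for (;;) {
            if (needle[n] == L'\0') {        /* whole needle consumed */
                *found_len = h - start;
                return (long)start;
            }
            size_t hl = a_ring_len(hay + h);
            size_t nl = a_ring_len(needle + n);
            if (hl != 0 || nl != 0) {
                if (hl == 0 || nl == 0) break;  /* only one side has it */
                h += hl;                        /* step over each side's */
                n += nl;                        /* own spelling */
                continue;
            }
            if (hay[h] == L'\0' || hay[h] != needle[n]) break;
            ++h;
            ++n;
        }
        if (hay[start] == L'\0') return -1;  /* ran out of source string */
    }
}
```

Searching L"bla\u030Ab" for L"\u00E5" with this toy reports a match at index 2 with a found length of 2, although the search string is a single code unit long. A replace that removed wcslen(needle) units would leave the combining ring stranded in the text, which is exactly the kind of mess the OUT param is there to prevent.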

Now one thing that has not been added is separate 'A' and 'W' versions -- there is just one Unicode function, without decoration. The trend of only adding Unicode versions of functions, which started in Windows Server 2003 with IsNLSDefinedString, clearly looks to be the way things will go for NLS. If you are not using Unicode, then you will want to realize that you are going to miss some of the features coming out in products.

As for the name, the obvious choice would have been plain FindString. Well, I did try to do that, and managed to break our private build with the change, since there were so many cases of internal functions in components and utilities and Platform SDK samples named FindString. Maybe if we had reserved the name 15 years ago, we'd be all set. But even if I changed all of those cases, it is obviously something that would be a problem for users as well. Anything that is in our source code once is in customers' code hundreds of times, and FindString was in ours many times over, so I don't even want to think about how many times it would be in customers' code. Calling it FindNLSString keeps that overlap from being a problem....

This post brought to you by "å" (U+00e5, a.k.a. LATIN SMALL LETTER A WITH RING ABOVE)
A letter that is anxiously awaiting Vista Beta 1 so that all of its different normalization forms can finally be considered equal!
