On approaching international programming....

by Michael S. Kaplan, published on 2005/04/14 11:00 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/04/14/408116.aspx

Yesterday, someone named Ben posted the following comment to my post Invariant and Ordinal Redux:

I appreciate your enthusiasm for picking out common programming errors like this, but as a professional programmer, I find a lot of these internationalization parameters confusing.
How do I know if I need to pass the NORM_IGNOREKANATYPE flag to CompareString? How do I know if I want LOCALE_USER_DEFAULT or LOCALE_SYSTEM_DEFAULT, or some other locale?

I simply don't know. Unless I learn Japanese, or know someone who knows Japanese, I'll never know the answer. The trouble is that the APIs feel like they were written by linguists.

Me? I just want to compare filenames, or compare entries in a hash table, or compare usernames, etc. I don't want to even have the choice of ignoring kana types. I just want the CompareStrings API do the *right thing* out of the box. If that is too hard for a single function, then let's write some API sets that are easy to use for common cases. I think this would be a more useful endeavor than to write articles about the nuances between CT_CTYPE3 and CT_CTYPE2.

Sometimes less choice is better. Please please finish that list of do's and don'ts. Please please make a list of "If you want to sort like a dictionary, do this... If you want to put filenames into a hash table, do this..."

My initial reaction was to point out that the APIs were not written by linguists -- but the developers had expert advice from linguists when the functionality was exposed.

My second reaction was a technical one, thinking of which ones I had already covered (like What is my locale? Well, which locale do you mean?, which answers some of that locale question) and which ones might make good future posts (like the care and feeding of NORM_IGNOREKANATYPE) and so on.

My third reaction was to slow down this "developer" in me trying to solve the technical problem and look to what was really being suggested. Unfortunately, Ben's supposition is correct -- the APIs are complicated, and there is too much functionality to try to distill into simple usage without having detailed articles about the nuances. Articles that could be read by the kind of devs who try to solve the problem you indicated.

In a very real and almost biblical sense, one can talk about "CompareString which begat lstrcmp and lstrcmpi in the USER kingdom, and was fruitful and multiplied in the SHELL kingdom and begat StrCmp, StrCmpI, IntlStrEqN, IntlStrEqNI, StrCmpN, StrCmpNI, StrIsIntlEqual, some of whom later begat StrCmpLogicalW. And in that kingdom functions which were not begat from CompareString also flourished like those that used the C rules -- StrCmpC, StrCmpIC, StrCmpNC, and StrCmpNIC. And in the kingdom of .NET the managed brother CompareInfo was also fruitful and begat the five overloads of String.Compare and in Whidbey begat the StringComparer class and the StringComparison enumeration. And CompareInfo.IsPrefix and its overrides begat String.StartsWith. And CompareInfo.IsSuffix and its overrides begat String.EndsWith. And..."

Of course what the SHELL folks and the BCL folks did showed that in attempting to simplify individual functionalities into single APIs, you cause an explosion of simple APIs, and it is then very tough to unravel which one to use.

Topically modifying what Hal Holbrook said on The West Wing (playing the cantankerous Albie Duncan) in the episode Game On:

It's not simple. It's incredibly complicated. I've been doing NLS work for over 10 years and there is no right answer to these questions and software development needs all the words it can get its hands on...

I could tell you when it is ok to use lstrcmp and lstrcmpi and StrCmpLogicalW.  I could not even try to tell you how to navigate the rest of that stuff in the Shell or a lot of the stuff in .NET, even though a lot of it calls right into us. Because to me it is just a decision of whether one wants one's complexities to be horizontal or vertical, with the bonus of the vertical complexity (the NLS kind) being that all of the functionality is there, versus the individual McNugget that the developer was trying to surface in the simplified method, which will always be missing one or more of the functionalities that are possible, despite seeming to me to be a lot more complex....

So while I will give practical advice from time to time (like "use the new OrdinalIgnoreCase type comparisons when trying to imitate the OS, because the OS does not know CompareString from Cholesterol"), the bulk of what I say will be exploring that vertical space of the NLS managed and unmanaged APIs and how best to use them to get the results you want.

Because the problem I have personally with the horizontal space is that when you have to change behavior because the call did not do what you thought it did, the change is more than just passing a new flag; it is often calling a whole new function in a whole new way. Just take the String.StartsWith method as an example -- if you want to do some operations you have to move to CompareInfo.IsPrefix, which has entirely different calling semantics (IsPrefix is called on a CompareInfo object and takes two strings, while StartsWith is an instance method on a string). Or if I want to change the STRINGSORT/WORDSORT behavior of StrCmp, I have to go figure out all the parameters of CompareString now, which if I had done in the first place I would not have been trapped in the Sargasso of SHLWAPI.
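
To make the horizontal/vertical contrast concrete, here is a minimal sketch in Python rather than Win32 or .NET. Every name in it (CompareFlags, compare, str_cmp, str_cmp_i) is hypothetical, and the two flags only imitate the idea of CompareString options; they do not reproduce the behavior of any real NLS flag.

```python
from enum import Flag


class CompareFlags(Flag):
    """Hypothetical options, loosely modeled on the idea of comparison flags."""
    NONE = 0
    IGNORE_CASE = 1
    IGNORE_SPACE = 2


def compare(a: str, b: str, flags: CompareFlags = CompareFlags.NONE) -> int:
    """'Vertical' style: one entry point, behavior selected by flags.

    Returns -1, 0, or 1 from a code point comparison made after the
    requested transformations have been applied to both strings.
    """
    if flags & CompareFlags.IGNORE_CASE:
        a, b = a.casefold(), b.casefold()
    if flags & CompareFlags.IGNORE_SPACE:
        # Drop all whitespace before comparing.
        a, b = "".join(a.split()), "".join(b.split())
    return (a > b) - (a < b)


# 'Horizontal' style: one simplified wrapper per behavior (assumed names).
def str_cmp(a: str, b: str) -> int:
    return compare(a, b)


def str_cmp_i(a: str, b: str) -> int:
    return compare(a, b, CompareFlags.IGNORE_CASE)
```

With the single entry point, going from a case-sensitive comparison to one that ignores both case and whitespace is a matter of passing `CompareFlags.IGNORE_CASE | CompareFlags.IGNORE_SPACE`. With the wrappers, no function covers that combination, so the caller has to drop down to `compare` and learn its flags anyway.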

Hopefully this fits with the model people are expecting here. If not then maybe the Shell or BCL folks will step up and work to provide the uber-conversion charts to know when to call which of the 30 methods that are all designed to simplify the five methods that NLS provides (or in the unmanaged world the 30 functions designed to simplify the one function).

Simplification is just too complex for me. :-)


This post brought to you by "A" (U+0041, LATIN CAPITAL LETTER A)
After Happy Days went off the air and everybody realized the Fonz was short, the letter behind "Aaaaay" had its reputation injured a bit and is looking to expand into new markets, like this blog!

# Barry Kelly on 14 Apr 2005 10:05 AM:

I think that some hard-and-simple rules could be learned and used by people whose jobs relate to a more focussed subset of functionality.

I'm an application server developer: I write the application server. The language / regional settings of the current OS installation mean very little to me. When I compare strings, I want to do so under three fundamental conditions:

1) Comparisons for internal object names. Things that the application developer writing applications using my application server and framework might like to use their own language for, but must act identically on another machine with a different developer / language settings set.

2) A configurable bundle of client-specific settings which is associated with the current client request / response cycle.

3) A configurable bundle of server-specific settings which is associated with a given application living in the application server framework, and determines how the server will interact with things local to it, like the database server (date formats, time-zone settings, and other wonderful friends).

It seems to me that there could be objects (let's call it Context to avoid wars, containing everything from culture / date formatting to language and regional settings) available "out there" with methods like Compare, etc. which encapsulate each one of these programmer use cases, but overridden to suit the correct semantics. Ideally, they would be as opaque to the programmer as possible, and have a simple serialization format and OS/Framework provided editor (think of the ACL editor in the OS) so that application administrators could make most of these decisions, as long as the application programmer is using the correct object in each case.

So, at the start of my application server in my example above, I would load up two Context instances, one for internal operations (defined internally) and another for server operations (defined by the application administrator).

At the start of every request / response cycle, I'd load up an appropriate Context for the location of the client, and attach that to the request.

Some kind of scheme like this, a one-stop shop where you make all the relevant decisions for a given use-case *once*, rather than over and over again, would seem to me to be far preferable to the current specialist approach, where bits and pieces associated with the hypothetical Context need to be cobbled together from various locations.
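
As a rough illustration of that scheme, here is a minimal Python sketch. The Context class, its ignore_case setting, and the context_for_request helper are all hypothetical; a real version would also carry culture, date format and time zone settings, and would be loaded from configuration rather than hard-coded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Hypothetical bundle of comparison settings, decided once per use case."""
    name: str
    ignore_case: bool = False

    def key(self, s: str) -> str:
        # Normalize a string according to this context's settings.
        return s.casefold() if self.ignore_case else s

    def compare(self, a: str, b: str) -> int:
        # Code point comparison of the normalized strings: -1, 0, or 1.
        ka, kb = self.key(a), self.key(b)
        return (ka > kb) - (ka < kb)


# Created once when the server starts.
INTERNAL = Context("internal")                # must act identically on every machine
SERVER = Context("server", ignore_case=True)  # stands in for administrator-defined settings


def context_for_request(client_ignores_case: bool) -> Context:
    # Created at the start of each request / response cycle.
    return Context("client", ignore_case=client_ignores_case)
```

Application code would then call `INTERNAL.compare(a, b)` for object names and the request's own Context for anything shown to the client, without deciding on flags at each call site.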

# Michael S. Kaplan on 14 Apr 2005 10:33 AM:

Hi Barry!

Interesting information, but of course for a highly specialized domain. I see two problems:

1) The actual implementation will not be this simple, as there will either need to be many configuration options to tweak, or you will force many users to go back to the old way of doing stuff. And as a bonus, frustrate them even more than if the APIs had been explained in the first place!

2) If such a framework were added on top of all that is there then it is very similar to what the Shell folks and the .NET folks have done, adding yet another way to do stuff, which will be great for some and useless for others. And yet another set of items on the list that everyone has to deal with, which adds to the complexity. :-(

Though trying to do something here would make sense, it is unclear how to do something that does not get added (like the Shell APIs) as one more way to do things that does not match every scenario.

# Dean Harding on 14 Apr 2005 7:47 PM:

I think it's just a fundamental problem with trying to please an international audience when you can't, in fact, know what they're expecting in the first place. I mean, I don't know all the intricacies of CJK so I don't know whether I should be passing NORM_IGNORECASE or NORM_IGNOREKANATYPE or NORM_IGNORENONSPACE when sorting a list for them, unless someone actually tells me...

I think the best we can hope to achieve as mere mortal developers, is to do the best we can, and be prepared to fix bugs when the international users start putting our assumptions to the test. :)

# Michael S. Kaplan on 14 Apr 2005 8:52 PM:

For *that* problem, there is no answer -- there is no API that will help a person understand every aspect of a language about which they know little.... :-)

# Dean Harding on 14 Apr 2005 9:20 PM:

That's right. Reading over my post before, I don't think I was very coherent (it was early morning here and I hadn't had my coffee yet ;).

I think what I mean is that because we don't know the expected behaviour in every case, then we can't be expected to get it right in the first place. If we can't expect to get it right in the first place, then we have to expect to fix bugs later on.

And I agree with you that having a single "complex" API which does everything is better in this case, because changing the behaviour is a simple case of changing flags.

Also, the documentation is all in one place. It's much better (in my opinion) to say "OK, I want to compare these strings. That means CompareString! Now, which flags do I pass?" And then looking up the documentation of the CompareString function and seeing all the possible flags in one place. Rather than saying "OK, I want to compare these strings. Which function do I use? StrCmp? StrCmpI? StrCmpLogicalW?" And then having to look through the documentation for all those functions and trying to figure out which is more appropriate, and trying to figure out any other functions I may have forgotten about.

However, I think it's also important to have good defaults, so that at least we get things right in "most" cases :) Perhaps even having a list of common scenarios with the expected usage of each function would help greatly.

Hmm, I think that was a bit more coherent...

# Michael S. Kaplan on 14 Apr 2005 9:38 PM:

Definitely so -- very clear. And not just because you agree with me, though of course that never hurts.

At some point a post that really goes through all the flags and their uses might be interesting....

