What would it mean to internationalize StrCmpLogicalW?

by Michael S. Kaplan, published on 2006/10/02 06:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2006/10/02/783066.aspx


If you are a regular reader of this blog then odds are you might be a little sick of hearing about StrCmpLogicalW for a little while. But I thought I'd bring up one more post about it anyway, one that actually brings internationalization back into the forefront....

The question to answer is how would one provide an internationalized version of this functionality in a future version?

It is not a trivial question, even setting aside the point I brought up previously: if it were integrated into CompareString and/or CompareStringEx as a flag (i.e. NORM_SORTDIGITSASNUMBERS), then a solution to add it to sort keys (in LCMapString and/or LCMapStringEx) and to searching algorithms (in FindNLSString and/or FindNLSStringEx) might also be important, since historically these three aspects of collation have been kept in sync from a functionality standpoint whenever it makes sense. And given the needs of actual applications that provide services for the Shell, like Search, this is hardly a trivial point. Though it is one that means it would take more work to sign up for integrating the feature!
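To illustrate what "kept in sync" means here, the following is a minimal Python sketch, not the NLS API: the names `logical_sort_key` and `logical_compare` are hypothetical, and it assumes only ASCII digits count as numbers. The comparison is defined in terms of the sort key, so sorting by key and sorting by comparison cannot disagree.

```python
import re
from functools import cmp_to_key

# Assumption for this sketch: only ASCII digits are treated as numbers.
_DIGIT_RUNS = re.compile(r"([0-9]+)")

def logical_sort_key(text):
    """Hypothetical sort key: digit runs become integers, other text is case-folded.

    re.split with a capturing group alternates text and digit runs, so even
    positions are always strings and odd positions are always integers.
    """
    parts = _DIGIT_RUNS.split(text)
    return tuple(int(p) if i % 2 else p.casefold() for i, p in enumerate(parts))

def logical_compare(a, b):
    """Hypothetical comparison built on the same key; returns -1, 0, or 1."""
    ka, kb = logical_sort_key(a), logical_sort_key(b)
    return (ka > kb) - (ka < kb)

names = ["file10.txt", "file2.txt", "File1.txt"]
by_key = sorted(names, key=logical_sort_key)
by_compare = sorted(names, key=cmp_to_key(logical_compare))
```

Both `by_key` and `by_compare` put `file2.txt` before `file10.txt`. Note that this sketch treats `a01` and `a1` as equal, which a real implementation might or might not want.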

But getting back to the locale sensitivity issues, here are some questions to consider -- ones that would have to be solved before work could really begin on the feature:

  1. Many locales represent languages that have their own number systems. And it does make sense to consider that ٨٦٧٥٣٠٩ (8675309 using the Arabic-Indic digits of the Arabic script) is also a set of digits that could be treated as a number. But should it be treated as a number for all locales, or only locales that would return these digits after a GetLocaleInfo/GetLocaleInfoEx call with the LOCALE_SNATIVEDIGITS LCType value?
  2. If the answer one decides for #1 is that one should limit oneself to the passed-in locale's LOCALE_SNATIVEDIGITS, should the ASCII digits always be handled? If not, then the Shell folks might consider this to be a regression in functionality.
  3. Related to #2 is the question of whether a language like Persian would expect both ٨٦٧٥٣٠٩ and ۸۶۷۵۳۰۹ (the latter is the extended Arabic-Indic digits) or not. Or whether one would use ८६७५३०९ (the Devanagari digits) in all of the locales of India but not of Bangladesh? Or should all of the 'Indic' script digits always be recognized as numbers across all 'Indic' locales? And where do ๘๖๗๕๓๐๙ (that same number in Thai digits) and ໘໖໗໕໓໐໙ (Lao digits) and ៨៦៧៥៣០៩ (Khmer digits) fit into all of this -- when we say 'Indic' do we mean 'Of India' or 'Of South Asia'? Or are Thai and Lao to be grouped together? In other words, if #1 is decided as a limiting instruction does one still need to extend it, under any circumstances?
  4. If the answer one decides for #1 is that all of the digits should be thought of as numbers across all locales (not an unreasonable position given some of the messy points raised in #2 and #3!), would one consider a mix of scripts to be a number? I mean, is ໘६๗5៣೦൯ the number 8675309, or is it seven separate numbers, with the script changes representing a boundary between the pieces considered a number? And before you answer that one, where does that leave a mix of the Arabic-Indic and Extended Arabic-Indic digits? Are there cases that expect such boundaries to exist, when there are others that do not?
  5. Further to #4, what would one do with attempts to use Ethiopic numbers (discussed previously) as if they were digits -- is ፰፮፯፭፫0፱ to be considered a number for the purposes of this functionality? (Note the fact that there is no zero requires the ASCII zero to be used). The same question exists for number systems like the one used in Tamil that have both old and new style usages....
  6. In Vista, all of the weights of the numbers have been altered so as to group all of the like digits (given both the need to address space constraints on weights and that it makes more sense to e.g. group all the 2's together rather than having ASCII 9 < Gurmukhi 2), with a diacritic difference between them. If one chooses to ignore diacritics with the NORM_IGNORENONSPACE flag, should one then fold all of the digits together even if one chose to not treat all digits across all scripts in #1? Would it be expected to change the answer to the mixed script question, making the Telugu ౮౬౭౫౩౦౯ and the Oriya ୮୬୭୫୩୦୯ and the ASCII 8675309 all be treated as equal numbers? This can be especially interesting for FindNLSString and/or FindNLSStringEx and the question of being able to search for numbers.
  7. Would it make sense for the Shell to start using LOCALE_SNATIVEDIGITS for its numbering of copies of files, and should we be pushing for this?
  8. What about hexadecimal digits? Do we care?
  9. Are there additional locale-specific behaviors to capture here like spelling out numbers or ordinals, and would they make sense to capture as well? And if so, how does one compare to first, and how does first compare to 1st, and how do both compare to d'abord or un or 1er, to πρώτα or primero or primeiramente or во первых or in primo luogo? And so on?
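The choice raised in #1 and #4 can be made concrete with a small Python sketch. It is only an illustration under stated assumptions, not how any Windows function works: `numeric_runs` and `digit_family` are hypothetical names, "digit" means Unicode general category Nd as reported by Python's `unicodedata` module, and the code point of a digit set's zero stands in for its "script" (which means Arabic-Indic and Extended Arabic-Indic count as two different families here).

```python
import unicodedata

def digit_family(ch):
    """Code point of the zero in ch's digit set (a rough stand-in for 'script').

    Relies on each set of Nd digits occupying ten consecutive code points.
    """
    return ord(ch) - unicodedata.decimal(ch)

def numeric_runs(text, split_on_family_change=False):
    """Return the integer value of each run of decimal digits in text.

    With split_on_family_change=True, a change of digit set ends the current
    run, which is one possible answer to the mixed-script question.
    """
    runs = []
    value = None   # value of the run in progress, or None if not in a run
    family = None  # digit family of the previous digit in the run
    for ch in text:
        d = unicodedata.decimal(ch, None)  # None for anything that is not Nd
        if d is None:
            if value is not None:
                runs.append(value)
            value, family = None, None
            continue
        fam = digit_family(ch)
        if value is not None and split_on_family_change and fam != family:
            runs.append(value)
            value = None
        value = d if value is None else value * 10 + d
        family = fam
    if value is not None:
        runs.append(value)
    return runs
```

Under the first policy, `numeric_runs("໘६๗5៣೦൯")` yields the single number 8675309; with `split_on_family_change=True` it yields seven one-digit numbers. As far as I know the Ethiopic digits in #5 are not category Nd, so this sketch would not treat them as part of a number at all, which is itself one possible answer to that question.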

Now I am not going to fall back on implementation issues driving all of the decisions here (though I could easily see some point #8 either postponed for its complexity or just dismissed as being out of scope!). Most of the decisions here are simply ones that would have to be made, after which the implementation plan is simply something to be carried out.

But the actual questions for what would be the preferences for an internationalized version of StrCmpLogicalW are real ones that need real answers. How would you expect such a feature to work?

 

This post brought to you by ௫ (U+0beb, a.k.a. TAMIL DIGIT FIVE)


# RubenP on 2 Oct 2006 6:35 PM:

My gut tells me it would be least confusing to group digits by script, so you won't get i, 2, iii, 4, but 2, 4, i, iii. That option also makes several other points a lot easier. And, supporting all locales at once will not easily confuse anyone.

I would not sort spelled out numbers and ordinals according to their value; something like that would at the very least require some AI to be implemented by the shell team, if you think of the English one [1 or a person], Dutch een [1 or indefinite article], or French un(e) [idem].

And skip the hex digits. I mean, which insane people use hexadecimal as a numbering system in real life? Right: programmers. Don't listen to us.

# Michael S. Kaplan on 2 Oct 2006 7:35 PM:

Hmmm.... this does lead to the weirdness of small digits coming after larger ones for Arabic/Extended Arabic and other such cases. Plus, since there is not enough space to give them all unique Unicode Weights, you'd have to make all the numbers within a script be equal -- thus (NORM_SORTDIGITSASNUMBERS | NORM_IGNORENONSPACE) would mean that "9" == "1" and so on.

I have a feeling that treating two different script versions of the number 9 as equal would be easier to convince people of than treating two different numbers in the same script as equal....

# Centaur on 3 Oct 2006 1:21 AM:

Computer sorting does not work like human sorting anyway, and never will. I will demonstrate this on a book example.

Suppose you have six books (and expect a seventh) whose titles all start with the same four words, and the remainders are:

* Chamber of Secrets

* Goblet of Fire

* Half Blood Prince

* Order of Phoenix

* Prisoner of Azkaban

* Sorcerer’s Stone

Suppose you put them on your shelf sorted alphabetically. Then someone comes over and says, WTF! Surely Sorcerer’s Stone goes before Chamber of Secrets before Prisoner of Azkaban and so on.

Now suppose you have a database of your books, and you want it sorted by author, then by series, then sequentially. If the DB scheme designer was smart, he has provided two additional fields, say, SERIES and SEQ_NUMBER, that you can fill in TITLE='Sorcerer’s Stone' as being SERIES='Harry Potter' SEQ_NUMBER=1, and for all books that are not part of any series have SERIES=TITLE, so it all sorts nicely.

However, if the database was just sloppily hacked together in half an hour in Excel (which feels horrible to anybody who’s done one’s share of database design, but is a very frequent abuse of Excel in the land of Non-Geek Computer Users), it will most probably just have columns for Author and Title, and then you have to resort (no pun intended) to adding an artificial sort key in front or in the middle of the title, like “Harry Potter 4: The Goblet of Fire\tRowling, Joanne Kathleen”.

Of course, this is only the beginning. Some series actually have at least two distinct sequences — for an example, google for “chronicles narnia chronological order” or maybe “star wars”. Without an external reference, it is impossible to say whether “The Phantom Menace” should come before or after “A New Hope” (even assuming we have an algorithm of giving no weight to leading articles).

Thus, I posit that any attempts to sort strings of unknown nature in a human-like fashion are doomed to fail.

# Michael S. Kaplan on 3 Oct 2006 3:39 AM:

Hmmm.... well, to some extent I will agree. But I disagree that the concept of an alphabetical order is flawed, or that trying to extend it a bit is a non-intuitive or undesirable idea.

# RubenP on 3 Oct 2006 6:20 PM:

Hmmm. It might just be that we're using different interpretations of the word "script"; to me, basic Arabic and Extended Arabic are the same script. So it's a script with two times number 9. In this particular case, I don't see this as a problem.

On the other hand, when you'd split sorting between these types of numerals, I don't think you've got a problem either, because the off chance of anyone mixing various numeral systems and still expecting everything to be sorted based on numeric value is probably very small. It's not illogical to see things like "these are the files sorted by Latin digits, and these are sorted by Arabic digits". Mixing them would create "mixed" results, for lack of a better pun.

I do see some problems:

- Roman numerals are Latin just like the ASCII digits (although the Unicode roman numerals are not ASCII, so you could think of these as a different script); interpreting i as 1 is tricky, for eg. English, just like vi or cd, so you probably shouldn't.

- Are half width numbers the same as ASCII digits? Tricky one here, but as the digits are basically equivalent, and we're not stuck on proportional fonts per se, I don't see much harm here.

Note that I'm talking about equivalences here, whether the algorithm is implemented through sort weights or some hand tuned code is, frankly, a little beyond my expertise.

# Michael S. Kaplan on 3 Oct 2006 6:26 PM:

We would only be doing this with numbers that are DIGITS in the Unicode sense, so there are no worries with other number types as we would not be handling those the same way....

But it is (like I said) a hard choice to be made -- it is better to fold all the 2's together than 0123456789 together. So we did make a choice based on implementation issues.

# Igor on 4 Oct 2006 6:52 PM:

I am for a radical approach on this one.
Convert all numbers to ASCII -- problem solved :)

# Michael S. Kaplan on 4 Oct 2006 7:06 PM:

This is actually what would happen if you passed both flags here -- all digits will be treated as if they were like the regular ASCII 0-9.

So not that radical. :-)

The question is what to do the rest of the time!
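As a rough illustration of what "treating all digits like ASCII 0-9" could look like, here is a minimal Python sketch. It is not the NLS implementation: `fold_digits_to_ascii` is a hypothetical name, and "digit" is assumed to mean Unicode general category Nd as reported by Python's `unicodedata` module.

```python
import unicodedata

def fold_digits_to_ascii(text):
    """Replace every Unicode decimal digit (category Nd) with its ASCII equivalent.

    Characters that are not decimal digits are passed through unchanged.
    """
    folded = []
    for ch in text:
        d = unicodedata.decimal(ch, None)  # None when ch is not a decimal digit
        folded.append(ch if d is None else str(d))
    return "".join(folded)
```

With this folding, the Telugu ౮౬౭౫౩౦౯, the Oriya ୮୬୭୫୩୦୯, and the ASCII 8675309 all become the same string, so any comparison applied afterwards sees them as equal.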

# benkaras on 5 Oct 2006 4:07 AM:

Have you posed the suggestion to internationalize StrCmpLogical to the shell team?  Maybe they'd be intrigued (and the rest of the world happy when their files sort more intuitively!)

# Michael S. Kaplan on 5 Oct 2006 7:48 AM:

We have -- they'd like us to do it, so they can call us (of course I pointed out how much faster that call would be from inside of CompareString than from outside of it, which may be what really intrigued them!)....


referenced by

2011/08/22 I've got your number! Here's how...

2008/02/23 The triage process gives me hives

2008/02/13 Canada isn't Kannada, ay (ಎ)?

2007/12/22 Incomplete Scenarios: They don't know everything that's up with number sorting
