by Michael S. Kaplan, published on 2011/06/21 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/06/21/10177222.aspx
It is common knowledge to those not guilty of my dear boy type offenses that native, Win32 NLS pre-dates the managed System.Globalization classes by several years.
And it is perhaps not as completely well known but still fairly common knowledge that the principal developer of the former was in fact the initial architect of the latter.
It is also pretty common knowledge that the underlying data of one has always at a minimum been an extension of the other, eventually leading to a common data store and format, and a not insignificant amount of code sharing.
But even knowing all that, it is easy to forget some basic compatibility issues that exist between these two fraternal twins.
Like just yesterday when a tester asked:
Hi,
A component writes a date string using GetDateFormatEx() API with DATE_AUTOLAYOUT:
GetDateFormatEx(
    LOCALE_NAME_INVARIANT,
    DATE_AUTOLAYOUT | DATE_SHORTDATE,
    SysTime,
    NULL,
    dateStr,
    MAX_PATH,
    NULL);
This produces a string containing 'LEFT-TO-RIGHT MARK' (U+200E) characters:
DateTime.Parse() for this string is failing with “String was not recognized as a valid DateTime” because of these extra chars.
What is the correct way to parse the date string (in C#)??
Now GetDateFormat[Ex] has had this support in some form for quite a while:
Value | Meaning |
---|---|
DATE_AUTOLAYOUT | Windows 7 and later: Detect the need for right-to-left and left-to-right reading layout using the locale and calendar information, and add marks accordingly. This value cannot be used with DATE_LTRREADING or DATE_RTLREADING. DATE_AUTOLAYOUT is preferred over DATE_LTRREADING and DATE_RTLREADING because it uses the locales and calendars to determine the correct addition of marks. |
DATE_LTRREADING | Add marks for left-to-right reading layout. This value cannot be used with DATE_RTLREADING. |
DATE_RTLREADING | Add marks for right-to-left reading layout. This value cannot be used with DATE_LTRREADING. |
But the last decade of managed code support in the System.Globalization namespace has been unable to produce any version that will use this functionality to format date strings.
And that same decade has also failed to produce any code designed to parse strings produced via any of these flags.
Note that "support" for parsing would simply mean adding the ability to ignore U+200E and U+200F, but supporting the parsing would certainly lead to a demand for support of the formatting.
Unfortunately, it is very common for tests of many different components to be written in managed code -- which means this question comes up a lot more often than one might expect, given the need to use these flags to get strings that will display properly....
The workaround?
You will need to walk the string, stripping out all instances of the following characters -- the first two in the table below are inserted by GetDateFormat[Ex] when passing any of the three flags above, the rest could be inserted by other, more sophisticated processes (or RtL language localizers doing their job):
Code point | Character name |
---|---|
U+200E | LEFT-TO-RIGHT MARK |
U+200F | RIGHT-TO-LEFT MARK |
U+202A | LEFT-TO-RIGHT EMBEDDING |
U+202B | RIGHT-TO-LEFT EMBEDDING |
U+202C | POP DIRECTIONAL FORMATTING |
U+202D | LEFT-TO-RIGHT OVERRIDE |
U+202E | RIGHT-TO-LEFT OVERRIDE |
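
The walk-and-strip workaround can be sketched roughly as follows. This is only a minimal sketch in C++ (chosen to match the native snippet in the question), not anything that ships in NLS or System.Globalization. The names `is_bidi_control` and `strip_bidi_controls` are made up for illustration, and the same loop should carry over to C# if you would rather clean the string right before handing it to DateTime.Parse().

```cpp
#include <string>

// Hypothetical helper: true for the seven directional formatting
// characters listed in the table above.
bool is_bidi_control(wchar_t ch) {
    return ch == 0x200E || ch == 0x200F ||   // LRM, RLM
           (ch >= 0x202A && ch <= 0x202E);   // LRE, RLE, PDF, LRO, RLO
}

// Hypothetical helper: walks the string once and copies every character
// except the directional formatting characters.
std::wstring strip_bidi_controls(const std::wstring& text) {
    std::wstring result;
    result.reserve(text.size());
    for (wchar_t ch : text) {
        if (!is_bidi_control(ch)) {
            result.push_back(ch);
        }
    }
    return result;
}
```

Run the formatted string through something like this before it reaches the parser. Only the seven characters in the table are removed, so digits and separators pass through untouched.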
Now looking at the reason that we could really go more than a decade without managed code supporting something that native code added so long ago, there are a few (competing?) theories:
In the long run, given that there are such issues, it would be nice if some team just forgot about the politics and tried to solve the problems....
From a Microsoft standpoint, the number of groups that write automation that use managed code is significant enough that I think fixing these problems could be justified solely on a "being a good internal Microsoft citizen" standpoint. But maybe that's just me. :-)