When you assess, you make an...

by Michael S. Kaplan, published on 2008/07/24 03:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2008/07/24/8768909.aspx

The question was deceptively simple:

Is String.SubString complex script safe? Can we use substring on a localized string safely?

Now the shape of the question itself hints at the concern -- by asking about complex scripts, the question about String.Substring is being framed in terms of combining characters, with the question being whether String.Substring is smart enough to know not to chop off dependent/combining characters.

Well, the obvious answer is easy -- it isn't.

String.Substring is based on UTF-16 code units, and as long as things fall within those boundaries, it can/will split them up any way it is asked to, without warning.
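To make the code-unit point concrete, here is a minimal sketch in Python rather than .NET. It does not call String.Substring; it only lists the UTF-16 code units of two strings and shows what is left when a cut lands after the first code unit. The helper names (utf16_units, ends_in_lone_high_surrogate) are made up for this illustration.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-le")  # two bytes per code unit, no BOM
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


def ends_in_lone_high_surrogate(units):
    """True if the last code unit is a high surrogate with nothing after it."""
    return bool(units) and 0xD800 <= units[-1] <= 0xDBFF


ideograph = utf16_units("\U00020000")   # an Extension B ideograph: 2 code units
dhu = utf16_units("\u0c27\u0c41")       # TELUGU LETTER DHA + TELUGU VOWEL SIGN U

# Keep only the first code unit, as a code-unit-based substring of length 1 would.
broken_ideograph = ideograph[:1]   # half of a surrogate pair
clipped_dhu = dhu[:1]              # DHA alone; the vowel sign is gone
```

The first cut leaves an unpaired surrogate, which is not a valid character at all. The second leaves perfectly valid text that simply says something different.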

Once again, there is an easy answer, the one I talk about in posts like:

The StringInfo class has the methods and properties to properly respect the character boundaries the question is talking about.
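As a rough illustration of what respecting those boundaries means, the Python sketch below groups each base character with the combining marks that follow it and truncates on those groups. This is an assumption-laden stand-in, not the algorithm StringInfo uses: it only checks the Unicode general category (anything starting with "M"), and the function names text_elements and safe_prefix are invented for the example.

```python
import unicodedata


def text_elements(text):
    """Split text into base characters, each followed by its combining marks.

    Approximation only: a character whose general category starts with "M"
    (Mn, Mc, Me) is attached to the element before it.
    """
    elements = []
    for ch in text:
        if elements and unicodedata.category(ch).startswith("M"):
            elements[-1] += ch
        else:
            elements.append(ch)
    return elements


def safe_prefix(text, max_elements):
    """Keep at most max_elements text elements, never splitting one."""
    return "".join(text_elements(text)[:max_elements])


dhu = "\u0c27\u0c41"                   # TELUGU LETTER DHA + TELUGU VOWEL SIGN U
kept = safe_prefix(dhu + "abc", 1)     # the whole DHU survives, not just DHA
```

Python strings are indexed by code point, so surrogate pairs never come up here; in .NET the same grouping also has to keep a high and low surrogate together.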

Note of course that this won't do anything with compressions (contractions) used in sorting, but we'll let that one lie for now.

Let's think more closely about the question for a moment:

Can we use substring on a localized string safely?

If we take the word localization as the much more careful and enlightened version of translation, where ideally all of the relevant issues such as language, regional variation, market expectations, and so on are all considered, can any automated process that chops on character boundaries be considered "safe" for the purposes of localization?

For example if I truncate

You must then watch her assessment of the project

at 27 characters to meet some arbitrary buffer requirement, using StringInfo-style safety guarantees so as not to break the user's character boundaries, I will get:

You must then watch her ass

and then you'll be really sorry that the English version isn't localized so that a localizer could take one look and realize some developer was once again being clever rather than being smart!

Now do we feel better if we know not to truncate an Extension B ideograph by splitting a surrogate pair, if we know not to convert ధు (TELUGU DHU) into ధ (TELUGU DHA)? Maybe.

But it is just as possible to make the same kind of mistake as the assess example in other languages.

Which just goes to show that every developer has the power to make an ass out of themselves if they don't consider their options carefully. :-)


This blog brought to you by ధు (U+0c27 U+0c41, aka TELUGU LETTER DHU, aka TELUGU LETTER DHA + TELUGU VOWEL SIGN U)

Tom Ballard on 24 Jul 2008 7:10 AM:

Dude, I have know i'd ee ah what the fsck you're talking about :-)
