año del ano, a.k.a. This sentence has several non-skarklish Spanish flutzpahs....

by Michael S. Kaplan, published on 2007/06/08 09:42 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/06/08/3162451.aspx

One that you probably won't get unless you know a bit of Spanish. But I suppose you could probably ask someone, or go to an online translation site or something....

The alternate title is more of an allusion, but if you don't know the source then knowing that might be as helpful as reading The Waste Land to an illiterate. :-)

I mean sure, it might pay for college. But it might also cause you to fail out of college if it causes widespread confusion about your meaning and your area of study can't afford that kind of confusion.

In the case of the title, confusion slides into a whole new area, where the stripped word causes one to bare more than they can really bear.

So thinking back to the model of that blog post, it suggested finding a strategy to strip diacritics from text only when they did not convey some specific meaning, so that removing them would not corrupt or in some other way injure the meaning of the text.
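For anyone who wants to see the title's problem happen, here is a minimal sketch of the usual naive stripping approach: decompose the text and drop the combining marks. The function name strip_diacritics is made up for illustration, and this is just one common way to strip, not necessarily what any particular product does.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Naive stripping: decompose to NFD, then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# U+00F1 decomposes to "n" + U+0303 COMBINING TILDE, and the tilde is dropped.
print(strip_diacritics("a\u00f1o"))  # prints "ano"
```

The code has no way of knowing that the tilde carried meaning, which is the whole problem.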

In email, people suggested ideas like looking for primary collation distinctions under the theory that collation mirrors thought and that it indicated something about the letter being thought of as having some kind of primary difference (whether it was thought to be related to a form of the base letter in some sense or not).

I admit the idea intrigued me and I even prototyped a few things (I may even write about the experiments here at some point).

But isn't the whole argument a flawed one, since if you want to avoid an effect such as corruption, the only answer is to not strip diacritics?

Corruption is really the least of the concerns one might have, since the operation can sometimes move into actually offending the reader of the text.

I don't mean to get all moralistic on my readers (and believe me I am a bad source if one is looking for a moral center to their life) but I think the best answer is to not get into stripping.

This post brought to you by ñ (U+00F1, a.k.a. LATIN SMALL LETTER N WITH TILDE)

We just recently got a spreadsheet back from a customer containing the Spanish translations for their application's user interface, asking whether there would be any issues displaying any of the text in the UI.

I was able to conclude it would be fine - except that instead of acute accents on the vowels, they'd somehow managed to use the character with hook above (e.g. U+1EE6 Latin Capital Letter U With Hook Above rather than U+00DA Latin Capital Letter U With Acute, or Ú for short). Not sure how they'd managed this - OCR maybe? Wikipedia says U+1EE6 is used in Vietnamese for Hỏi (Dipping-rising) tone.

Looking through the code I wasn't quite sure what would happen to it, given it calls WideCharToMultiByte with dwFlags set to 0. I think it would go through best-fit mapping, simply stripping the accent off.
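A rough way to check strings like these before they reach a single-byte wire format is sketched below. It is not the Windows behavior: Python's codecs are strict and never best-fit, so this only tells you whether a string round-trips exactly. The code page cp1252 is an assumption (the actual code page is not stated above), and the names survives_codepage and hook_to_acute are hypothetical helpers. The repair step assumes every hook above in the text is really a misplaced acute, which is only reasonable for text known to be Spanish.

```python
import unicodedata

def survives_codepage(text: str, codepage: str = "cp1252") -> bool:
    """True if text round-trips through the code page unchanged (strict, no best-fit)."""
    try:
        return text.encode(codepage).decode(codepage) == text
    except UnicodeEncodeError:
        return False

def hook_to_acute(text: str) -> str:
    """Swap U+0309 COMBINING HOOK ABOVE for U+0301 COMBINING ACUTE ACCENT."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace("\u0309", "\u0301"))

print(survives_codepage("\u1ee6"))          # prints False: U+1EE6 is not in cp1252
print(survives_codepage("\u00da"))          # prints True: U+00DA is
print(hook_to_acute("\u1ee6") == "\u00da")  # prints True
```

Flagging the strings that fail the round trip at least makes the loss visible before the conversion quietly decides for you.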

The conversion is necessary as this is our thin-client application server, where the wire protocol is currently byte-oriented for historical reasons (the client was originally written for DOS). In contrast the application plug-in interface is entirely Unicode as it's a COM interface designed for use from VB6. In practice this means that SBCS only is supported on the wire. This is not currently limiting product sales as we have very few overseas customers - most of the localization work is for UK-based customers with a few overseas branches.

Hahahaha.

In Portuguese, "año" and "ano" are, respectively, "ano" and "ânus".

Now imagine the confusion :-)

Feliz ano novo in Portuguese!