I am not a nudist, but I do support stripping when it is appropriate, part 1

by Michael S. Kaplan, published on 2007/09/04 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2007/09/04/4734456.aspx

The title is correct: I am not a nudist.

I think people who are nudists are just fine, and I enjoyed the whole 'ugly naked guy' plotline from Friends; it just isn't really my usual kind of thing.

Of course regular readers of the blog, after reading posts like the following:

might think I have a greater interest in stripping than would typically be expected of a non-nudist.

These posts and variations on the code in them have been picked up by lots of folks and used by a bunch of people both inside and outside of Microsoft now, something that I find frankly a little scary, since the whole notion of stripping diacritics has its problems, as I pointed out in these posts:

The conclusion that was really reached was that we need a way to know when what Unicode might consider to be a diacritic should instead be treated as a more essential part of the letter.

It would have to be language specific, since it is perfectly reasonable for me to look at an Ö (U+00d6, a.k.a. LATIN CAPITAL LETTER O WITH DIAERESIS) and think of it as a regular O (U+004f, a.k.a. LATIN CAPITAL LETTER O) with me simply not cleaning my monitor often enough, while a Finn or a Swede will instead look at it as a separate letter that sorts after Z (U+005a, a.k.a. LATIN CAPITAL LETTER Z).

Obviously in this case I find the stripping appropriate, while friends Anna or Erkki would not due to their native Swedish and Finnish backgrounds, respectively. Stripping should be reserved for the sauna in that case, leaving these poor letters alone!
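To make the requirement concrete, here is a minimal sketch in Python of what "language-specific stripping" could look like. It is only an illustration of the idea, not necessarily how the promised function works: the `KEEP_INTACT` table, the `strip_diacritics` name and the language tags are all made up for this example, and the table is nowhere near complete. It uses one common approach (decompose to Normalization Form D, then drop the nonspacing marks) and simply skips any letter that the given language treats as a letter in its own right.

```python
import unicodedata

# Hypothetical, deliberately incomplete table of letters that a language
# treats as distinct letters rather than "base letter plus diacritic".
KEEP_INTACT = {
    "sv": set("ÅÄÖåäö"),  # Swedish
    "fi": set("ÅÄÖåäö"),  # Finnish
}

def strip_diacritics(text, lang="en"):
    """Remove nonspacing marks, except on letters protected for `lang`."""
    keep = KEEP_INTACT.get(lang, set())
    out = []
    # Compose first so each protected letter is a single code point.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose, then drop combining marks (general category Mn).
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if unicodedata.category(c) != "Mn"))
    return "".join(out)

print(strip_diacritics("Öl", "en"))
print(strip_diacritics("Öl", "sv"))
```

With this sketch the Ö loses its dots for me but keeps them for Anna and Erkki, while a letter that is not in the Swedish or Finnish table (the é in "café", say) is still stripped for everyone.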

Here I am, an employee of Microsoft, describing a requirement.

And then I pointed out that we don't have such a function. At all.

As a state of affairs this is kind of bad, especially since as I have said there are people who went on to use that code.

After the most recent email I was sent from someone inside of Microsoft who had actually had to deal with a bug report related to someone being unhappy with the behavior in a specific language, I figured it might be nice to give a better answer than "here is some code but it is missing this other feature which is necessary sometimes."

And now I will proceed to tease you all who have read this far, as I am not going to supply the code for this function or even tell you how it will work until tomorrow. :-)

Until then....


This post brought to you by Ö (U+00d6, a.k.a. LATIN CAPITAL LETTER O WITH DIAERESIS)

# Richard on 5 Sep 2007 6:16 AM:

And you didn't take the gratuitous opportunity for the rock umlaut beating "nüd̤ïst" (assuming the combining diaeresis below survives http).

And I now see that many typefaces don't seem to handle that d + combining diaeresis below very well... for once Arial Unicode seems to work better than some newer (Corbel) typefaces.

referenced by

2007/09/10 A&P of Sort Keys, part 0 (aka The empty string sorts the same in every language)
