Every rose has its Þ....

by Michael S. Kaplan, published on 2011/10/13 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/10/13/10224477.aspx

I am converting text in an ANSI-encoded document to Unicode using
search and replace in Notepad on Windows Vista SP2. The source
document contains text in the 8-bit CSX+ encoding for Indic
transliteration. A chart of the CSX+ encoding is available at:
http://homepage.ntlworld.com/richard.wordingham/10646/CSX+.htm

In the CSX+ encoding, ASCII 254 'þ' represents the character 'ḥ'
(h-underdot). When I perform a search and replace of ASCII 254 'þ' to
Unicode 'ḥ' U+1E25 LATIN SMALL LETTER H WITH DOT BELOW, the operation
not only converts all instances of 'þ' to 'ḥ', but also all instances
of 'th' to 'ḥ'. For example, the word 'rathaþ' gets caught in the
replace and is changed to 'raḥaḥ'.

This is rather unexpected behavior. I would consider this an error,
but perhaps a very well-intentioned one, given that the phonetic
representation of 'þ' in Old English is in fact /th/.

Is there some internal Windows mechanism that treats ASCII 254 þ as
being canonically equivalent to 'th'? Or, perhaps is the equivalent
rule the dastardly deed of some Old English language enthusiast turned
techie? :)

Best,
Anshuman

This one is kind of my fault, too.... Not directly, since I am not the one who changed Notepad, but I am the one who added the function and then pushed the Notepad folks to use it (fixing the problems I pointed out in blogs like When Notepad's Find doesn't and The fallacy of comparing out of context and so on, more than half a decade ago).

In Vista, the move to FindNLSString brought the full power of Windows collation to the Find/Replace capabilities of Notepad, and it is that collation which decides that 'þ' and 'th' are a match.

Perhaps if one is running on Vista or later, switching to an Icelandic user locale (aka "Standards and Formats") will provide a workaround for the Thorn in your side.

The actual moral, I think, is that while a search function should be permissive, a replace-all function should be rather more persnickety, because it is destructive and not undoable.
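To make that moral concrete, here is a minimal Python sketch contrasting the two behaviors. It is a toy model only, not how FindNLSString or Notepad is implemented: the `TOY_EXPANSIONS` table and both function names are hypothetical, and the single table entry is just the 'þ' and 'th' equivalence reported in the question.

```python
import re

# Hypothetical one-entry table: the only equivalence reported in the question.
TOY_EXPANSIONS = {"þ": "th"}


def permissive_replace_all(text: str, find: str, repl: str) -> str:
    """Toy model of a collation-style replace: the search string also
    matches its expansion, so 'þ' finds both 'þ' and 'th'."""
    alternatives = {find, TOY_EXPANSIONS.get(find, find)}
    # Longest alternative first so an expansion is matched as a whole.
    ordered = sorted(alternatives, key=len, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return re.sub(pattern, repl, text)


def strict_replace_all(text: str, find: str, repl: str) -> str:
    """Exact code point match, which is what str.replace already does."""
    return text.replace(find, repl)


word = "rathaþ"
print(permissive_replace_all(word, "þ", "ḥ"))  # raḥaḥ
print(strict_replace_all(word, "þ", "ḥ"))      # rathaḥ
```

The permissive version reproduces the 'raḥaḥ' result from the question, while the strict version touches only the literal 'þ'.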

^Z is very handy in those circumstances, as is choosing not to save. :-)

The "Replace" button lets you review every change; it is only the "Replace All" button that is nuclear....

WordPad seems not to have this "feature" in its search-and-replace (at least not in Windows 7 using the English (Canada) locale). I just tried that replacement and it worked fine. So if you can use WordPad, for the replacement part if nothing else, that seems like another workaround.

There are tradeoffs, though -- they don't do the Unicode canonical equivalence thing so well in WordPad....

The whole notion of using Notepad, or any other editor, as a charset conversion tool seems a bit suspect. If he has the time and ability, he'd probably be better off writing a custom Encoding class.
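For what that might look like, here is a rough sketch in Python rather than a .NET Encoding class. The names `CSX_PLUS_TO_UNICODE` and `decode_csx_plus` are made up for illustration, and the table holds only the one mapping stated in the question (byte 254 to U+1E25); the remaining entries would have to come from the CSX+ chart. It also assumes that bytes below 0x80 are plain ASCII in CSX+.

```python
# Partial, illustrative table. Only the 0xFE entry comes from the post;
# every other high byte must be filled in from the CSX+ chart.
CSX_PLUS_TO_UNICODE = {
    0xFE: "\u1E25",  # LATIN SMALL LETTER H WITH DOT BELOW
}


def decode_csx_plus(data: bytes) -> str:
    """Decode CSX+ bytes by exact byte value, with no collation involved."""
    out = []
    for byte in data:
        if byte < 0x80:
            # Assumption: the lower half of CSX+ is plain ASCII.
            out.append(chr(byte))
        elif byte in CSX_PLUS_TO_UNICODE:
            out.append(CSX_PLUS_TO_UNICODE[byte])
        else:
            # Fail loudly rather than guess at an unmapped character.
            raise ValueError(f"no mapping for byte 0x{byte:02X}")
    return "".join(out)


print(decode_csx_plus(b"ratha\xfe"))  # rathaḥ
```

Because the lookup is by byte value, 'th' in the source text is never touched, and an incomplete table shows up as an error instead of silent corruption.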

I would hope that "match case" would fix the problem but I'll bet it doesn't.

People keep on trying to use Notepad for what it was in '98: a *dumb* text editor. Back in the day, you could do a search and replace on a binary file and expect it to work most of the time if you didn't change the length of anything.
