by Michael S. Kaplan, published on 2011/10/13 07:01 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2011/10/13/10224477.aspx
Over on the Unicore List, the question was a familiar one:
I am converting text in an ANSI-encoded document to Unicode using
search and replace in Notepad on Windows Vista SP2. The source
document contains text in the 8-bit CSX+ encoding for Indic
transliteration. A chart of the CSX+ encoding is available at:
http://homepage.ntlworld.com/richard.wordingham/10646/CSX+.htm
In the CSX+ encoding, ASCII 254 'þ' represents the character 'ḥ'
(h-underdot). When I perform a search and replace of ASCII 254 'þ' to
Unicode 'ḥ' U+1E25 LATIN SMALL LETTER H WITH DOT BELOW, the operation
not only converts all instances of 'þ' to 'ḥ', but also all instances
of 'th' to 'ḥ'. For example, the word 'rathaþ' gets caught in the
replace and is changed to 'raḥaḥ'.
This is rather unexpected behavior. I would consider this an error,
but perhaps a very well-intentioned one, given that the phonetic
representation of 'þ' in Old English is in fact /th/.
Is there some internal Windows mechanism that treats ASCII 254 þ as
being canonically equivalent to 'th'? Or, perhaps is the equivalent
rule the dastardly deed of some Old English language enthusiast turned
techie? :)
Best,
Anshuman
Yet another "misuse" of Notepad beyond the old UTF-8 BOM? :-)
This one is kind of my fault, too. Not directly, since I am not the one who changed Notepad, but I am the one who added the function and then pushed them to use it in Notepad (fixing the problems I pointed out more than half a decade ago in blogs like "When Notepad's Find doesn't" and "The fallacy of comparing out of context").
In Vista, the move to FindNLSString brought the full power of Windows collation to the Find/Replace capabilities of Notepad.
So all of the various canonically equivalent Unicode forms will always compare as equal, and so on.
This is a good thing.
Unfortunately for Anshuman, it also brings our EXPANSIONS along.
In particular, the following two entries:
0x00de 0x0054 0x0048 ;TH
0x00fe 0x0074 0x0068 ;th
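The effect of those two entries on a find-and-replace can be modeled roughly in Python. This is only a sketch of the idea and not how FindNLSString is implemented: the names EXPANSIONS, expand, find_all and replace_all are hypothetical, and the table holds just the two expansions quoted above.

```python
# Rough model of expansion-aware matching. NOT the Windows implementation;
# all names here are made up for illustration.
EXPANSIONS = {"\u00de": "TH", "\u00fe": "th"}  # the two entries quoted above


def expand(text):
    """Return the expanded text and a map from expanded offsets that fall on
    an original character boundary back to the original index."""
    pieces = []
    bounds = {}
    pos = 0
    for i, ch in enumerate(text):
        bounds[pos] = i
        piece = EXPANSIONS.get(ch, ch)
        pieces.append(piece)
        pos += len(piece)
    bounds[pos] = len(text)
    return "".join(pieces), bounds


def find_all(text, needle):
    """Return (start, end) spans in the original text whose expanded form
    equals the expanded form of the needle."""
    expanded, bounds = expand(text)
    key, _ = expand(needle)
    if not key:
        return []
    spans = []
    start = expanded.find(key)
    while start != -1:
        end = start + len(key)
        if start in bounds and end in bounds:
            # The match lines up with whole characters of the original text.
            spans.append((bounds[start], bounds[end]))
            start = expanded.find(key, end)
        else:
            start = expanded.find(key, start + 1)
    return spans


def replace_all(text, needle, replacement):
    """Replace every expansion-aware match of needle in text."""
    out = []
    last = 0
    for start, end in find_all(text, needle):
        out.append(text[last:start])
        out.append(replacement)
        last = end
    out.append(text[last:])
    return "".join(out)


print(replace_all("ratha\u00fe", "\u00fe", "\u1e25"))  # prints raḥaḥ
```

Under this model both the literal 'þ' and the letter pair 'th' expand to the same two code points, which is why the search finds two matches in 'rathaþ'.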
The only locale whose sorting negates this equivalence is Icelandic.
Perhaps if one is running on Vista or later, switching to an Icelandic user locale (aka "Standards and Formats") will provide a workaround for the Thorn in your side.
Well, this one, at least!
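For a one-off conversion like this, another option is to skip collation entirely and map code point by code point. A minimal sketch in Python, assuming the source bytes are decoded as Latin-1 so that every byte keeps its numeric value; the CSX_TO_UNICODE table and the csx_to_unicode function are hypothetical, and the table contains only the single mapping mentioned in the question, so a real converter would need the rest of the CSX+ chart filled in.

```python
# Hypothetical, deliberately incomplete table: only the mapping quoted in the
# question (byte 254 -> U+1E25). The rest of the CSX+ chart is not filled in.
CSX_TO_UNICODE = {0xFE: "\u1e25"}


def csx_to_unicode(raw: bytes) -> str:
    """Decode 8-bit text and remap it code point by code point."""
    # Latin-1 maps byte N to code point N, so byte 254 becomes U+00FE.
    text = raw.decode("latin-1")
    # str.translate works on individual code points; no collation is involved,
    # so the letter pair 'th' is left alone.
    return text.translate(CSX_TO_UNICODE)


print(csx_to_unicode(b"ratha\xfe"))  # prints rathaḥ
```

Python's own str.replace is likewise an ordinal comparison, so replacing 'þ' with 'ḥ' there touches only the thorn and leaves 'th' as it is.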
John Cowan on 13 Oct 2011 8:17 AM:
The actual moral, I think, is that while a search function should be permissive, a replace-all function should be rather more persnickety, because it is destructive and not undoable.
Michael S. Kaplan on 13 Oct 2011 8:26 AM:
^Z is very handy in those circumstances, as is choosing not to save. :-)
The replace choice lets you review every change; it is only the "replace all" button that is nuclear....
ErikF on 13 Oct 2011 9:39 AM:
WordPad seems not to have this "feature" in its search-and-replace (at least not in Windows 7 using the English (Canada) locale.) I just tried that replacement and it worked fine. So, if you can use WordPad for the replacement part if nothing else, that seems like another workaround.
Michael S. Kaplan on 13 Oct 2011 10:10 AM:
There are tradeoffs, though -- they don't do the Unicode canonical equivalence thing so well in WordPad....
Doug Ewell on 13 Oct 2011 10:24 AM:
The whole notion of using Notepad, or any other editor, as a charset conversion tool seems a bit suspect. If he has the time and ability, he'd probably be better off writing a custom Encoding class.
Joshua on 14 Oct 2011 9:08 AM:
I would hope that "match case" would fix the problem but I'll bet it doesn't.
People keep on trying to use notepad for what it was in '98: a *dumb* text editor. Back in the day, you could do a search&replace on a binary file and expect it to work most of the time if you didn't change the length of anything.
Michael S. Kaplan on 14 Oct 2011 10:37 AM:
It does fix it in Server 2003 and earlier.