Encoding support can be found in the strangest places....

by Michael S. Kaplan, published on 2005/02/20 20:51 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/02/20/377116.aspx

Jason (an SDE/T somewhere in Windows) posed the following question yesterday afternoon:

I am writing a script to test localized Windows QFE package INFs.  These are stored as ANSI files.  I am using a Unicode XML file for storing my comparison strings.  My script must run on localized Windows builds with various default code pages, which affects how the string comes out when I read it from the ANSI file.  The test will run fine for all languages when run on an English box, but when run on, say, a Chinese box the string comes out differently and my test breaks unnecessarily.

Is there a way to always get the same string out of this file?

Obviously this is a problem. VBScript mostly assumes the system default code page, and the File System Object reads a file either in that code page or as UTF-16, depending on the format argument you pass when opening it. In this case, Jason wants code page 1252 to be used every time, whatever the system default happens to be.

Incidentally, as I hinted at in 'How does it detect invalid characters?', the read can sometimes fail outright on other code pages as well (basically any time the file contains values that fall in slots with no mapping). But I did suggest a workaround:

Maybe there is something clever you could do with ADODB.Stream, its LoadFromFile method, its Charset property, and its ReadText method?

I did not intend to be mysterious, I just was not sure here. I vaguely remembered someone suggesting ADODB.Stream in a similar situation and did not want to over-promise a solution. But sure enough, he posted back today that it worked:

I’ve included my code below in case anyone else wants to see how it’s done.  It’s quite simple once you know which object to use.  The code below will load an ANSI file to the same character set displayed in EN notepad, allowing me to copy and paste the characters from Notepad into my Unicode test data file and always read the same thing from the INF no matter what language I am running on:

    ' load using windows-1252 character set
    dim oStr, WorkingBuffer
    set oStr = CreateObject("ADODB.Stream")
    oStr.CharSet = "windows-1252" ' code page of the inf files
    oStr.Open ' the stream must be open before LoadFromFile
    oStr.LoadFromFile FileName
    WorkingBuffer = oStr.ReadText
    oStr.Close
    set oStr = nothing

I never would have thought to use ADODB objects for this :P
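
For anyone who needs more than one character set, the same idea can be wrapped in a small routine. This is only a sketch: the name ReadFileAs and its parameters are hypothetical (they are not from Jason's script), and it assumes ADO is installed so that ADODB.Stream can be created.

```vbscript
' Hypothetical helper: read a whole file as text, decoding its bytes with the
' named character set instead of the system default code page.
Function ReadFileAs(FileName, CharsetName)
    Const adTypeText = 2
    Dim oStr
    Set oStr = CreateObject("ADODB.Stream")
    oStr.Type = adTypeText
    oStr.Charset = CharsetName   ' set while the stream position is still 0
    oStr.Open                    ' must be open before LoadFromFile
    oStr.LoadFromFile FileName
    ReadFileAs = oStr.ReadText   ' no argument: read to the end of the stream
    oStr.Close
    Set oStr = Nothing
End Function
```

A call such as `ReadFileAs("test.inf", "windows-1252")` (the path is a placeholder) should then return the same string on any localized build, because the decoding no longer depends on the machine's default code page.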

After he posted that, I went to find the reference in my archives, something made surprisingly easy by the fact that I pretty much never deal with ADODB streams for any other purpose. It was David Copenhaver in the microsoft.public.vb.general.discussion newsgroup, who posted the following code:

    Private Sub t2UTF(Path As String)
        Dim bob As ADODB.Stream
        Set bob = New ADODB.Stream

        bob.Open 'A stream must be open before LoadFromFile
        bob.LoadFromFile Path 'Loads a File
        bob.Charset = "UTF-8" 'sets the stream encoding to UTF-8
        bob.SaveToFile Path, adSaveCreateOverWrite 'Save File
        bob.Close
        Set bob = Nothing
    End Sub
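
The routine above names only the target character set. When the source encoding also has to be pinned (as in Jason's case), one possible variant is to decode with one stream and re-encode with a second. This is a sketch under my own assumptions, not David's code: ConvertCharset and its parameter names are hypothetical, and it uses late binding so that it also runs from VBScript.

```vbscript
' Hypothetical helper: re-encode a text file from one named charset to another.
Sub ConvertCharset(SourcePath, SourceCharset, TargetPath, TargetCharset)
    Const adTypeText = 2
    Const adSaveCreateOverWrite = 2
    Dim inStr, outStr

    ' Decode the source bytes using the source charset.
    Set inStr = CreateObject("ADODB.Stream")
    inStr.Type = adTypeText
    inStr.Charset = SourceCharset
    inStr.Open
    inStr.LoadFromFile SourcePath

    ' Encode the resulting text using the target charset.
    Set outStr = CreateObject("ADODB.Stream")
    outStr.Type = adTypeText
    outStr.Charset = TargetCharset
    outStr.Open
    outStr.WriteText inStr.ReadText
    outStr.SaveToFile TargetPath, adSaveCreateOverWrite

    outStr.Close
    inStr.Close
End Sub
```

For example, `ConvertCharset "in.inf", "windows-1252", "out.txt", "utf-8"` (both paths are placeholders). Expect the UTF-8 output to start with a byte order mark, which matters if the consumer of the file does not want one.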

So, sorry to make you figure it out yourself, Jason (I should have looked in the archives first!).

The obvious question at this point is why the Charset property takes a string when what is really being dealt with is code pages. I honestly have no clue, but I'll give them the benefit of the doubt and assume it is to make the object easier to use with HTML files and their charset property.
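
The names and the code pages are at least connected: the ADO documentation points to the subkeys of HKEY_CLASSES_ROOT\MIME\Database\Charset as the list of names Charset understands. As a rough sketch, the number behind a name can be looked up from script. The helper name CodePageForCharset is hypothetical, and the code assumes that the WMI StdRegProv provider is available and that the subkey carries a DWORD value named InternetEncoding; that value name is what typical installations show, not something I can point to a guarantee for, and alias entries may carry only an AliasForCharset string instead.

```vbscript
' Hypothetical helper: return the numeric code page registered for a charset
' name, or -1 if the name (or the assumed value) is not found.
' Assumption: the subkey has a DWORD value named "InternetEncoding".
Function CodePageForCharset(CharsetName)
    Const HKEY_CLASSES_ROOT = &H80000000
    Dim reg, codePage, rc
    CodePageForCharset = -1
    Set reg = GetObject("winmgmts:\\.\root\default:StdRegProv")
    rc = reg.GetDWORDValue(HKEY_CLASSES_ROOT, _
                           "MIME\Database\Charset\" & CharsetName, _
                           "InternetEncoding", codePage)
    If rc = 0 And Not IsNull(codePage) Then CodePageForCharset = codePage
End Function
```

Calling `CodePageForCharset("windows-1252")` from a script gives a quick way to see which code page a given Charset string is registered against on a particular machine.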

It is amazing where you can find support for international features....


This post brought to you by "𐐐" (U+10410, a.k.a. DESERET CAPITAL LETTER H)

# Dean Harding on 20 Feb 2005 7:51 PM:

The ADODB objects are also used by CDO for email (something I'm all too familiar with at the moment!), so having a string charset is good because you can pass the MIME strings in and it works out what you're actually talking about.

# Michael Kaplan on 20 Feb 2005 7:54 PM:

In that case, I am totally ok with this -- a software component that tries to please a specific consumer of its functionality? I am definitely a fan!
