Working hard to detect code pages

by Michael S. Kaplan, published on 2005/09/11 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/11/463437.aspx


Yesterday, Buck Hodges was talking about how TFS Version Control determines a file's encoding:

TFS Version Control will automatically detect a file's encoding based upon the following.

The only exception to the foregoing is PDF files.  Those are always detected as binary because they are so common and can be all text in the first 1 kilobyte with binary streams later in the file.  The detection is based on the signature, "%PDF-", that always appears at the start of a PDF file.
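To make the PDF exception concrete, here is a minimal sketch of a signature check. The function names and the "undecided" return value are my own inventions for illustration, not anything from TFS; the only detail taken from the quote is the "%PDF-" signature at the start of the file.

```python
PDF_SIGNATURE = b"%PDF-"  # the signature quoted above


def is_pdf(head: bytes) -> bool:
    """Return True if the leading bytes of a file carry the PDF signature."""
    return head.startswith(PDF_SIGNATURE)


def classify(head: bytes) -> str:
    """Hypothetical classifier: only the PDF exception is modeled here."""
    # A PDF is forced to binary no matter how text-like its first
    # kilobyte looks, since binary streams may appear later in the file.
    if is_pdf(head):
        return "binary"
    # Everything else would fall through to the usual detection rules,
    # which this sketch does not attempt to reproduce.
    return "undecided"
```

Checking the signature first means the text-versus-binary heuristics never get a chance to misjudge a PDF whose first kilobyte happens to be all text.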

It would be so much cooler to try a somewhat more careful algorithm, such as the one that Jet/Access uses in text import and link (it uses MLang, as I pointed out here). Or at least something more primitive, like the one the compiler uses that I discussed back in January -- especially since the list of unprintable characters does not include the code points that are known to be invalid in each code page (which the compiler folks do check for).
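The "invalid in each code page" idea can be sketched roughly as follows. This is neither MLang nor the compiler's actual algorithm; it is a stand-in that leans on Python's own codec tables, and the function name and candidate list are assumptions made up for the example.

```python
from typing import Iterable, List


def plausible_code_pages(data: bytes, candidates: Iterable[str]) -> List[str]:
    """Return the candidate code pages in which every byte of data is defined.

    A code page is rejected as soon as strict decoding hits a byte that
    has no mapping in that code page.
    """
    survivors = []
    for name in candidates:
        try:
            data.decode(name, errors="strict")
        except UnicodeDecodeError:
            continue  # at least one byte is invalid in this code page
        survivors.append(name)
    return survivors
```

For example, byte 0x81 has no mapping in Python's cp1252 table, so `plausible_code_pages(b"caf\x81", ["cp1252", "latin-1"])` leaves only `"latin-1"`. Surviving this filter does not prove a code page is the right one; it only rules out the ones that cannot be.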

Ah well, it is no big deal. We ought to be using Unicode anyway, so what would really make me happy is if a future version of TFS would use something like IsTextUnicode (imperfect as it may be!) to detect the times that it is Unicode without the BOM, perhaps supplemented by the check that Notepad uses for BOM-free UTF-8 that I mentioned in passing here.
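As a rough illustration of that wish, the sketch below looks for a BOM first and then falls back to a strict UTF-8 validity test for BOM-free data. It does not call IsTextUnicode and is not a copy of whatever Notepad does; the function name and the returned labels are assumptions of mine, and the validity test is simply Python's strict UTF-8 decoder standing in for the real checks.

```python
import codecs
from typing import Optional

# UTF-32 comes before UTF-16 because the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_unicode_encoding(data: bytes) -> Optional[str]:
    """Guess a Unicode encoding for data, or return None if nothing fits."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None  # not valid UTF-8, so some legacy code page or binary
    # Valid UTF-8 with at least one non-ASCII byte is strong evidence;
    # pure ASCII is valid in nearly every code page, so report it as such.
    return "utf-8" if any(b >= 0x80 for b in data) else "ascii"
```

The sketch expects the whole file content. Fed only a prefix, it could cut a multi-byte sequence in half and wrongly reject valid UTF-8. It also makes no attempt at the harder case that IsTextUnicode targets, namely UTF-16 with no BOM.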

This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, of course)


