Working hard to detect code pages

by Michael S. Kaplan, published on 2005/09/11 03:31 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/11/463437.aspx

Yesterday, Buck Hodges was talking about how TFS Version Control determines a file's encoding:

TFS Version Control will automatically detect a file's encoding based upon the following.

First, a file with a Unicode byte order mark (BOM) is added as that particular type (UTF-8, UTF-16 big endian, UTF-16 little endian, etc.).
If a file doesn't have a BOM, we check for an unprintable ASCII character in the first 1 kilobyte of the file. If there is no unprintable ASCII character in there, the encoding is set to the current code page being used, which is Windows-1252 on US English Windows systems.
If an unprintable character is detected, the file is detected as being binary. The unprintable ASCII characters detected are in the range of 0 - 0x1F and 0x7F excluding 0x9 (TAB), 0xA (LF), 0xC (FF), 0xD (CR), and 0x1A (^Z).

The only exception to the foregoing is PDF files. Those are always detected as binary because they are so common and can be all text in the first 1 kilobyte with binary streams later in the file. The detection is based on the signature, "%PDF-", that always appears at the start of a PDF file.
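The rules quoted above can be sketched in a few lines of Python. This is only an illustration of the description, not the TFS implementation: the function name `detect_encoding`, the `"binary"` return value, the `default_codepage` parameter, and the inclusion of UTF-32 BOMs (standing in for the "etc.") are all assumptions made for the example.

```python
import codecs

# Known BOMs, longest first so that UTF-32 LE (FF FE 00 00) is not
# mistaken for UTF-16 LE (FF FE). Including UTF-32 is an assumption.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

# 0x00-0x1F and 0x7F, excluding TAB, LF, FF, CR and ^Z, as in the quote.
_UNPRINTABLE = (set(range(0x20)) | {0x7F}) - {0x09, 0x0A, 0x0C, 0x0D, 0x1A}


def detect_encoding(data: bytes, default_codepage: str = "cp1252") -> str:
    """Hypothetical helper that mirrors the quoted detection rules."""
    # PDF exception: the signature alone marks the file as binary.
    if data.startswith(b"%PDF-"):
        return "binary"
    # A BOM decides the encoding outright.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # No BOM: look for unprintable bytes in the first 1 kilobyte only.
    if any(byte in _UNPRINTABLE for byte in data[:1024]):
        return "binary"
    # Otherwise assume the current code page (cp1252 on US English Windows).
    return default_codepage
```

Under these rules, BOM-free UTF-16 text such as `"hi".encode("utf-16-le")` comes back as `"binary"` because of its 0x00 bytes, and BOM-free UTF-8 is labeled with the default code page.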

It would be so much cooler to try a somewhat more careful algorithm, such as the one that Jet/Access use in text import and link (they use MLang, as I pointed out here). Or at least something more primitive like the one the compiler uses that I discussed back in January -- especially since the list of unprintable characters does not include the code points that are known to be invalid in each code page (which the compiler folks do check).

Ah well, it is no big deal. We ought to be using Unicode anyway, so what would really make me happy is if a future version of TFS would use something like IsTextUnicode (imperfect as it may be!) to detect the times that a file is Unicode without the BOM, perhaps supplemented by the check that Notepad uses for BOM-free UTF-8 that I mentioned in passing here.
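As a rough illustration of what BOM-free detection could look like, here is a minimal Python sketch. It does not reproduce IsTextUnicode or Notepad's actual check: `looks_like_utf8` simply tests for strict UTF-8 validity, `guess_utf16` is a crude null-byte heuristic, and both names and the 40% threshold are assumptions chosen for the example.

```python
from typing import Optional


def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical check: valid UTF-8 with at least one non-ASCII byte."""
    if data.isascii():
        return False  # pure ASCII tells us nothing about UTF-8 vs. a code page
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def guess_utf16(data: bytes) -> Optional[str]:
    """Hypothetical heuristic: mostly-Latin UTF-16 has NULs in every other byte."""
    sample = data[:1024]
    if len(sample) < 2 or b"\x00" not in sample:
        return None
    even_nuls = sample[0::2].count(0)  # NULs at offsets 0, 2, 4, ...
    odd_nuls = sample[1::2].count(0)   # NULs at offsets 1, 3, 5, ...
    half = len(sample) // 2
    # Arbitrary threshold: at least 40% of one byte lane is NUL, the other has none.
    if odd_nuls > half * 0.4 and even_nuls == 0:
        return "utf-16-le"
    if even_nuls > half * 0.4 and odd_nuls == 0:
        return "utf-16-be"
    return None
```

Both checks are statistical guesses and can be fooled, which is the same caveat that applies to IsTextUnicode; the UTF-16 heuristic in particular only works for text dominated by characters below U+0100.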

This post is sponsored by "" U+feff (ZERO WIDTH NO-BREAK SPACE, of course)

