Where short file names can fail

by Michael S. Kaplan, published on 2012/02/20 07:11 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/02/20/10269748.aspx

Functions like GetShortPathName have been around for a long time.

Too long, if you ask me.

Because there are some things that GetShortPathName incorrectly assumes.

For example, it assumes that any characters that fit within the default system code page are okay, despite the fact that one can change the setting and reboot, leading to different results.
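
To make that dependence concrete, here is a minimal sketch in Python. It uses Python's standard cp1251 and cp1252 codecs as stand-ins for two different default system code pages; these codecs are not the tables Windows uses internally, and the helper name `fits_in_code_page` and the sample file name are purely illustrative.

```python
def fits_in_code_page(name: str, code_page: str) -> bool:
    """Return True if every character of name has an exact encoding in code_page."""
    try:
        name.encode(code_page)  # strict mode: no substitutions of any kind
        return True
    except UnicodeEncodeError:
        return False


# Illustrative file name; the Cyrillic letters exist in cp1251 but not in cp1252.
name = "файл.txt"

fits_cyrillic = fits_in_code_page(name, "cp1251")  # "fits" under one default
fits_western = fits_in_code_page(name, "cp1252")   # does not fit under another

# Bytes saved under one code page read back as a different string under another.
stored = name.encode("cp1251")
reinterpreted = stored.decode("cp1252")

print(fits_cyrillic, fits_western, reinterpreted)
```

The same name passes the "fits in the code page" test under one setting and fails it under the other, so the answer is a property of the machine's configuration rather than of the file name.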

But perhaps this incorrect assumption is somewhat forgivable since the system seems to behave the same way, and changing the default system locale and code page is not a common operation.

And there are some things that it does the wrong way.

For example, instead of calling WideCharToMultiByte with the WC_NO_BEST_FIT_CHARS flag, it calls RtlUnicodeStringToAnsiString, which calls RtlUnicodeToMultiByteN, neither of which allows one to opt out of the best fit behavior -- something that befits any file system type functions that want more accurate results!

Now I've ranted about the problems with best fit mappings over the years.

But there is one thing worse than best fit mappings -- and that is file system functions allowing their presence in cases where they shouldn't.

In my opinion, this kind of problem is in some cases a lot less forgivable, since there are some languages that will commonly use letters like đ aka U+0111 aka LATIN SMALL LETTER D WITH STROKE which will best fit map in some code pages to d aka U+0064 aka LATIN SMALL LETTER D.
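
A minimal sketch of why that matters for file names, in Python. The `BEST_FIT` table and the `to_code_page` helper are hypothetical; only the U+0111 to d entry comes from the example above, and Python's cp1252 codec stands in for a code page that lacks đ. This is not how the Windows conversion routines are implemented, just the shape of the problem.

```python
# Hypothetical best-fit table; only the U+0111 -> "d" entry comes from the post.
BEST_FIT = {"\u0111": "d"}


def to_code_page(name: str, code_page: str, best_fit: bool = False):
    """Convert name to bytes in code_page.

    Returns (converted_bytes, exact), where exact is False if any character
    had no true encoding and was substituted.
    """
    out = bytearray()
    exact = True
    for ch in name:
        try:
            out += ch.encode(code_page)  # exact mapping exists
        except UnicodeEncodeError:
            exact = False
            # Best fit silently picks a look-alike; otherwise use "?".
            substitute = BEST_FIT.get(ch, "?") if best_fit else "?"
            out += substitute.encode(code_page)
    return bytes(out), exact


with_stroke, exact_a = to_code_page("\u0111a.txt", "cp1252", best_fit=True)
plain, exact_b = to_code_page("da.txt", "cp1252", best_fit=True)
no_best_fit, exact_c = to_code_page("\u0111a.txt", "cp1252", best_fit=False)

print(with_stroke == plain)  # two different names, one converted result
print(no_best_fit, exact_c)  # the substitution is at least visible
```

With best fit on, two distinct names convert to the same bytes, and anything that only looks at the converted string cannot tell them apart. With best fit off, the result is still not the real name, but the `exact` flag and the `?` make the failure detectable.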

Functions that make incorrect assumptions like "everything that best fit maps in a code page fits in the code page" are awful since in the most extreme cases there are more than twice as many characters with best fit mappings as there are with correct mappings.

In the end, functions that behave this badly should be avoided, for code safety reasons....

John Cowan on 20 Feb 2012 8:18 AM:

In hindsight, generated short names should probably have been restricted to the ASCII repertoire.

Michael S. Kaplan on 20 Feb 2012 9:31 AM:

In some cases, they are -- which makes the situation even more broken!

Simon Buchan on 20 Feb 2012 1:02 PM:

I'm a little confused by the issue, though: as far as I can tell, in the worst case your short names are incomprehensible or misleading... which they can be anyway even without codepages getting involved! To me short names are just a way to get a legal name you have a decent chance of mapping to the true name.

Wait, what happens when the system code-page changes and a bad program has saved short-names? Does the reinterpretation of the saved name always map to the reinterpretation of the file-system name? My gut says yes, but I'm not super confident.

Michael S. Kaplan on 20 Feb 2012 1:29 PM:

If you can't find a filename, that's a Really Bad Thing™. As is having something that used to work break depending on your system locale!

Simon Buchan on 20 Feb 2012 3:43 PM:

Well, sure, but I can't figure out how best-fitting is worse than just a mess of '?'s for finding your files in a list of "~1"s. Is that not what you're saying it's doing?

How does the system locale actually break the short names? My intuition tells me that since decode(bytes, cp) == decode(bytes, cp), no matter what bytes or cp is, that even if files get decoded differently when the ACP changes, programs at least will still find them. Is it that after a locale change that names can clash? E.g. for some str1, str2, cp1, cp2 that decode(encode(str1, cp1), cp2) == decode(encode(str2, cp1), cp2)? That sucks, but it seems unavoidable.

(sorry if this is a repost)

Michael S. Kaplan on 20 Feb 2012 4:06 PM:

~1 and ~2 and so on are unreadable but programmatically functional -- an incorrect file name, on the other hand...
