Where short file names can fail

by Michael S. Kaplan, published on 2012/02/20 07:11 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2012/02/20/10269748.aspx

For example, it assumes that any characters that fit within the default system code page are okay, despite the fact that one can change the setting and reboot, leading to different results.

But perhaps this incorrect assumption is somewhat forgivable since the system seems to behave the same way, and changing the default system locale and code page is not a common operation.

For example, instead of calling WideCharToMultiByte with the WC_NO_BEST_FIT_CHARS flag, it calls RtlUnicodeStringToAnsiString, which calls RtlUnicodeToMultiByteN, neither of which allows one to opt out of the best fit behavior -- something that befits any file system type functions that want more accurate results!

But there is one thing worse than best fit mappings -- and that is file system functions allowing their presence in cases where they shouldn't.

In my opinion, this kind of problem is in some cases a lot less forgivable, since there are some languages that will commonly use letters like đ aka U+0111 aka LATIN SMALL LETTER D WITH STROKE which will best fit map in some code pages to d aka U+0064 aka LATIN SMALL LETTER D.
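To make the đ-to-d problem concrete, here is a minimal Python sketch. Python's built-in codecs do not apply Windows best fit tables, so the conversion is simulated: `BEST_FIT` is a hypothetical one-entry table holding only the mapping mentioned above, and `best_fit_encode` is an illustrative helper, not a real API.

```python
# Hypothetical one-entry excerpt of a best fit table: not the real Windows
# data, only the single mapping discussed in the text (U+0111 -> U+0064).
BEST_FIT = {"\u0111": "d"}


def best_fit_encode(name: str, codec: str = "cp1252") -> bytes:
    """Encode name, silently substituting look-alike characters."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode(codec)
        except UnicodeEncodeError:
            # No exact mapping: use the look-alike, or '?' if there is none.
            out += BEST_FIT.get(ch, "?").encode(codec)
    return bytes(out)


a = best_fit_encode("\u0111a.txt")  # "đa.txt"
b = best_fit_encode("da.txt")
print(a, b, a == b)  # the two distinct names become the same bytes
```

Two names that differ in Unicode end up byte-for-byte identical after the conversion, and the caller gets no signal that anything was changed.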

Functions that make incorrect assumptions like "everything that best fit maps in a code page fits in the code page" are awful since in the most extreme cases there are more than twice as many characters with best fit mappings as there are with correct mappings.
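A stricter test for "fits in the code page" is an exact round trip. The sketch below uses Python's codecs, which fail rather than substitute a look-alike; `fits_code_page` is a hypothetical helper name. On Windows, the rough analogue would be calling WideCharToMultiByte with WC_NO_BEST_FIT_CHARS and checking whether a default character was used, but that is not what this sketch does.

```python
def fits_code_page(text: str, codec: str) -> bool:
    """Return True only if every character maps exactly into codec."""
    try:
        encoded = text.encode(codec, errors="strict")
    except UnicodeEncodeError:
        return False
    # Round-trip to confirm that nothing was altered along the way.
    return encoded.decode(codec) == text


print(fits_code_page("d", "cp1252"))       # plain ASCII letter
print(fits_code_page("\u0111", "cp1252"))  # đ has no exact mapping in 1252
print(fits_code_page("\u0111", "cp1250"))  # đ is a real character in 1250
```

The same character passes under one code page and fails under another, which is why an answer computed under one system locale cannot be assumed to hold after the setting changes.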

In the end, functions that behave this badly should be avoided, for code safety reasons....

I'm a little confused by the issue, though: as far as I can tell, in the worst case your short names are incomprehensible or misleading... which they can be anyway even without codepages getting involved! To me short names are just a way to get a legal name you have a decent chance of mapping to the true name.

Wait, what happens when the system code-page changes and a bad program has saved short-names? Does the reinterpretation of the saved name always map to the reinterpretation of the file-system name? My gut says yes, but I'm not super confident.

If you can't find a filename, that's a Really Bad Thing™. As is having something that used to work break depending on your system locale!

Well, sure, but I can't figure out how best-fitting is worse than just a mess of '?'s for finding your files in a list of "~1"s. Is that not what you're saying it's doing?

How does the system locale actually break the short names? My intuition tells me that since decode(bytes, cp) == decode(bytes, cp), no matter what bytes or cp are, even if files get decoded differently when the ACP changes, programs at least will still find them. Is it that after a locale change names can clash? E.g. for some str1, str2, cp1, cp2, that decode(encode(str1, cp1), cp2) == decode(encode(str2, cp1), cp2)? That sucks, but it seems unavoidable.

(sorry if this is a repost)
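The decode/encode thought experiment in the comment above can be tried directly. This is only a sketch of that model using Python's codecs, with '?' replacement standing in for any lossy conversion. It does not claim to reproduce how a particular file system stores or regenerates short names.

```python
# "é.txt" saved as bytes while the system code page is 1252.
raw = "\u00e9.txt".encode("cp1252")

# The same bytes read back under two different code pages.
as_1252 = raw.decode("cp1252")
as_1251 = raw.decode("cp1251")
print(as_1252, as_1251)  # the displayed name changes, the bytes do not

# Decoding is deterministic, so reinterpretation alone does not merge names.
# A clash needs a lossy encode: two distinct names with no exact mapping.
lossy_1 = "\u0111.txt".encode("cp1252", errors="replace")  # "đ.txt"
lossy_2 = "\u0127.txt".encode("cp1252", errors="replace")  # "ħ.txt"
print(lossy_1, lossy_2, lossy_1 == lossy_2)
```

In this model the clash already exists in the bytes produced under the first code page, so decoding under a second code page cannot undo it.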
