When is a backslash not a backslash?
by Michael S. Kaplan, published on 2005/09/17, original URI: http://blogs.msdn.com/b/michkap/archive/2005/09/17/469941.aspx
The character in question is U+005c, the REVERSE SOLIDUS, also known as the backslash or '\'. It is the path separator for Windows, which is encoded at 0x5c across all of the ANSI code pages.
Since path separators are a pretty important requirement, the title of this post may seem a little scary -- how could it not be a backslash, a reverse solidus?
Well, on Japanese code page 932, 0x5c is the YEN SIGN, and on Korean code page 949, 0x5c is the WON SIGN.
Which is not to say that 0x5c does not act as a path separator -- it still does. And which is also not to say that the Unicode code points for the Yen and the Won (U+00a5 and U+20a9) do act as path separators -- because they do not.
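A minimal sketch of that distinction, using Python's `ntpath` module as a stand-in for Windows path parsing (an assumption; the post itself does not mention Python). The helper name `last_component` and the sample paths are made up for illustration. Only U+005c splits the path; the Yen and Won code points are treated as ordinary file name characters.

```python
import ntpath

BACKSLASH = "\u005c"  # REVERSE SOLIDUS
YEN_SIGN = "\u00a5"   # YEN SIGN
WON_SIGN = "\u20a9"   # WON SIGN


def last_component(path: str) -> str:
    """Return the final path component under Windows-style parsing rules."""
    return ntpath.basename(path)


parts = ["dir", "sub", "file.txt"]
with_backslash = BACKSLASH.join(parts)
with_yen = YEN_SIGN.join(parts)
with_won = WON_SIGN.join(parts)

# U+005c separates components, so only the file name is returned.
print(last_component(with_backslash))
# U+00a5 and U+20a9 are not separators, so the whole string comes back.
print(last_component(with_yen) == with_yen)
print(last_component(with_won) == with_won)
```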
Of course the natural round trip mapping between U+005c and 0x5c happens on all code pages, and both U+00a5 and U+20a9 have one-way 'best fit' mappings to 0x5c on their respective code pages. This requirement technically went away with Unicode, when the characters were encoded separately.
However, this is not simply a case of the old code pages lacking room and Unicode having plenty of it, such that customers would immediately abandon the non-backslash path separators.
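The round trip and the one-way mapping can be sketched roughly as follows. This is an illustration under stated assumptions, not the Windows conversion tables: the `BEST_FIT` dictionary and both helper functions are hypothetical, written by hand to model the two mappings named above, and Python's `cp932` and `cp949` codecs are used only for the byte conversion itself.

```python
# Hypothetical best-fit table modeling only the two mappings described above.
BEST_FIT = {
    "cp932": {"\u00a5": "\u005c"},  # YEN SIGN -> 0x5c on Japanese code page 932
    "cp949": {"\u20a9": "\u005c"},  # WON SIGN -> 0x5c on Korean code page 949
}


def encode_with_best_fit(text: str, codec: str) -> bytes:
    """Fold best-fit characters to U+005c, then encode to the code page."""
    table = BEST_FIT.get(codec, {})
    folded = "".join(table.get(ch, ch) for ch in text)
    return folded.encode(codec)


def round_trip(text: str, codec: str) -> str:
    """Convert Unicode -> code page bytes -> Unicode."""
    return encode_with_best_fit(text, codec).decode(codec)


# U+005c survives the round trip on both code pages.
print(round_trip("C:\u005ctemp", "cp932") == "C:\u005ctemp")
print(round_trip("C:\u005ctemp", "cp949") == "C:\u005ctemp")

# The currency signs do not: they land on 0x5c and come back as U+005c.
print(encode_with_best_fit("\u00a5100", "cp932"))
print(round_trip("\u00a5100", "cp932") == "\u005c100")
print(round_trip("\u20a9100", "cp949") == "\u005c100")
```

The mapping is one-way because nothing in the byte 0x5c records which character it started as, so decoding always yields U+005c.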
In practice, after many years of code page based systems in Japan and Korea using their respective currency symbols as the path separators, it is believed customers were simply used to this appearance. And there was therefore little interest in changing that appearance (when the system settings were Japanese or Korean) to anything but those symbols.
To support this expectation, Japanese and Korean fonts, whenever the default system locale is set to Japanese or Korean, respectively, will display the currency symbol rather than the backslash when U+005c is shown.
But whether or not this is really what customers want is still an open question. Andrew Tuck of PSS here at Microsoft noted:
When one of my customers from Korea was visiting here, I asked him if it bothered him that the backslash doesn’t appear as a backslash. It did bother him, and he believes it bothers most of his countrymen. However, he was fatalistic about it, "What can we do to change it? It’s been this way for a long time. We are used to it."
Hardly a glowing recommendation, is it?
And as Norman Diamond noted in his comments on this very blog (in this post), there are plenty of people in Japan who may not care for the convention, either.
Of course there is no 'right' answer here, and I would imagine that you would find plenty of people who would be unhappy with such a change, just as there are those who would be unhappy with the status quo. Which perhaps explains why the status quo seems to be as it is -- those people who would like a change are resigned to the idea that it may never happen. And so they are now used to it....
This post brought to you by "\", "¥", and "₩" (U+005c, U+00a5, and U+20a9, a.k.a. REVERSE SOLIDUS, YEN SIGN, and WON SIGN)
Chris Lundie on Saturday, September 17, 2005 6:07 AM:
Interesting and confusing! I fired up Word, changed the font to MS Gothic, typed a backslash and indeed it displays a Yen symbol. Changing back to Times, it looks like a backslash again.
Michael S. Kaplan on Saturday, September 17, 2005 9:59 AM:
Hi Chris -- well, if nothing else, it helps give confidence in the idea that the actual code point value does not change, even if the representative glyph does....
Ben Bryant on Saturday, September 17, 2005 11:50 AM:
So when you convert a string like "Yen 100" (where Yen is the symbol) from Shift-JIS to Unicode, the Yen becomes a backslash, and the fact that it was a Yen is unknown. But most users are none the wiser because their font shows the backslash as a Yen. This is bad! I suppose most Japanese Unicode databases must have some major backslash/Yen confusion, where you might have to implement a policy across unrepaired data such as: if you know the string is a file path, treat U+005c as a backslash, otherwise as a Yen (U+00a5). And Unicode string functions that search for the Yen are probably sometimes implemented, due to user demand, to cheat and look for the backslash too. Crazy stuff.
Btw, the Korean ks_c_5601-1987 has an encoding for the Yen, but Japanese Shift-JIS doesn't have one for the Won.
Michael S. Kaplan on Saturday, September 17, 2005 12:06 PM:
Hi Ben -- Although I admit the situation is not ideal, I think it is pretty obvious that the situation would be a *lot* worse if there were no path separators since that would pretty much destroy a lot more.
And it was not really up to anyone but JIS to decide what should be done in the Japanese encoding.
Though if you are using Unicode then you can use the actual Yen character and call it a day! :-)
Ben Bryant on Saturday, September 17, 2005 12:39 PM:
Thanks for the quick reply. But you almost answer as if I was blaming you or anyone. No, I agree the right choice was made in the JIS to Unicode conversion; I am just chiming in on the fact that it is a problematic situation. I do take comfort in the fact that the code point 5c is not changed. Anyway, encoding systems are full of these idiosyncrasies, but the strange thing about this one is that it involves fonts. So many layers of complexity: a character set can be represented in different encodings and displayed differently by different fonts, what fun!
Michael S. Kaplan on Saturday, September 17, 2005 12:57 PM:
No worries, Ben -- I was not taking it as blame or anything. :-)
Claus Brod on Wednesday, September 21, 2005 10:54 AM:
Thanks for the interesting post. I've come across another slightly surprising conversion - the tilde character (0x7e) is mapped 1:1 when converting from CP932 to, say, UTF16 (MultiByteToWideChar). However, when I use libiconv to convert a tilde character, and tell it that the source encoding is "SJIS", it will map the tilde to U+203E. I found a few slightly mysterious references to this behavior, but nothing that would really explain the reasoning behind this mapping...
Steve Loughran on Friday, October 14, 2005 10:29 AM:
This is fascinating. Of course, the whole fact that DOS-derived platforms use \ as a dir separator is itself a bit of a mess -- I've always assumed it was because DOS 1.x used / as an argument prefix, so when directories came along in 2.0, they had to use a different char. This is just another unintended consequence of the first, well, error.
Rune on Friday, October 14, 2005 10:40 AM:
IIRC, the backslash maps to "Ø" using a Norwegian codepage with 7-bit ASCII. Later (8-bit codepage 865) the Ø moved to the same spot normally inhabited by the Yen sign (American cp 437).
So a backslash is probably a lot of things around the world, historically speaking.
Craig Ringer on Friday, October 14, 2005 1:14 PM:
Quite frankly, these days I begin to think that having common characters as path and argument separators is anachronistic and pretty nasty. Any possible choice from the standard character set will be largely arbitrary and will impair "legitimate" use of those characters. It's necessary due to all the legacy code out there and it'd be an unpleasant thing to try to change to dedicated symbols, but that doesn't make it any nicer.
Nonetheless, I still find it annoying that \ (or / on UNIX, or : on Mac OS < 10) are "special" to the system and can't be used in filenames. Similarly the need to quote "multi word arguments" seems silly these days. Dedicated delimiters just for those two purposes, and nothing else, would seem a much nicer way to do it if we got the option to do it all again.
Alas, it's unlikely to ever happen. We'll still want to use 7-bit ASCII serial terminals, still want to use ancient systems that don't understand utf-8 or UCS-2, and so on. Anyway, I swear every time I have to use \these\silly\paths , just as I'm sure many folks here find /these/paths/very/annoying ; getting used to someːotherːpathːseparator (obviously not actually ː , I just use that as an example) would be pretty irritating. Especially having to use some sort of compose sequence or shortcut to type it on "legacy" keyboards...
andypennell on Friday, October 14, 2005 2:02 PM:
Why was \ the dir separator in DOS? A friend of a friend once bought a big wooden desk from a sale on Microsoft campus in the early 90s. The desk had not been cleared out: inside it was a piece of paper containing discussion notes on which separator should be used. IBM was mentioned, but I cannot recall the rest of the details.
mikeb on Friday, October 14, 2005 2:47 PM:
As far as the '\' character being used as the path separator in DOS (starting with DOS 2.0, since 1.x did not support sub directories), I would have to agree with Steve Loughran that the fact that DOS commands used '/' as a command line option 'switch character' is probably the number 1 reason.
Remember that internally and at the API level DOS supports using either '/' or '\' as a path separator (I'm not sure if this agnosticism goes all the way back to DOS 2.0) - it's applications that don't like '/'.
Also remember that early versions of DOS supported setting the 'switch character' to something other than '/'. Unfortunately each DOS application is responsible for parsing its command line, and virtually no 3rd party applications supported the switch character setting (Microsoft applications may have been guilty of this, too).
At some point MS removed the set switch char API - a *very* rare thing for Microsoft to do (just ask Raymond Chen). I mean, the old CP/M compatibility DOS calls are still supported even in the WinXP VDM.
Dewi Morgan on Friday, October 14, 2005 7:06 PM:
Mikeb points out that the problem with path parsing is that it depends on the applications to support it.
It seems strange that people aren't offered a choice of visual style nowadays, though, since 4c is the character that applications expect.
Just need a little checkbox in windows versions affected by the 0x4c issue:
[ ] Display path separator as '\'
So that wherever 4c was displayed, a '\' would be shown instead of a currency symbol.
Michael S. Kaplan on Friday, October 14, 2005 7:37 PM:
I think you meant 0x5c/U+005c, right? :-)
You do have a choice -- this only happens with CJK fonts....
foxyshadis on Friday, October 14, 2005 11:07 PM:
I suppose you could use a tool (eg, fontographer or fontforge) to replace 0x5c in your favorite fonts (tahoma, ms sans serif, maybe times, arial, and verdana) with your favorite path separator. Oh, and paint over your keyboard's \ key. That would be an amusing extension of 'skinning' the OS. =p
koji on Saturday, October 15, 2005 8:19 AM:
IIRC, if you go back a little further, ASCII defines 0x5c as one of the "localizable" code points, and that is why several countries have several different glyphs here.
DOS 2 made a mistake by choosing such a localizable code point as the path separator. Well, I don't think anyone can be blamed for it, though.
Whether we should fix this glyph or not is as good an open question as whether we should fix the path separator to "/", which is not a localizable code point in ASCII.
And, although both are good questions, I don't think anyone could fix either.
not given on Sunday, November 11, 2007 6:02 PM:
Was a workaround ever found for this issue? Is there a way to keep Japanese language support in Windows and have the backslash ( \ ) display correctly in address fields instead of the yen symbol?
It works correctly when typing in forums, but not in the address fields. Why is this?
Michael S. Kaplan on Sunday, November 11, 2007 6:09 PM:
It is all about the font selected and sometimes the technology doing the rendering -- and for every person who considers one behavior to be a bug, there is another who thinks the other is a bug.
Which essentially makes it unfixable, at least for everyone....
Mike on Saturday, January 05, 2008 6:26 PM:
I got the same thing, after playing Clannad the primary fonts seem to be MS Gothic lol, yet it still types as \ here.