Is Excel CSV misusing NLS functionality?

by Michael S. Kaplan, published on 2005/09/17 08:27 -07:00, original URI: http://blogs.msdn.com/michkap/archive/2005/09/17/470413.aspx


In the  newsgroups, Yi Zhang asked:

Hi all

I've encountered the following situation in my work: a customer of my company reported that he can't read a file created in German user locale correctly from a English user locale. This is because German locale uses ',' as the decimal symbol while English uses '.' I tried to experiment with .csv files in Microsoft excel, and it seems that excel will always use current user locale to parse the .csv files, which could end up with incorrect result if the .csv file is created in a different user locale. So this seems to suggest that user should be responsible for setting the correct user locale before he tries to open a file. But this might be too much for the user. Is there a better way to do this? Or user should always be responsible for setting the correct user locale?

Thanks in advance

Yi Zhang

I had vaguely remembered people mentioning something like this before, but I had always been using the Access wizard since it does a better job here.

Anyway, I decided to look in Excel's help to see if it could provide any assistance. Here is what I found in the Excel 2003 help:

You can change the separator character used in both delimited text files and comma separated values (CSV) text files.

Change the delimiter in a delimited text file

For a delimited text file, you can change the delimiter from a Tab character to another character in the second page of the Text Import Wizard. From the same wizard page, you can also change the way consecutive delimiters, such as consecutive quotes, are handled.

Change the separator in a CSV text file

  1. Click the Windows Start menu.
  2. Click Control Panel.
  3. Open the Regional and Language Options dialog box.
  4. Click the Regional Options Tab.
  5. Click Customize.
  6. Type a new separator in the List separator box.
  7. Click OK twice.

Note  After you change the list separator character for your machine, all applications will use the new character. You can change the character back to the original character by using the same procedure.

Hmmmm. Well, I guess this is not quite as bad as instructions for drivers that actually show a picture of the "This driver is not signed..." dialog as part of the instructions to install their driver, but it is about the same order of magnitude. Also it is a little obnoxious to instruct people to change the settings that will affect all applications so that an individual file that may have been sent by someone else with different settings can be imported....

I guess it is why I prefer the Access import text wizard, which lets you specify a different separator (after trying to initially guess at the one to use by looking at the actual file!)

Irregardless, Mihai responded to the post where I put the info from help:

Sorry MichKa, but this is really bad. It is not your fault, I know, it is Excel (or the Excel team)!

The help teaches one to mess-up the settings for all application because Excel does not understand some basic concepts:

- The 10 years old bullet in Nadine Kano's book "All language editions can read one another's documents"

- CSV (Comma Separated Values), means COMMA separated, not "separated by the locale-dependent list separator"

=================

Yi Zhang: my advice would be to try some formats other than CSV. I have done some preliminary tests, and Excel 2003 seem to work ok with XML, Tab delimited text, and Unicode Text (which is also tab delimited). I would vote with XML or Unicode Text (using Unicode, so you will have no problems with code pages and lost characters).

I admit that Mihai has a point here -- one only has but to read the name. But short of that (and to retain backward compatibility), I could live with a configurable setting, so that one can change the behavior for the one import without affecting everything else the user runs. And ideally something that will look at the data and make an intelligent guess would be a good thing....

The thread ended with some input from Louis Solomon:

try open office ...

Now Louis included no information on what Open Office does to make this situation different, for better or worse. So I honestly cannot say what the purpose of his three word response was. I decided to search for his name connected to Open Office in newgroups to see if he had perhaps said more about it somewhere else. But the only one I found was:

Louis Solomon [SteelBytes] wrote:
> The only thing that I can see so far that stops me switching to OpenOffice
> (from MS Office), is lack of English UK (Australian in particular) support.

Not entirely relevant, but it makes for an interesting ounterpoint if he is not using the product (he may be now, but if so he has not been posting on it much under that name)

Anyway, does anyone know if the recommendation can be salvaged? Does Open Office have a particular feature in this area that is easier and less involved than just changing the file extension to .TXT from .CSV?


# Alex on Saturday, September 17, 2005 2:46 PM:

What if you, in excel, select file - Open, choose "All Files" as file type and pick that csv file.

For me it opened up the wizard.

# Stephane on Saturday, September 17, 2005 3:03 PM:


In OpenOffice 2.0 beta 2 (don't know about earlier versions), when you save a spreadsheet as CSV, you can choose the delimiter. That's not what Excel does. (there is also a bug in Excel, the UI keeps telling you that it is saving the csv with the semi-colon as a separator, although it's really using the user locale).


# Mihai on Saturday, September 17, 2005 3:10 PM:

Some configuration somewhere would probably make some users happy. I would be happy :-)

In the same spirit, I would be happy to have Notepad asking be if I want BOM in UTF-8 files :-)

# CornedBee on Saturday, September 17, 2005 4:29 PM:

> In OpenOffice 2.0 beta 2 (don't know about earlier versions), when you save a spreadsheet as CSV, you can choose the delimiter.

OOo 1.1.4 does the same.
It also brings up a wizard on import, just like I remember Excel doing, so I don't know what's the fuss about. (But then, it's been a long time since I used Excel.)

And what's that about OOo not supporting en_UK? Have been using a UK dictionary for the longest time without problems, and haven't noticed any other locale problems either - although admittedly, I'm not British, so I might not notice.
You might have to explicitely download the UK dictionary.

# Ruben on Saturday, September 17, 2005 5:01 PM:

It's not just Excel, it's Word as well that has this obnoxious behavioor. For instance, in Word you can put equation field codes, like {EQ \f(1,2)} which gives you 1/2 but with the one above the 2. Also works with things like roots {EQ \r(3,x)}, sums {EQ \i\su(n=1,5,n)}.

Cool stuf, and it honours your font settings, unlike the equation editor, and doesn't blow up the document size with all that embedded OLE junk.

BUT, guess what happens on European locales? It just shows the text "Error!" No explanation, just "Error!" No workaround either (except the nuke your regional settings stuff).

And when you try to make running heads based on the headings in your text? Pretty standard practice you'd think. Except for Microsoft: opening the document in another language version of Word (not another locale, mind you, just another UI language), all standard styles are renamed, erm localized, and the running heads display "Error, style not defined". No other workaround than not using built-in styles, or installing a different language version for Word.

So you need an English version of XP, a French version of Word, and Dutch regional settings, just to read a silly document from the Walloon (Belgium) department. But a Dutch version of Word with US regional settings for a document from your sysadmin.

Our company happens to have more than one language version of Office across offices, so this happens constantly.

These things really gives me the warm and fuzzies. But I'm sure as hell not going pay $$$ just to be allowed to file a bug report. Heck, who is?

So, in conclusion, I don't think you're getting through to the Office team just yet, Michael. Nifty toolbars are more important, I guess. The world's loss really. Damn Americans. ;-)

# Matthew W. Jackson on Saturday, September 17, 2005 7:38 PM:

I assume the suggestion to use OpenOffice is in regard to the need for a good exchange format for spreadsheets (and not just binrary Excel files, which I hear have their own problems with portability due to NLS "features").

CSV is often used as a document exchange format for simple spreadsheets, but it's really not very good because the "spec" is way too loose. I suppose a CSV "version 2" that required UTF-8 and had very specific rules on delimeters and escaping would help, but only if all competing programs supported it without much hassle. Still, it still wouldn't be good enough, since you cannot specify header cells and data types (not to mention no formatting).

Both the XML-based format of OpenOffice and the upcoming XML-based formats of Microsoft Office are good solutions to the problem. Unfortunately, it remains to be seen how long it will take before A) the next version of Office is released and B) the service packs for Office 2000/XP/2003 are released to support the new XML forma. So if you need your excel spreadsheets in a form that can be easily read/written by external tools, OpenOffice is a very attractive prospect at the moment. It certainly beats CSV or Excel ActiveX Automation.

# Ben Bryant on Sunday, September 18, 2005 11:22 PM:

But CSV means you enclose in double quotes if the value contains a comma. If the data is not broken, I don't understand why people are discussing the delimiter. The issue is simply an issue that has nothing to do with the file format and everything to do with the way a number is stored as a string (whether it is XML,CSV or any text format). In this case the file should contain some sign of what locale it was generated in, or else needs to use a specific convention for storing numeric values as strings.

# Centaur on Monday, September 19, 2005 12:16 PM:

Oh, those really are pet peeves of mine.

Excel CSV export uses regional list separator and decimal separator, instead of comma and decimal point, defeating the purpose of export. Here, Excel definitely abuses NLS. CSV files are meant to be parsed and imported, not read directly by humans.

Another thing that bothers me is that Excel’s function names are translated in localized versions. In Russian Excel, AVERAGE is called СРЗНАЧ, which is wrong on three points. First, it is counter-intuitive for everyone who has used the original version (but then again, localized menu commands are counter-intuitive for me, too). Second, it causes novice users to internalize the translated names, which will cause them pain when they move to another localization. Third, I suppose it is a lot of effort for Excel developers to enable reading files created by a differently localized version.

# Ben Bryant on Monday, September 19, 2005 12:24 PM:

Michael, I don't know how I can say this strongly enough (may be I was not clear in my previous comment). This entire article or yours and all of the comments miss the point, misguided by the original Yi Zhang posting you quoted. Yi Zhang wrote of trouble reading "a file created in German user locale correctly from a English user locale" and then associated the problem with the delimiter. However, it has NOTHING TO DO WITH THE DELIMITER! The problem is the fact that when the number is represented in a string format, the decimal may be either a period or a comma depending on locale. Some comments suggested other formats such as XML, but XML WOULD NOT CHANGE THE PROBLEM! You need to speak to the issue of reading a text number and converting back to a binary number type, otherwise you are compounding this person's problem by going knee deep in the wrong direction. The right questions to answer are these:
1. On export to CSV file in a locale with a comma decimal point Excel puts numeric values containing decimals in quotes, right? i.e. so the CSV format is not broken.
2. On import, will Excel properly handle numbers written with a different decimal point character than the one in the current locale? I assume not.
3. On import, in general does Excel autodetect numbers using the current locale and give the corresponding fields a numeric type?
4. On import, what are the solutions or hooks for intercepting the process of converting those numbers from strings?
5. Or is the only solution a custom import facility that generates the Excel document using OLE or something.
These answers would provide a service to Yi Zhang, otherwise you have only compounded his problem.

# Mihai on Monday, September 19, 2005 12:40 PM:

Ben Bryant--is not only about the decimal separator, is also about the value separator. The comma is replaced by semicolon in some locales.
Example: saving [1][2] from French you get 1;2, then you read on English and get [1;2] ([] = cell)

And is not only about deciding what is the decimal separator, but also about the meaning.
Examples:
what is this: 123,456? Is it [123][456]? Is it 123.456 (with comma as decimal separator)?
What is this: 123;456,789 ? Is it [123][456.789] or is it [123;456][789]?

# Michael S. Kaplan on Monday, September 19, 2005 1:26 PM:

Ben,

I think you are missing the point that others are making here, which is that CSV is not "comma separated values" but "regional options separator separated values".

No one is talking about embedded separators as literals.

# Ben Bryant on Monday, September 19, 2005 4:09 PM:

Thank you so much for your responses Mihai and Michael. I was not aware but now I am dumbfounded by the incredible stupidity of using an OS locale setting as a value separator in a file format. Wow! So if you change a machine's locale you can make all previously exported files unreadable on the same machine they were exported on. That redefines the concept of "short-sighted design" for me.

# Martin Swinkels on Wednesday, March 19, 2008 5:11 AM:

Thanks! This was of great help to me.

# Miguel Ventura on Wednesday, June 04, 2008 7:32 AM:

In Portuguese locale, dot is for grouping and comma for decimals. If I want to export some performance counter log into CSV or tab delimited I won't be able to open it in Excel because it won't get the format correctly. The performance counters CSV files are generated the same way, no matter the locale is...


referenced by

2008/08/22 What good is irony if it can't provide symmetry?

2007/08/15 Excel to Led Zeppelin -- No 'in through the out door'

go to newer or older post, or back to index or month or day