Random clearing of topics from the Suggestion Box

by Michael S. Kaplan, published on 2005/11/02 06:01 -05:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/11/02/487994.aspx


The Suggestion Box seemed to be building up some topics, so I thought I would clear a few of them out here....

Back on May 15th of this year, QFlash asked me:

How to write a txt file as Unicode in .NET?

It is actually pretty straightforward.... if you create a new StreamWriter object, then one of the constructors takes a filename, an append flag, and an Encoding object.

So if you want to create a UTF-16 text file named IamUnicode.txt, you can just use:

using System.IO;
using System.Text;

// false = overwrite rather than append; Encoding.Unicode is UTF-16 little-endian
StreamWriter sw = new StreamWriter("c:\\IamUnicode.txt", false, Encoding.Unicode);

// Write something to the file
sw.WriteLine("something");

sw.Flush();
sw.Close();

and that's it (some people do not call the Flush() method, but it just feels safer to me to know that I have done it before I close; it is probably not needed, since Close() flushes the writer anyway).
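
For anyone who would rather not think about Flush() and Close() at all, here is a minimal sketch of the same idea with a using block, which disposes (and therefore flushes and closes) the writer on the way out. The class name and the temp-file path are made up for illustration, and the sketch assumes the (path, append, encoding) constructor overload of StreamWriter. It then reads the first two bytes back to check that the file really is UTF-16:

```csharp
using System;
using System.IO;
using System.Text;

class UnicodeFileSketch
{
    static void Main()
    {
        // Hypothetical path: a temp file rather than c:\IamUnicode.txt.
        string path = Path.Combine(Path.GetTempPath(), "IamUnicode.txt");

        // Disposing the writer at the end of the using block flushes and closes it.
        using (StreamWriter sw = new StreamWriter(path, false, Encoding.Unicode))
        {
            sw.WriteLine("something");
        }

        // Encoding.Unicode is UTF-16 little-endian, so the file should begin
        // with the byte-order mark FF FE.
        byte[] bytes = File.ReadAllBytes(path);
        Console.WriteLine("{0:X2} {1:X2}", bytes[0], bytes[1]);
    }
}
```

Whether you prefer this or the explicit Flush()/Close() pair is mostly a matter of taste; the using form just makes it harder to forget the cleanup if an exception is thrown in between.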

Now on a somewhat harder note, Martin Kochanski asked me the following back in March:

Before Unicode was as widely used as it is now, users of languages with diacritics had to manage with ASCII (or, if they were lucky, with Latin-1) and whole dialects of character usage grew up as a result. This was especially the case with informal communications such as chats and bulletin boards.

To give the example I know best: Polish needs acute accents on c, s, and z, a dot on the z, tails under a and e, and a line through the lowercase "l", to mention just a few.

Sometimes the accents were left out when they could be inferred, and some adjustments were trivial (e.g. represent acute accent with a following apostrophe) but what was really inspiring was that people worked out that some letters that weren't used in Polish, such as q, v and x, could be co-opted and given consistent meanings in Polish completely unrelated to what they normally mean in Latin scripts: thus if x equalled z-dot (I can't remember whether this was one of the specific equivalences) then a Polish speaker would quickly learn to read x as z-dot without hesitation and to press the x key when he wanted to type z-dot.

The spontaneous evolution of such dialect character sets (the convergent evolution resulting from a strong selection towards mutual comprehensibility) has always struck me as a rather inspiring episode, because it was "bottom-up", driven by need, and not created by committees. The trouble is that once the need disappears, so do the dialects. I'm hoping that someone somewhere is interested enough in the electronic equivalent of "oral history" to be able to capture and codify these ephemeral character sets before they are forgotten even by the people who used them; and it struck me that some of the people who read this blog might have an interest in this bit of history too.

Now this is a fascinating topic, but one that I have to admit I know just about nothing about. Does anyone know of a place where knowledge of all of these kinds of de facto standards might be kept?

Any leads might be interesting or useful....

Another one -- Per Bergland asked just this last August:

I can't understand why after such a long time there's still no support for Unicode .cmd/.bat files in cmd.exe.

Since I often use Swedish åäö in my file paths, I have to either resort to firing up edit.exe in a command prompt window (aka DOS Window) or first create a Unicode version A.txt from which I can easily create B.txt by "type"-ing it:
type A.txt >B.txt (unless of course I started the prompt using cmd /u).

So it's not as if cmd.exe is totally Unicode-unaware. Why not batch file support?

The problem (well, one of the problems, as there are many) with cmd.exe is that there is a lot of backcompat fear surrounding changes to it -- because almost any change that does happen can lead to breaks.

But Per, have you checked out Monad? It is indeed the next generation in the console, and it will support Unicode scripts....

I do not know of any plans to rev cmd.exe in Vista to support this, though. It has to keep running as is, and a major feature like this one is simply a bit too much, I think.

 

This post brought to you by "আ" (U+0986, a.k.a. BENGALI LETTER AA)


# Heath Stewart on 2 Nov 2005 12:14 PM:

I think it's important to note that when using the static properties of the System.Text.Encoding class (i.e., Encoding.UTF8, Encoding.Unicode, etc.) the encodings identified by byte-order marks (BOMs) will include them by default. For text files this is probably wanted but I've answered many questions in various forums where people are writing to, say, a MemoryStream or a NetworkStream where the protocol dictates a certain encoding and the BOM is probably unwanted and will most likely screw things up. The DICT protocol, for example, always expects UTF-8 but a BOM will result in a server error.
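
To illustrate Heath's point, here is a minimal sketch (not taken from his comment) that writes the same text to a MemoryStream twice: once with the static Encoding.UTF8 property, and once with a UTF8Encoding instance constructed with encoderShouldEmitUTF8Identifier set to false. The class and method names are hypothetical, chosen only for this example.

```csharp
using System;
using System.IO;
using System.Text;

class BomSketch
{
    // Writes text to an in-memory stream with the given encoding
    // and returns the resulting bytes.
    static byte[] Encode(string text, Encoding encoding)
    {
        using (MemoryStream ms = new MemoryStream())
        {
            using (StreamWriter sw = new StreamWriter(ms, encoding))
            {
                sw.Write(text);
            }
            // ToArray() still works after the writer has closed the stream.
            return ms.ToArray();
        }
    }

    static void Main()
    {
        // Encoding.UTF8 supplies a preamble, so the writer emits EF BB BF first.
        byte[] withBom = Encode("hi", Encoding.UTF8);

        // new UTF8Encoding(false) asks for UTF-8 with no preamble.
        byte[] withoutBom = Encode("hi", new UTF8Encoding(false));

        Console.WriteLine(BitConverter.ToString(withBom));
        Console.WriteLine(BitConverter.ToString(withoutBom));
    }
}
```

The first line of output should carry three extra leading bytes compared with the second; those are the bytes a strict protocol parser would trip over.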

# Michael S. Kaplan on 2 Nov 2005 1:10 PM:

Good point, Heath. Though that can be a topic for another day, the question here was just about text files. :-)

# Marvin on 2 Nov 2005 1:40 PM:

People who have to use languages with non-Latin alphabets but are stuck with a US-English keyboard (or are just not familiar with other layouts) have long developed various "transliteration" approaches. For example, there is a long tradition of Russian transliteration, which led to tools like this: http://translit.ru/. It is pretty common to see Russian text typed in Latin letters using this convention (or minor variants) on the net.
I myself don't know the Russian keyboard layout well, so I used MKLC to create a keyboard layout that tries to approximate this approach. ;-)

# Jonathan Wilson on 2 Nov 2005 6:48 PM:

How would changing cmd.exe to support Unicode break existing stuff?
Wouldn't it just mean that it would be accepting "all input it does now" plus the now valid Unicode input?

# Michael S. Kaplan on 2 Nov 2005 7:16 PM:

It has been proven that ANY change breaks existing stuff. Any change whatsoever. No matter how innocuous, the 1st tester's axiom has been proven to apply to cmd.exe.

# Serge Wautier on 3 Nov 2005 2:15 AM:

> The trouble is that once the need disappears, so do the dialects.

Very interesting point. I believe though that some such uses will continue to be live for quite some time since they kind of became default behaviour.

Another (simpler) occurrence of this problem is the use of the decimal point instead of the decimal comma.

http://tinyurl.com/86m85

# Suzanne McCarthy on 6 Nov 2005 12:08 AM:

Hi Mike,

I have posted Martin's question re: dialect character sets since I have some readers who are interested in that sort of thing.

http://abecedaria.blogspot.com/2005/11/dialect-character-sets.html

Suz

# Per Bergland on 8 Nov 2005 5:37 PM:

Eh, if you could fix the spelling of my first name, I'd be happy.
Per is a Swedish form of Peter, btw.
And I still don't understand why a Unicode .cmd file with a BOM would cause havoc if it triggered cmd.exe to read 2 bytes at a time instead of 1.
When I look at all the strange extensions to other parts of cmd - e.g. the very strange "for" variants that can even be made to perform like the unix "which" command and find a file in the "PATH" list, or the "set" syntax for string search/replace - I can't understand why adding yet another cmd.exe switch was so hard.
But I guess us non-US users haven't screamed loud enough...

Since I'm typing this on Tiger, I can tell you that the Mac OS X tcsh shell is even worse than cmd.exe in this respect. You can't drop a file onto Terminal and get a correct path if the path contains national characters, and the shell displays "o\314\210" or "o??" for a simple ö. Go figure.

# Michael S. Kaplan on 8 Nov 2005 6:01 PM:

Whoops, sorry about that! (should be fixed now)

The problem with cmd.exe is that any time a change is made, something breaks. It is a rule with very few exceptions.....
