When will this line end? And how?
by Michael S. Kaplan, published on 2005/05/22 19:38 -04:00, original URI: http://blogs.msdn.com/b/michkap/archive/2005/05/22/420890.aspx
I have talked about Chris Walker before.
He is one of the guys behind Notepad.exe for several versions, watching this uber-layer around a Win32 EDIT control be morphed into what some consider to be the most-used plain text editor on the planet.
Often when people complain about behavior of international text in Word or Wordpad, I ask them to try it in Notepad (I can easily determine whether the problem is an issue in Word, RichEdit, or Uniscribe in this way).
Anyway, after the first time I had posted about Notepad, Chris had suggested a bunch of interesting topics, and this post is about one of those topics.
How can you tell how a line ends?
Easy on Windows -- just put in a CARRIAGE RETURN followed by a LINEFEED (U+000d U+000a).
Easy in a completely incompatible way on UNIX platforms -- just a LINEFEED (U+000a) and nothing else (the C standard kind of does this, too, thus the rules about files opened in TEXT mode in the C Runtime).
And also easy in a completely different, completely incompatible way for some Apple systems, which use the CARRIAGE RETURN (U+000d) alone (although the fact that the newer versions have a UNIX base makes me wonder whether all of this is harder on an Apple now given the CR backcompat and the LF platform issue!).
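To make the three conventions concrete, here is a minimal Python sketch that counts each kind of terminator in a buffer and reports the most common one. The function names `count_line_endings` and `dominant_line_ending` are made up for illustration, and the sketch assumes the bytes are in an ASCII-compatible encoding (it would miscount UTF-16 or UTF-32 data).

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF, bare LF, and bare CR terminators in raw bytes.

    Assumes an ASCII-compatible encoding (not UTF-16 or UTF-32).
    """
    crlf = data.count(b"\r\n")
    return {
        "CRLF": crlf,
        # Every CRLF pair contains one LF and one CR, so subtract the pairs
        # to get the terminators that stand alone.
        "LF": data.count(b"\n") - crlf,
        "CR": data.count(b"\r") - crlf,
    }


def dominant_line_ending(data: bytes):
    """Return "CRLF", "LF", or "CR", or None if there are no terminators.

    On a tie, the first name in the order CRLF, LF, CR wins.
    """
    counts = count_line_endings(data)
    name, n = max(counts.items(), key=lambda kv: kv[1])
    return name if n > 0 else None
```

A file from the Unicode Character Database would be expected to report "LF" here, which is exactly the case Notepad shows as one long line.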
As Raymond Chen discussed last year in Why is the line terminator CR+LF?, there are a lot of people who wish that Notepad dealt with files that had only an LF, since lots of text files (such as the ones in the Unicode Character Database) have a .TXT filetype but Notepad cannot open them directly without assuming the whole file is on one line.
But of course it is not Notepad that is responsible for this functionality as much as the system EDIT control, which has its own rules about lines used by messages like EM_GETLINE and EM_GETLINECOUNT. Rules that would need to undergo some pretty big changes if the fundamental plain text definition of a line delimiter on Windows platforms ever changed. It would probably have to be a new set of messages, or a mode for the control. Or people could just use WordPad and the RichEdit control, which does the right thing with different line delimiters already. With some very interesting (where interesting is defined as potentially scary!) performance concerns....
Fixing an occurrence of this problem was actually one of the changes I was able to make in the Microsoft Access Import Text Wizard, which had the same problem for many versions. Then Jet 4.0 came out, with the ability to not only handle the multiple line terminators (which existed before) but also different encodings (which was definitely a new feature). The problem for these prior versions was that the wizard was using VBA's file I/O functions to load its sample text, and VBA is limited to the default system code page and CRLF (so the wizard would either show junk, or throw an error for a single line being too big -- a problem described in KB article 149946). It was a pleasure to fix both problems at the same time by getting away from VBA's inflexible file I/O system here. :-)
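For illustration only, here is roughly what "handle any line terminator and more than one encoding" can look like in Python. This is not how the wizard or Jet does it. The name `load_sample_lines`, the `cp1252` fallback, and the 20-line limit are all assumptions chosen for the sketch, and only BOM-marked Unicode files are recognized.

```python
import codecs
import re

# BOMs to look for. UTF-32 must be checked before UTF-16, because the
# UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# CRLF is listed first so that a pair counts as one terminator, not two.
_TERMINATOR = re.compile(r"\r\n|\r|\n")


def load_sample_lines(data: bytes, fallback: str = "cp1252", limit: int = 20):
    """Decode raw bytes and return up to `limit` lines of sample text.

    The encoding comes from a BOM if there is one, otherwise from
    `fallback` (an assumed stand-in for the default system code page).
    """
    encoding = fallback
    for bom, name in _BOMS:
        if data.startswith(bom):
            encoding = name  # these codecs consume the BOM while decoding
            break
    text = data.decode(encoding, errors="replace")
    lines = _TERMINATOR.split(text)
    if lines and lines[-1] == "":
        lines.pop()  # a trailing terminator does not start another line
    return lines[:limit]
```

Splitting with an explicit pattern rather than `str.splitlines()` is deliberate: `splitlines()` also breaks on several other characters (U+0085 and U+2028 among them), which may or may not be what a given importer wants.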
# Rosyna on 22 May 2005 10:41 PM:
What about the new line/control character U+0085 (http://www.fileformat.info/info/unicode/char/0085/index.htm)?
FWIW, all the OS X text controls (excluding some old, very deprecated QuickDraw controls) will load text with any line ending. New lines get the line format of the file instead of the line format of the OS.
# TheMuuj on 22 May 2005 11:36 PM:
Text control or not, I think notepad should go to some extra trouble to support LF instead of CR-LF. I can't tell you how often I'm annoyed that I have to load WordPad because of a Unix text file. And seeing as CR-LF seems redundant to me (unless you're a printer), I'd have to say that Unix is smarter here.
Now, the big question is, why is it that some pieces of software I have come across over the years have FOUR modes? LF (Unix), CR (Mac), CR-LF (Windows), LF-CR (WTF???)
# Michael S. Kaplan on 23 May 2005 12:36 AM:
Well, one never knows what a future version may bring. But of all of the possible features, CR or LF newline capabilities is not really highest on my list of features I want to see in Notepad.
My #1 wishlist item would be the option to save without a BOM, and my #2 wishlist item would be UTF-32 support (though I am fairly certain that we would have to provide the conversion capability in NLS, first, obviously!).
#3 would be files opened outside of the fileopen dialog to end up in the MRU list under Start|Documents. And so on....
I am not especially tempted to own an Apple, though I will admit that the fact that they stick Limonata in their cafeteria was a temptation to work for them! :-)
# David Smith on 23 May 2005 2:45 AM:
I'm always pleased with your quality posts. Thank you.
# Rosyna on 23 May 2005 4:39 AM:
I really would suggest you take a look (directly) at how OS X handles Unicode stuffs. Not meaning to ruffle feathers or anything but many people only develop to the LCD and right now, wrt unicode, Windows is the LCD since other OS's were "forced" (based on sheer market saturation) to support the windows stuff along with their own.
# Michael S. Kaplan on 23 May 2005 7:05 AM:
As a rule, the people who do work on the product are not the same people as the ones who would do competitive research, as this would lead to the potential for taint (or appearance of taint) in ideas and plans.
But it is outrageous to consider support of Unicode on Windows to be the least common denominator, given the extensive nature of the offerings (especially the extensions involved with the work happening today).
# Maurits [MSFT] on 23 May 2005 11:36 AM:
I must confess I frequently build tab-delimited files where the record delimiter is CRLF, but where individual values contain LFs as internal line breaks. And I use Notepad to verify that the internal line breaks are really just LFs and not full CRLFs. So I am fine with Notepad's current behavior. :)
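A file laid out this way can be parsed by splitting on CRLF only, so that a bare LF inside a value survives as an internal line break. The following is a small Python sketch of that idea; `parse_records` and `read_records` are hypothetical names, and the UTF-8 default is an assumption.

```python
def parse_records(text: str):
    """Split tab-delimited text into records on CRLF only.

    A bare LF is left alone, so it stays inside the field it belongs to.
    """
    if not text:
        return []
    if text.endswith("\r\n"):
        text = text[:-2]  # a final CRLF ends the last record, nothing more
    return [record.split("\t") for record in text.split("\r\n")]


def read_records(path: str, encoding: str = "utf-8"):
    """Read a file without newline translation, then parse it."""
    # newline="" stops Python from folding CRLF and CR into LF on read,
    # which would erase the distinction this format relies on.
    with open(path, encoding=encoding, newline="") as f:
        return parse_records(f.read())
```

For example, `parse_records("a\tline1\nline2\r\nb\tc\r\n")` gives two records, and the second field of the first record still contains its LF.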
# Michael S. Kaplan on 23 May 2005 12:26 PM:
This highlights the reason why the change must be opt-in behavior -- because it is best not to punish people who have expectations about the behavior. :-)
# Serge Wautier on 24 May 2005 3:11 AM:
A few weeks ago, I wrote a rant about CR and LFs.
Michka just wrote about line terminations as well...
# TheMuuj on 27 May 2005 2:50 AM:
Opt-in would be perfectly fine. Why not add a line-break format to the Format or View menu?
Still, I must admit that Windows XP's Notepad is leaps and bounds better than previous offerings. The lack of hotkeys in previous versions was my biggest complaint, and the line number in the status bar is great.
I still wonder what program in its right mind uses LF+CR.
# Michael S. Kaplan on 27 May 2005 3:15 AM:
If it were going to happen (AFAIK it is not, at present?), I think it would likely be an option in the "Save As..." Save file dialog. They are already customizing the template for the encoding dropdown, so adding a dropdown for the Line Return character would be just as feasible.
Though of course that is just the UI, which is the easy part. Adding the functionality would be a special challenge given the implementation of the control. I suspect there would have to be work done in the EDIT control to add an option for multiline controls to select what to use for the line break character(s). And this would be non-trivial....
# TheMuuj on 27 May 2005 10:26 AM:
Actually, saving LF instead of CR+LF is not a big deal to me. I just want to read the files without closing Notepad, right clicking the file, choosing Open With, Wordpad.
If, after opening a Unix text file, I could choose an option to convert all LFs to CR+LFs so that I could read the thing, that would be great. No modifications to the Edit control needed, just a find-and-replace.
Sarcasm: Does Microsoft seriously not use enough cross-platform (or, I suppose, *Nix-centered) applications that this never comes up internally? *grin*
Notepad *used* to load a text file in Wordpad if it was too big (before the size limitation was fixed), perhaps a menu option somewhere to do this manually would also work.
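The find-and-replace conversion suggested in this comment can be sketched in a few lines of Python. The one subtlety is that an LF already preceded by a CR must be left alone, or CRLF files would end up with CR CR LF. The name `lf_to_crlf` is made up for the example, and a bare CR (old Mac style) is deliberately not touched.

```python
import re

# An LF that is not already preceded by a CR.
_BARE_LF = re.compile(r"(?<!\r)\n")


def lf_to_crlf(text: str) -> str:
    """Replace every bare LF with CRLF, leaving existing CRLF pairs as-is."""
    return _BARE_LF.sub("\r\n", text)
```

Running the function twice produces the same result as running it once, so it is safe to apply to a file whose line endings are already mixed.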
# Michael S. Kaplan on 27 May 2005 12:33 PM:
I would turn the question around and wonder why there are so many *nix users who are using notepad to edit their shell scripts? :-)
You have to realize that while what you need is not the full solution I hint at, other people have different needs. And when you combine them together in a way that does not make the architectural design a nightmare, you MUST come up with solutions that are not hacks.
The support right now is in the EDIT control. If it is not then it is really not Notepad anymore, it is a new program.
But like I said, I do not even know of an effort to actually try and solve this problem....
# TheMuuj on 27 May 2005 3:23 PM:
Ha! I don't really edit *nix shell scripts. I have been toying with the new Windows Command Shell beta, and I absolutely LOVE it, but it uses Windows line-end markers.
Anyway, this problem mostly comes up when I load a readme.txt or history.txt that comes with an Open Source program, and the file was written with a *nix text editor.
Which is odd, because the ".txt" extension seems superfluous in Linux, so I assume the extension is there to benefit Windows users...but if the file doesn't load properly in Windows then what's the point?
I used to have the problem all the time when editing Quake III config files, but I haven't done that in a long time.