13-May-97 3:48:24-GMT,1383;000000000011
Received: from Unicode.ORG (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA03046
	for ; Mon, 12 May 1997 23:48:23 -0400 (EDT)
Received: by Unicode.ORG (NX5.67g/NX3.0M)
	id AA29045; Mon, 12 May 97 20:13:02 -0700
Message-Id: <9705130313.AA29045@Unicode.ORG>
Errors-To: uni-bounce@Unicode.ORG
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Uml-Sequence: 2583 (1997-5-13 03:12:48 GMT)
To: Multiple Recipients of
Reply-To: "Mark H. David"
From: "Unicode Discussion"
Date: Mon, 12 May 1997 20:12:48 -0700 (PDT)
Subject: Line Separator Character

What is the deal with unicode line separator? Why would I want to use it,
as opposed to using, say, LF or CRLF? Microsoft's CF_UNICODETEXT clipboard
format apparently requires CRLF, and their notepad application displays
black blob characters when you feed it the Unicode line separator. I've
heard reports that Java similarly misdisplays this character, preferring LF
only. What was the idea behind Unicode line separator? Is there any
advantage to using it? It seems to be different just to be different. If I
chose to use LF or CRLF, at least I'd be compatible with many things. This
way I'm compatible with just about nothing. Can anyone provide any further
information or insights?

13-May-97 22:32:30-GMT,2225;000000000001
Received: (from fdc@localhost)
	by watsun.cc.columbia.edu (8.8.5/8.8.5) id SAA06966;
	Tue, 13 May 1997 18:32:26 -0400 (EDT)
Date: Tue, 13 May 97 18:32:25 EDT
From: Frank da Cruz
To: "Mark H. David"
Cc: Multiple Recipients of
Subject: Re: Line Separator Character
In-Reply-To: Your message of Mon, 12 May 1997 20:12:48 -0700 (PDT)
Message-ID:

> What is the deal with unicode line separator? Why would I want to use it,
> as opposed to using, say, LF or CRLF? Microsoft's CF_UNICODETEXT clipboard
> format apparently requires CRLF, and their notepad application displays
> black blob characters when you feed it the Unicode line separator.
> I've heard reports that Java similarly misdisplays this character,
> preferring LF only. What was the idea behind Unicode line separator? Is
> there any advantage to using it? It seems to be different just to be
> different. If I chose to use LF or CRLF, at least I'd be compatible with
> many things. This way I'm compatible with just about nothing. Can anyone
> provide any further information or insights?
>
I suppose that as the one who proposed the Unicode line separator, I should
speak to this one. The following are statements from, or paraphrased from,
the Unicode standard:

 . Unicode encodes plain text;
 . Plain text should contain enough information to permit the text to be
   rendered legibly and nothing more;
 . The appearance of the text depends on an upper level protocol and not
   on ASCII or ISO control characters, which are retained only for
   compatibility.
 . Unicode does not prescribe specific semantics for U+000D (CR) and
   U+000A (LF); it is left to the application to interpret these codes.

In other words, without Line Separator U+2028, there would be no canonical
way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
Why? Because the semantics of CR, LF, CRLF, and other control characters
vary from platform to platform (e.g. Macintosh, UNIX, DOS).

Furthermore, the conventions for separating paragraphs are also platform
and application-specific. Thus the Paragraph Separator, U+2029.
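
The point about platform-dependent line ends can be illustrated with a small
Python sketch. It is only an illustration under stated assumptions: the
helper name to_line_separators is hypothetical, and treating every CR, LF,
or CRLF as a line break is one possible application-level interpretation,
not something the Unicode standard prescribes.

```python
import re

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

# The three platform conventions named above (DOS, Macintosh, UNIX).
_PLATFORM_EOL = re.compile(r"\r\n|\r|\n")


def to_line_separators(text: str) -> str:
    """Replace CRLF, CR and LF with U+2028 (hypothetical helper)."""
    return _PLATFORM_EOL.sub(LS, text)


dos = "Roses are red\r\nViolets are blue"
mac = "Roses are red\rViolets are blue"
unix = "Roses are red\nViolets are blue"

# All three platform spellings collapse to a single representation.
normalized = {to_line_separators(s) for s in (dos, mac, unix)}
print(len(normalized))

# Python's own str.splitlines() already recognizes U+2028 as a line boundary.
print(to_line_separators(dos).splitlines())
```

Once the text is normalized, the receiving application no longer needs to
know which platform produced it.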
- Frank

14-May-97 5:29:38-GMT,1893;000000000011
Received: from unicode.unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id BAA09866
	for ; Wed, 14 May 1997 01:29:37 -0400 (EDT)
Received: by unicode.unicode.org (NX5.67g/NX3.0M)
	id AA01180; Tue, 13 May 97 21:35:01 -0700
Message-Id: <9705140435.AA01180@unicode.unicode.org>
Errors-To: uni-bounce@unicode.unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2598 (1997-05-14 04:34:36 GMT)
To: Multiple Recipients of
Reply-To: Adrian Havill
From: "Unicode Discussion"
Date: Tue, 13 May 1997 21:34:34 -0700 (PDT)
Subject: Re: Line Separator Character

Unicode Discussion wrote:
> In other words, without Line Separator U+2028, there would be no canonical
> way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
> Why? Because the semantics of CR, LF, CRLF, and other control characters
> vary from platform to platform (e.g. Macintosh, UNIX, DOS).
>
> Furthermore, the conventions for separating paragraphs are also platform
> and application-specific. Thus the Paragraph Separator, U+2029.

Does this mean that new applications should refrain from using LF and CR
and use the two new control characters instead? How many Unicode
applications currently understand the Unicode line and paragraph
separators?

As for future Unicode apps, what about Unicode-supporting e-mail apps? Will
the upcoming Netscape Communicator (most popular commercial Unicode capable
e-mail client I can think of) send (and understand) e-mail with the new
markers (providing they're Unicode encoded, of course)?
(targeted towards the Netscape/Unicode group)

--
Adrian Havill
Engineering Division, System Planning & Production Section

14-May-97 6:31:04-GMT,2078;000000000001
Received: from malmo.trab.se (malmo.trab.se [131.115.48.10])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id CAA22840
	for ; Wed, 14 May 1997 02:31:02 -0400 (EDT)
Received: from valinor.malmo.trab.se (valinor.malmo.trab.se [131.115.48.20])
	by malmo.trab.se (8.7.5/TRAB-primary-2) with ESMTP id IAA24548;
	Wed, 14 May 1997 08:31:00 +0200 (MET DST)
Received: by valinor.malmo.trab.se (8.7.5/TRM-1-KLIENT);
	Wed, 14 May 1997 08:30:59 +0200 (MET DST) (MET)
Date: Wed, 14 May 1997 08:30:59 +0200 (MET DST)
From: Dan Oscarsson
Message-Id: <199705140630.IAA20207@valinor.malmo.trab.se>
To: unicode@unicode.unicode.org, fdc@watsun.cc.columbia.edu
Subject: Re: Line Separator Character
Mime-Version: 1.0
Content-MD5: JuyMhI2YbpSiZuTwTb2uNw==
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

> I suppose that as the one who proposed the Unicode line separator, I should
> speak to this one. The following are statements from, or paraphrased from,
> the Unicode standard:
>
>  . Unicode encodes plain text;
>  . Plain text should contain enough information to permit the text to be
>    rendered legibly and nothing more;
>  . The appearance of the text depends on an upper level protocol and not
>    on ASCII or ISO control characters, which are retained only for
>    compatibility.
>  . Unicode does not prescribe specific semantics for U+000D (CR) and
>    U+000A (LF); it is left to the application to interpret these codes.
>
> In other words, without Line Separator U+2028, there would be no canonical
> way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
> Why? Because the semantics of CR, LF, CRLF, and other control characters
> vary from platform to platform (e.g. Macintosh, UNIX, DOS).

Why should we use a new Unicode special character for line separator
when there is a line separator control character: NL (Next Line) defined
in the 0200-0237 range? It would be better to use that instead of CR/LF and
U+2028. It can also be used in 8-bit byte text.

   Dan

14-May-97 11:30:04-GMT,3255;000000000001
Received: from unicode.unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id HAA17906
	for ; Wed, 14 May 1997 07:30:01 -0400 (EDT)
Received: by unicode.unicode.org (NX5.67g/NX3.0M)
	id AA02063; Wed, 14 May 97 03:43:11 -0700
Message-Id: <9705141043.AA02063@unicode.unicode.org>
Errors-To: uni-bounce@unicode.unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-2022-jp
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2600 (1997-05-14 10:41:14 GMT)
To: Multiple Recipients of
Reply-To: Adrian Havill
From: "Unicode Discussion"
Date: Wed, 14 May 1997 03:41:12 -0700 (PDT)
Subject: Re: Line Separator Character

Martin J. Duerst wrote:
> Email has very strict restrictions on this. You can't send doublebyte
> UTF-16 or UCS-2 in Email. CRLF always has to be present as a line
> separator. Unicode in Email is possible with UTF-7 (and CRLF as line
> separator) or UTF-8 + BASE64/QuotedPrintable (and CRLF...).
> Please see RFC 2045/6/7 for this.

I'm aware of this. Allow me to clarify: encode the Unicode line and
paragraph separators in UTF-7 and transmit no CR and LFs.

Some protocols, such as SMTP, have a line limit (998 octets in the case of
SMTP). However, as the behavior of CR and LF is system dependent, an e-mail
client could theoretically ignore CR LF, etc and go by the UTF-7 encoded
Unicode line and paragraph breaks, when RFC2046 says '[i]t should not be
necessary to add any line breaks to display "text/plain" correctly....' So
why not NOT use them and go with the Unicode ones?
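
The proposal above (UTF-7 with Unicode separators and no CR or LF) can be
sketched in Python, whose standard library includes a "utf-7" codec. This is
a sketch of the hypothetical scheme only, not of what the MIME RFCs permit;
the sample message and the name fits_smtp_line are assumptions made for the
illustration.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

SMTP_LINE_LIMIT = 998  # octets per line, as cited in the message above

# Hypothetical message body that uses only the Unicode separators.
message = "first line" + LS + "second line" + PS + "next paragraph"

# UTF-7 turns the separators into 7-bit ASCII sequences.
wire = message.encode("utf-7")
print(wire)


def fits_smtp_line(data: bytes) -> bool:
    """True if the data would fit in a single SMTP line (hypothetical check)."""
    return len(data) <= SMTP_LINE_LIMIT


print(fits_smtp_line(wire))
```

A body with no CR LF at all is a single line as far as SMTP is concerned, so
any message longer than the limit would still need transport-level line
breaks that the receiving client then ignores.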
I admit, I am not clear as to whether this phrase was referring
specifically to the ASCII CR and LF control characters, or was referring to
all types of line breaks in general. Is "plain text" Unicode with Unicode
line breaks considered to be "text/plain" or "text/enriched" (which
requires line breaks)?

As there are few legacy Unicode-capable e-mail clients, is it not possible
to push to get this functionality added now? Many e-mail clients today have
an option which enables them to wrap/not-wrap long lines. Why not add a
similar feature for Unicode capable clients, which allows a selection
(under the "Unicode section") between "interpret CR and LF codes only",
"interpret Unicode line and paragraph breaks only", "interpret both Unicode
line and paragraph breaks AND CR and LF codes." (I'd also like a feature in
future e-mail clients that says "display Unrenderable Unicode as...")

Or am I overlooking something painfully obvious and being obtuse? If so, my
apologies for wasting everybody's time. ;-)

I can see how adding this kind of functionality might confuse the average
end-user. But the current end-user already has to deal with cryptic
functions such as "encode using MIME quoted-printable" or "8-bit", so I
don't see how this functionality could make e-mail clients any more
complicated, especially if the defaults are set for them for Unicode.

Yet another reason why books like "The Complete Moron's Guide to E-Mail"
continue to sell, I guess.
(^_^)

--
Adrian Havill
Engineering Division, System Planning & Production Section

14-May-97 12:33:46-GMT,2283;000000000001
Received: from unicode.unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id IAA27390
	for ; Wed, 14 May 1997 08:33:45 -0400 (EDT)
Received: by unicode.unicode.org (NX5.67g/NX3.0M)
	id AA02381; Wed, 14 May 97 05:05:37 -0700
Message-Id: <9705141205.AA02381@unicode.unicode.org>
Errors-To: uni-bounce@unicode.unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2601 (1997-05-14 12:03:56 GMT)
To: Multiple Recipients of
Reply-To: Dan Oscarsson
From: "Unicode Discussion"
Date: Wed, 14 May 1997 05:03:54 -0700 (PDT)
Subject: Re: Line Separator Character

> On Tue, 13 May 1997, Dan Oscarsson wrote:
>
> > Why should we use a new Unicode special character for line separator
> > when there is a line separator control character: NL (Next Line) defined
> > in the 0200-0237 range. It would be better to use that instead of CR/LF and
> > U+2028. It can also be used in 8-bit byte text.
>
> First: Please don't use octal numbers in an environment where everybody
> is firmly used to hexadecimal. I had quite some problems figuring out
> what you meant with the 0200-0237 range :-).

Well, general use is octal if leading zero, hex if leading 0x, U+ is not
hex, also octal is nicer.

>
> Second: Neither ISO 10646 nor Unicode define the CR control characters.
> While for CL, virtually everybody uses the same assignment, and the
> codepoints are even named in Unicode (but not in ISO 10646), there
> are no stable conventions for CR. Many systems and encodings (Mac,
> Windows, UTF-8) use the CR area for graphic characters.

Yes, but neither the lower nor the upper range of control characters is
defined in ISO 10646, but both places are reserved for them, and there is a
standard for both the upper and lower range.
If we are going to extend the use of control characters it is better to use
the control codes in the 8-bit range, especially if it is something as
important as line separator. Then it can be used in many 8-bit character
sets too. It is unfortunate that Mac, MS Win and UTF-8 have decided to use
the upper control space for other things.

   Dan

14-May-97 17:26:33-GMT,4002;000000000001
Received: from unicode.unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id NAA24800
	for ; Wed, 14 May 1997 13:26:29 -0400 (EDT)
Received: by unicode.unicode.org (NX5.67g/NX3.0M)
	id AA03772; Wed, 14 May 97 10:18:12 -0700
Message-Id: <9705141718.AA03772@unicode.unicode.org>
Errors-To: uni-bounce@unicode.unicode.org
X-Uml-Sequence: 2603 (1997-05-14 17:17:01 GMT)
To: Multiple Recipients of
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion"
Date: Wed, 14 May 1997 10:16:59 -0700 (PDT)
Subject: Re: Line Separator Character

>
> > On Tue, 13 May 1997, Dan Oscarsson wrote:
> >
> > > Why should we use a new Unicode special character for line separator
> > > when there is a line separator control character: NL (Next Line) defined
> > > in the 0200-0237 range. It would be better to use that instead of CR/LF and
> > > U+2028. It can also be used in 8-bit byte text.
> >
> > First: Please don't use octal numbers in an environment where everybody
> > is firmly used to hexadecimal. I had quite some problems figuring out
> > what you meant with the 0200-0237 range :-).
> Well, general use is octal if leading zero, hex if leading 0x, U+ is not hex, also
> octal is nicer.

U+ most assuredly is hex. Not only de facto, but now de jure. I cite from
DAM No. 9 to ISO/IEC 10646-1:1:

"The full syntax of the notation of a short identifier, in Backus-Naur
form, is:

   {U|u}[{+}xxxx|{-}xxxxxxxx]

where "x" represents one hexadecimal digit (0 to 9, A to F, or a to f),..."

And I concur with the respondent.
Some may agree with you that "octal is nicer", but on this list, octal will
generally only confuse instead of communicating.

By the way, octal 0200-0237, for those of you following this issue,
corresponds to U+0080 - U+009F, also known in ISO documents as the C1
range, and referred to below as the "CR area". So what Dan is suggesting is
making use of C1 controls for linebreak control, instead of U+2028 LINE
SEPARATOR.

> >
> > Second: Neither ISO 10646 nor Unicode define the CR control characters.
> > While for CL, virtually everybody uses the same assignment, and the
> > codepoints are even named in Unicode (but not in ISO 10646), there
> > are no stable conventions for CR. Many systems and encodings (Mac,
> > Windows, UTF-8) use the CR area for graphic characters.
> Yes, but neither the lower nor the upper range of control characters is defined
> in ISO 10646, but both places are reserved for them, and there is a standard for
> both the upper and lower range. If we are going to extend the use of
> control characters it is better to use the control codes in the 8-bit range, especially
> if it is something as important as line separator. Then it can be used in
> many 8-bit character sets too. It is unfortunate that Mac, MS Win and UTF-8 have
> decided to use the upper control space for other things.

Use of C1 controls for 8-bit character sets is a logically separate issue
from use of U+2028 LINE SEPARATOR (and U+2029 PARAGRAPH SEPARATOR) in
Unicode.

You may consider it unfortunate, but it is reality that in a world
dominated by IBM, Microsoft, Apple, and even Hewlett-Packard 8-bit
character encodings, most 8-bit data makes use of the 0x80..0x9F range for
graphic characters. Implementations of the ISO 8859 series are the most
notable exceptions.
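
The octal-to-hex correspondence, and the way vendor code pages reuse the C1
range, can be checked with a short Python sketch. The byte 0x85 is chosen
here because it is where NL (Next Line) sits in C1; the three codecs are
assumptions picked as stand-ins for ISO 8859-1, a Microsoft code page, and
an Apple code page.

```python
import unicodedata

# Octal 0200-0237 is the same range as hex 0x80-0x9F.
first, last = 0o200, 0o237
print(f"U+{first:04X}..U+{last:04X}")

# Decode the single byte 0x85 under three 8-bit encodings.
nel = bytes([0x85])
for codec in ("latin-1", "cp1252", "mac_roman"):
    ch = nel.decode(codec)
    # Control characters have no Unicode name, so supply a default.
    print(codec, f"U+{ord(ch):04X}", unicodedata.name(ch, "<control>"))
```

Only the ISO 8859-1 decoding leaves the byte as a C1 control; the other two
map it to a graphic character, which is why a C1-based line separator would
not survive in most 8-bit data.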

And if you want to talk unfortunate, we wouldn't be having nearly so many
problems with the ISO 8-bit character sets if they had been built in the
first place with graphic characters in 0x80..0x9F (an extra 32) instead of
following an ill-conceived ISO 6937 attempt to extend control functions
through character encodings in that space. For example, 8859-1 would have
the French characters that are currently missing in it, and 8859-2 would
not have had to make the ill-starred compromise between Romanian and
Turkish letters!

--Ken Whistler

>
>    Dan
>

15-May-97 0:10:05-GMT,3379;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA04177
	for ; Wed, 14 May 1997 20:10:04 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA04271; Wed, 14 May 97 16:01:54 -0700
Message-Id: <9705142301.AA04271@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2610 (1997-05-14 23:01:30 GMT)
To: Multiple Recipients of
Reply-To: Mark Davis
From: "Unicode Discussion"
Date: Wed, 14 May 1997 16:01:28 -0700 (PDT)
Subject: Re: Line Separator Character

The description of LINE SEPARATOR and PARAGRAPH SEPARATOR should be
clear from the discussions on page 6-72 in The Unicode Standard, Version
2.0.

Anyone using The Unicode Standard, Version 1.0 should "upgrade" to Version
2.0. The full current state of the standard is established by that
document, supplemented by the Errata information on the Unicode web site
(http://unicode.org). (By the way, there is also a listing of the table of
contents on the web site.)

Mark

Unicode Discussion wrote:
>
> On 13 May 97 at 19:49, Frank da Cruz wrote:
>
> >  . Unicode does not prescribe specific semantics for U+000D (CR) and
> >    U+000A (LF); it is left to the application to interpret these codes.
> >
> > In other words, without Line Separator U+2028, there would be no canonical
> > way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
> > Why? Because the semantics of CR, LF, CRLF, and other control characters
> > vary from platform to platform (e.g. Macintosh, UNIX, DOS).
>
> This sounded good until I looked up U2028 and found the name LINE
> SEPARATOR and the comment "may be used to represent this semantic
> unambiguously", but no explanation of what the semantic is! (I am
> quoting from the 1.0 document, so I apologize in advance if this is
> covered in 2.0, which I don't have here.)
>
> Several interpretations of the idea of LINE SEPARATOR are possible,
> the obvious issue being whether a carriage return is implied. The
> various EBCDIC character sets use the NEWLINE (NL, X'15') character
> to mean "move to the leftmost position of the next line"; most ASCII-
> like systems infer one of the motions from the other, or require that
> both be specified (CR,LF). I think this all comes from the different
> mechanical backgrounds: the EBCDIC concept from the IBM 2741
> terminal, which was incapable of executing a carriage return without
> also doing a line feed, but which could line feed and backspace
> independent of carriage return, and the various teletypewriter-like
> devices which generally had no backspace, but could execute
> independent carriage return and line feed.
>
> So what *is* the semantic represented by U2028 ? Is it perhaps a
> higher level semantic than the low level detail of whether to return
> to the originating margin ? If so, then presumably the notion of
> line feed is also at a lower level, and U2028 might be implemented by
> e.g. inserting bullets between the lines of poetry without actually
> spacing down the page. Somehow this seems like the wrong level of
> stuff to be encoding in a character set standard, though.
>
> Tony Harminc
> tzha0@juts.ccc.amdahl.com

15-May-97 0:31:38-GMT,4425;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA11230
	for ; Wed, 14 May 1997 20:31:37 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA04536; Wed, 14 May 97 16:09:45 -0700
Message-Id: <9705142309.AA04536@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2611 (1997-05-14 23:09:27 GMT)
To: Multiple Recipients of
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion"
Date: Wed, 14 May 1997 16:09:26 -0700 (PDT)
Subject: Re: Line Separator Character

>
> On 13 May 97 at 19:49, Frank da Cruz wrote:
>
> >  . Unicode does not prescribe specific semantics for U+000D (CR) and
> >    U+000A (LF); it is left to the application to interpret these codes.
> >
> > In other words, without Line Separator U+2028, there would be no canonical
> > way to represent line breaks, as in (e.g.) poetry, in Unicode plain text.
> > Why? Because the semantics of CR, LF, CRLF, and other control characters
> > vary from platform to platform (e.g. Macintosh, UNIX, DOS).
>
> This sounded good until I looked up U2028 and found the name LINE
> SEPARATOR and the comment "may be used to represent this semantic
> unambiguously", but no explanation of what the semantic is! (I am
> quoting from the 1.0 document, so I apologize in advance if this is
> covered in 2.0, which I don't have here.)

From the Unicode Standard, Version 2.0, p 6-72:

"[discussion of paragraph separator...]
 A line separator indicates that a line-break should occur at this
 point; although the text continues on the next line, it does not
 start a new paragraph: no interparagraph line spacing nor paragraphic
 indentation is applied. Since these are separator codes, it is not
 necessary to start the first line or paragraph, nor end the last line
 or paragraph with them."

In other words, a U+2028 LINE SEPARATOR is to Unicode plain text
formatting approximately as ";" is to Pascal statement syntax.

>
> Several interpretations of the idea of LINE SEPARATOR are possible,
> the obvious issue being whether a carriage return is implied. The
> various EBCDIC character sets use the NEWLINE (NL, X'15') character
> to mean "move to the leftmost position of the next line"; most ASCII-
> like systems infer one of the motions from the other, or require that
> both be specified (CR,LF). I think this all comes from the different
> mechanical backgrounds: the EBCDIC concept from the IBM 2741
> terminal, which was incapable of executing a carriage return without
> also doing a line feed, but which could line feed and backspace
> independent of carriage return, and the various teletypewriter-like
> devices which generally had no backspace, but could execute
> independent carriage return and line feed.

No mechanical background is intended or implied. This is one reason to
depart from the CR/LF/NL control code legacy. The Unicode LINE SEPARATOR
implies a GUI model of text layout and formatting (although it is possible
to implement on a terminal or virtual terminal).

>
> So what *is* the semantic represented by U2028 ? Is it perhaps a
> higher level semantic than the low level detail of whether to return
> to the originating margin ? If so, then presumably the notion of
> line feed is also at a lower level, and U2028 might be implemented by
> e.g. inserting bullets between the lines of poetry without actually
> spacing down the page. Somehow this seems like the wrong level of
> stuff to be encoding in a character set standard, though.

It is the minimum information to encode in plain text that a formatter
(which is at a higher level of abstraction, and which, indeed, has notions
of margins, line advance, etc.) requires to render line and paragraph
breaks at appropriate places.
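
The separator semantics quoted from the standard can be sketched in Python
as a two-level split. This is a minimal illustration, not a formatter: the
function name parse_plain_text and the sample text are assumptions, and a
real layout engine would go on to apply margins and line advance.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def parse_plain_text(text: str) -> list[list[str]]:
    """Split text into paragraphs, each a list of lines (hypothetical helper).

    Because LS and PS are separators rather than terminators, they sit
    between units: no leading or trailing separator is needed.
    """
    return [paragraph.split(LS) for paragraph in text.split(PS)]


poem = "Roses are red" + LS + "Violets are blue" + PS + "A second stanza"
print(parse_plain_text(poem))
# [['Roses are red', 'Violets are blue'], ['A second stanza']]
```

A trailing PS would produce an extra empty paragraph here, which mirrors the
Pascal ";" analogy: the character joins two units rather than ending one.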
While no one wants to encode all kinds of formatting details in plain text
(it belongs in rich or fancy text protocols), neither does anyone want
"plain text" to just be a completely unstructured stream of characters
with no expressed or expressible chunking into lines and paragraphs.

andifintroductionoflineseparatorandparagraphseparatorin
tothecharacterencodingseemsobjectionableforplaintextrem
embertoothatpunctuationcasingandspaceswereaddedtowritin
gsystemstomakethemmorelegiblekenwhistler

>
> Tony Harminc
> tzha0@juts.ccc.amdahl.com
>

15-May-97 1:42:05-GMT,4598;000000000011
Received: from halon.sybase.com (halon.sybase.com [192.138.151.33])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA21873
	for ; Wed, 14 May 1997 21:42:03 -0400 (EDT)
Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35])
	by halon.sybase.com (8.8.4/8.8.4) with SMTP id SAA25446;
	Wed, 14 May 1997 18:03:53 -0700 (PDT)
Received: from birdie.sybase.com by smtp1.sybase.com (4.1/SMI-4.1/SybH3.5-030896)
	id AA04531; Wed, 14 May 97 18:02:03 PDT
Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5)
	id AA15003; Wed, 14 May 1997 18:00:36 -0700
Date: Wed, 14 May 1997 18:00:36 -0700
From: kenw@sybase.com (Kenneth Whistler)
Message-Id: <9705150100.AA15003@birdie.sybase.com>
To: fdc@watsun.cc.columbia.edu
Subject: Re: C0 contorls (was: Line Separator Character)
Cc: unicode@unicode.org, kenw@sybase.com
X-Sun-Charset: US-ASCII

> > Does this mean that new applications should refrain from using LF and CR
> > and use the two new control characters instead? How many Unicode
> > applications currently understand the Unicode line and paragraph
> > separators?
> >
> I would say that this would be the intention of the Unicode standard, in
> which traditional control characters are emphatically deprecated. That would
> include NL also.

Yes. But implementation pressure to keep using CR, LF, or CRLF in their
Unicode forms in plain text may result in other outcomes. Cf.
Murray's note regarding Microsoft's de facto usage.

>
> I'm not saying this was necessarily the best decision. Unicode, although a
> self-proclaimed "plain text" standard, is nevertheless strongly biased towards
> use within systems, rather than between them, and particularly by high-end
> "rendering engines" that can handle all the complexities of composed
> characters, lookahead, and so forth. Control characters are largely intended
> for use in communications, where it has always been necessary to mix pure
> information in-band with control codes.

Although such usage has long been archaic, replaced by clean communication
protocols that transmit arbitrary binary data, or by full-blown device
control languages implemented in plain text (e.g. PostScript). But of
course "archaic" does not mean obsolete, since no computer communication
protocol ever seems to go away. ...Well, maybe paper tape punchcodes. ...

>
> I don't think the status of control characters in Unicode would have been an
> issue if the C0 control characters had been better defined and used
> consistently throughout history. If CR (or LF, or CRLF) always meant "end of
> line", there would have been no need for the Unicode Line Separator, but the
> framers of ASCII did not view it as an internal encoding for files, only as an
> interchange code

Yep. Note that the only C0 control character with an assumed and required
semantics in Unicode 2.0 is U+0009 TAB. Nobody implements a TAB *character*
with other than 0x09, and it seemed superfluous to clone one. U+0009 TAB is
referenced in the normative Unicode bidi algorithm.

> (more thought -- or at least experience -- went into the ISO
> C1 control set, but it never really caught on -- how many file systems have
> you seen in which NL is the line terminator?).

Exactly. The C1 control set is largely ignored, as far as I can tell.

>
> Unicode is the opposite -- it is an internal encoding, but not an interchange
> code.

I disagree with the implication of this.
Unicode is emphatically intended as an interchange code (as well as an
internal encoding, or processing code). It is just not designed to be
consistent with C0/byte-oriented transmission protocols. It is an
interchange code for plain text, in much the same way that GIF is an
interchange code for graphics. I don't much care what layers of other
transmission and communication protocols are involved in packing it up and
delivering it down the wire, as long as it arrives with the same content
that it left with.

> It does not contain the control elements to be one, but rather pushes
> that off on lower levels of the communications architecture (just as it leaves
> rendering issues to higher levels); Unicode is the stuff inside the data
> fields of TCP/X.25/ISDN/etc packets. But it's not the code on the wire
> between a computer and a terminal or a plain-text printer. Thus, unlike ASCII
> or ISO 8859-1 (etc), it can't easily be used in a communications setting
> except in combination with "something else" that packages it up for
> transmission, and another "something else" that renders it.

Agreed.

--Ken Whistler

>
> - Frank
>

15-May-97 3:09:12-GMT,1728;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA02618
	for ; Wed, 14 May 1997 23:09:11 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA05976; Wed, 14 May 97 19:37:23 -0700
Message-Id: <9705150237.AA05976@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
X-Uml-Sequence: 2614 (1997-05-15 02:37:06 GMT)
To: Multiple Recipients of
Reply-To: "Mark H. David"
From: "Unicode Discussion"
Date: Wed, 14 May 1997 19:37:05 -0700 (PDT)
Subject: Re: Line Separator Character

At 04:01 PM 5/14/97 -0700, you wrote:
>The description of LINE SEPARATOR and PARAGRAPH SEPARATOR should be
>clear from the discussions on page 6-72 in The Unicode Standard, Version
>2.0.

Yes, but could this list be used to get practical advice on implementation
and on interpreting what the spec means in the real world?

OK, so Unicode recommends LINE SEPARATOR (LS) with the clear description
alluded to above. And let's say Java AWT does not handle LS. (That's more
or less the report I'm getting, but let's consider this hypothetical for
now.) Can we then conclude that Java AWT is actually not Unicode compliant?
That is, it does handle line separation, but does not assign this semantics
to the appropriate character. I.e., if Java AWT printed black blobs for LF
and for LS, meaning that it just can't understand the concept of line
breaking, that would be technically Unicode compliant, I guess. But if it
actually can do line breaking, but doesn't do it for LS, then that's
non-compliant. Is that correct?

15-May-97 11:12:47-GMT,2670;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id HAA06876
	for ; Thu, 15 May 1997 07:12:46 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA06852; Thu, 15 May 97 03:33:47 -0700
Message-Id: <9705151033.AA06852@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2616 (1997-05-15 10:33:04 GMT)
To: Multiple Recipients of
Reply-To: "Kent Karlsson (\e\d\v \E\D\V \e\x\f \E\X\F \i\I)"
From: "Unicode Discussion"
Date: Thu, 15 May 1997 03:33:03 -0700 (PDT)
Subject: Re: Line Separator Character

> "[discussion of paragraph separator...]
>  A line separator indicates that a line-break should occur at this
>  point; although the text continues on the next line, it does not
>  start a new paragraph: no interparagraph line spacing nor paragraphic
>  indentation is applied. Since these are separator codes, it is not
>  necessary to start the first line or paragraph, nor end the last line
>  or paragraph with them."
>
> In other words, a U+2028 LINE SEPARATOR is to Unicode plain text
> formatting approximately as ";" is to Pascal statement syntax.

No, but U+2029 PARAGRAPH SEPARATOR ("PS") is. I.e., the PS character should
be the normally occurring character to indicate a new paragraph. The LINE
SEPARATOR is intended only for *rare* occasions where a new line is
strongly(?) advised, such as within a poetic verse, or saying "it is good
place to break the line here, but don't start a new paragraph". (This is
similar to a soft hyphen.) I don't know how strong the advice is, since it
says "line-break should...", not "line-break shall...".

If in an HTML-document, I would guess that a U+2029 PARAGRAPH SEPARATOR
should be interpreted *exactly* as a <P>, and a U+2028 LINE SEPARATOR
should be interpreted as a <BR>. Maybe the strength of the advice to break
should differ between <BR> (always break) and LS (perhaps: break here, if a
break is needed and no better place is found), I don't know.

I make no argument as to the good- or ill-advisedness of having these
characters. I just note that they are there, and may (or should) be used.
Also, Unicode is going to be used with "higher level 'protocols'" (such as
HTML), and a clarification of the interpretation of the PS and LS
characters in such contexts is needed, perhaps exemplified with HTML. (Note
that HTML does NOT interpret NL or CR as indicating any kind of line break,
except in special circumstances (<PRE>).)

		/kent karlsson
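
A minimal Python sketch of the mapping Kent guesses at above (PS read as
<P>, LS read as <BR>).  This is an illustration only: the function name
plain_to_html is hypothetical, wrapping every paragraph in <P>...</P> is
an assumption, and nothing in the thread says that any HTML specification
defines this mapping.

```python
import html

LS = "\u2028"  # U+2028 LINE SEPARATOR
PS = "\u2029"  # U+2029 PARAGRAPH SEPARATOR


def plain_to_html(text: str) -> str:
    """Render Unicode plain text as HTML: PS -> <P>, LS -> <BR>."""
    paragraphs = []
    # PS is a separator, so no trailing PS is needed after the last paragraph.
    for para in text.split(PS):
        # Escape markup characters, then join forced lines with <BR>.
        lines = [html.escape(line) for line in para.split(LS)]
        paragraphs.append("<P>" + "<BR>".join(lines) + "</P>")
    return "\n".join(paragraphs)


print(plain_to_html("Roses are red," + LS + "violets are blue." + PS + "Next paragraph."))
```

A line-oriented CR/LF convention could be normalised to PS before calling
the function; that step is left out here because the thread itself
disagrees on how CR and LF should be read.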

15-May-97 16:37:45-GMT,2737;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id MAA05854
	for ; Thu, 15 May 1997 12:37:44 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA07630; Thu, 15 May 97 08:56:34 -0700
Message-Id: <9705151556.AA07630@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2618 (1997-05-15 15:55:59 GMT)
To: Multiple Recipients of 
Reply-To: Frank da Cruz 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 08:55:57 -0700 (PDT)
Subject: Re: C0 controls (was: Line Separator Character)

> > ... Control characters are largely intended
> > for use in communications, where it has always been necessary to mix pure
> > information in-band with control codes.
> 
> Although such usage has long been archaic, replaced by clean communication
> protocols that transmit arbitrary binary data, or by full-blown device
> control languages implemented in plain text (e.g. PostScript). But of
> course "archaic" does not mean obsolete, since no computer communication
> protocol ever seems to go away.
> 
One can argue the merits and tradeoffs of older and newer protocols, but many
of the older ones were quite successful and continue to be by virtue of the
fact that they were unleashed only after a great deal of thought, and often
only after compromise and consensus among diverse groups with conflicting
interests.

I would be very happy if words like "archaic" and "legacy" were dropped from
the lexicon of serious people for use in describing existing practice, and
especially existing practice that conforms to hard-fought and hard-won
national and international standards such as ISO 2022, 8859, or even the early
ANSI standards specifying the use of control characters in communications
protocols, which forms the basis for many of our modern protocols.

These are emotionally-toned marketing terms used by greedy corporations that
want to shame you into discarding systems that work perfectly well and buy new
replacements from them.  Maybe new stuff has its advantages, but personally I
don't think that applying epithets to old stuff is the right way to point that
out.  (This is not directed at Ken -- I'm just airing one of my pet peeves.)

Speaking of which, on the other end of the spectrum is the profligate use of
the word "comply", which once carried some weight because it was used in
connection with the aforementioned hard-won standards, but now is used with
any three-letter acronym that any company can dream up on its own without any
sort of review, quality control, or consensus.

My goodness, nowadays we even have to "comply" with a year!

- Frank

15-May-97 19:16:06-GMT,2249;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id PAA10790
	for ; Thu, 15 May 1997 15:16:01 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08132; Thu, 15 May 97 11:32:46 -0700
Message-Id: <9705151832.AA08132@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Uml-Sequence: 2620 (1997-05-15 18:32:10 GMT)
To: Multiple Recipients of 
Reply-To: "Martin J. Duerst" 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 11:32:09 -0700 (PDT)
Subject: Re: Line Separator Character

On Thu, 15 May 1997, Unicode Discussion wrote:

> I agree with this; the best explanation if you know HTML is:
>
> U+2029 PARAGRAPH      = <P>
> U+2028 LINE SEPARATOR = <BR>
>
> As you say, there needs to be some clarification of the usage of these
> with HTML, since they occupy the same roles.

RFC 2070 has some explanation on some of the "control"-like characters
in Unicode.  The main aim when working on RFC 2070 was to ensure that
some basic quality of display could be achieved for a wide range of
languages, and that where possible, things could be brought into
alignment with Unicode.

Because HTML is not plain text, but plain text with markup, there are
two layers.  The first is what you see in a raw text editor (you see
the markup).  There are line breaks there, but to be consistent with
the rest of HTML around (according to the reference processing model
explained in RFC 2070), these have to be CR, LF, or CRLF.  As explained
above, <P> and <BR> are already here for the second level (what you see
in a browser).

So the above two characters never actually came into play.  If there is
a need for specification, it would only be preemptive (to keep different
people from starting to use them for different purposes).  There is no
place where they currently would be needed.

That was different for other things, such as SHY (where we made a
recommendation in a Note) and all the "control" characters needed for
BIDI and joining (which are very instrumental for certain languages and
scripts).

Any comments welcome.

Regards,	Martin.

15-May-97 20:07:15-GMT,2569;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id QAA22628
	for ; Thu, 15 May 1997 16:07:13 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08194; Thu, 15 May 97 11:36:22 -0700
Message-Id: <9705151836.AA08194@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2621 (1997-05-15 18:35:58 GMT)
To: Multiple Recipients of 
Reply-To: Glen Perkins 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 11:35:57 -0700 (PDT)
Subject: Re: Line Separator Character

Mark Davis wrote:
>
> The description of LINE SEPARATOR and PARAGRAPH SEPARATOR should be
> clear from the discussions on page 6-72 in The Unicode Standard, Version
> 2.0.
>

Actually, the description of LINE SEPARATOR doesn't seem to state
explicitly whether it means "just advance to the next line" or "both
advance to the next line *and* return to the beginning of the line":

From p. 6-72:
"A line separator indicates that a line-break should occur at this
point; although the text continues on the next line, it does not start
a new paragraph: no inter-paragraph line spacing nor paragraphic
indentation is applied."

I assume that "continues on the next line," implies "continues at the
beginning of the next line".
That's what the expression "line-break" means to me, but I'm not
completely sure that it *has* to have that meaning, and that everyone
knows that it has that meaning and no other.  It probably ought to be
stated explicitly since the question of implied CR is answered
differently by unix (LF implies CR) and DOS/Win (LF has a CR welded to
it, at least implying that LF by itself wouldn't return to the beginning
of the following line.)  On old line printers, I had no trouble
linefeeding without returning to the beginning of the line, though I've
long since forgotten the char used to do so (I was but a child.) ;-)

This may just be a nit, but while I'm at it, the definition of the PS
includes "this *could* cause, *for example*,..." [emphasis mine.]  That
sounds as though the PS could just as easily "cause, for example"
something else, so maybe the specific behavior of the LS is also "left
as an exercise for the reader."  Perhaps it *could* include an implied
CR in one implementation and not in another, both conforming to the
standard.  What was the actual intent?

__Glen Perkins__
glen.perkins@NativeGuide.com

16-May-97  1:29:53-GMT,1364;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA14542
	for ; Thu, 15 May 1997 21:29:52 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08926; Thu, 15 May 97 14:04:53 -0700
Message-Id: <9705152104.AA08926@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2624 (1997-05-15 21:01:51 GMT)
To: Multiple Recipients of 
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 14:01:49 -0700 (PDT)
Subject: Re: Line Separator Character

I'll try one more time.

U+2028 LINE SEPARATOR indicates the separation of lines.  A formatter of
Unicode plain text then does with the separated lines what it will do
with separated lines.

It is not intended to be abstruse.
And it should not be considered in the same context as the complexity
caused by the intermingling of device control semantics of CR and LF
(which after all came from the world of *physical* TTY platen and print
head control) and the text formatting semantics of CR, LF, and/or CRLF
in Mac, Unix, and/or the DOS/Win worlds as EOL, EOP, newline, and/or
line separators.

It is precisely because CR and LF are such a mess that Unicode has a
LINE SEPARATOR and a PARAGRAPH SEPARATOR distinctly encoded.

--Ken

16-May-97  1:30:04-GMT,1690;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA14569
	for ; Thu, 15 May 1997 21:30:00 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08662; Thu, 15 May 97 13:07:09 -0700
Message-Id: <9705152007.AA08662@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2623 (1997-05-15 20:06:13 GMT)
To: Multiple Recipients of 
Reply-To: John Cowan 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 13:06:11 -0700 (PDT)
Subject: Re: Line Separator Character

Martin J. Duerst wrote:

> That was different for other things, such as SHY (where we
> made a recommendation in a Note) and all the "control"
> characters needed for BIDI and joining (which are very
> instrumental for certain languages and scripts).

Line Separator and Paragraph Separator are essential for BIDI.
Paragraph Separator delimits the maximum scope of text that the BIDI
algorithm must consider all at once (roughly stated: even if the line
width is infinite, paragraphs are still stacked top to bottom, so there
is no need to reverse any text across a paragraph mark).  Line Separator
also significantly affects BIDI behavior.

That said, I think that the suggestion that LS = <BR> and PS = <P> is
very sensible, and BIDI HTML renderers should be licensed to treat <P>
as PS and <BR> as LS for BIDI purposes.  (This would be a "higher-level
protocol" within the meaning of Unicode 2.0.)

-- 
John Cowan						cowan@ccil.org
			e'osai ko sarji la lojban

16-May-97  1:30:04-GMT,3981;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA14619
	for ; Thu, 15 May 1997 21:30:03 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA08530; Thu, 15 May 97 12:44:39 -0700
Message-Id: <9705151944.AA08530@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2622 (1997-05-15 19:44:16 GMT)
To: Multiple Recipients of 
Reply-To: Murray Sargent 
From: "Unicode Discussion" 
Date: Thu, 15 May 1997 12:44:14 -0700 (PDT)
Subject: FW: Line Separator Character

> I can report what MS Word and some other MS software products do.
> Microsoft text software typically follows Word's lead.  On PCs,
> Unicode and ANSI plain text files use CRLF for the End Of Paragraph
> (EOP) mark.  This is a little different in function from the Unicode
> Paragraph Separator (U+2029), since it can exist without being
> followed by another paragraph.  On the Mac, the plain-text EOP is just
> a CR, whereas on Unix it's just a LF.  Word97 accepts files with all
> three of these choices (but not U+2029, which doesn't translate,
> sigh), and translates them to a CR for internal use (including in its
> object model) and in Word's .doc file format.  Word uses VT (0xB) for
> a line separator.  This is handy, e.g., when you have numbered
> paragraphs and would like to insert a paragraph without the leading
> number.
>
> In RTF (Word's Rich Text Format), CRLFs are used for readability only,
> with \par representing the EOP and \line representing the line
> separator.  Similarly, HTML uses CRLFs for readability only, using
> <BR> for the line separator and various paragraph tags for paragraph
> identification.  For these rich-text formats, the Unicode PS and LS
> have no defined role and really shouldn't even be used.
>
> One advantage of using LF through CR for EOP, etc., is that
> they're relatively efficient to parse: you can single them out as a
> group with a single if statement instead of a more lengthy switch
> statement.  Word uses other ASCII control characters for various
> things, e.g., 0x1F for the soft hyphen (instead of 0xAD, sigh) and 7
> for a table cell end.  Using CRLF particularly for internal use is a
> real pain, since it has some of the navigation problems of DBCS.  Note
> that it's more complicated to handle than the Unicode surrogates,
> since with the latter you always know whether a code is a lead word,
> trail word, or neither.  With CR you have to check to see if it's
> followed by a LF.  It gets worse on PCs: a "soft carriage return",
> i.e., just a word wrap point is represented by the system edit
> controls as a CRCRLF.  So before you can conclude that a CR is an EOP,
> you have to check the two characters that follow!  Similarly for a LF
> you have to check the preceding two characters.  The silver lining in
> all of this is that it's pretty trivial to generalize such text
> software to handle the Unicode surrogates since they can tag along
> with the CRLF code, thereby keeping the caret where it belongs, etc.
>
> Personally I like Word's choices and have used them in the RichEdit
> 2.0 control, but ideally text software should recognize the Unicode
> General Punctuation symbols as well.  RichEdit 2.0, for example, does
> translate U+2029/U+2028 to CR/VT, respectively, on reading in a file
> or pasting plain text.  On plain-text output though, it uses CRLF and
> VT, respectively.
>
> Unfortunately at this late date, there isn't any unique approach to
> these issues.
> ASCII has been an amazingly successful character set,
> but one of its worst deficiencies has been in not specifying a single
> code for an EOP mark.  Unix attempted to remedy the problem by using
> the LF, but it didn't catch on in general.  My favorite among the
> alternatives is the lone CR, which as explained above is the default
> on the Mac and Word.
>
> Murray
>

16-May-97 18:09:29-GMT,2092;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id OAA29734
	for ; Fri, 16 May 1997 14:09:27 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA11467; Fri, 16 May 97 10:29:25 -0700
Message-Id: <9705161729.AA11467@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2626 (1997-05-16 17:28:03 GMT)
To: Multiple Recipients of 
Reply-To: "Pierre Lewis" 
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 10:28:00 -0700 (PDT)
Subject: Re: Line Separator Character

Context: plain text unicode file.

Assuming we use LS to separate lines (I guess there's no answer to the
question "what should I use"), then doesn't that interact negatively
with bidi markup, in particular embedding markups?  Ie. I have to
reestablish the proper embedding level at each line.

Say I have two lines, some English with embedded Yiddish (levels shown
here, in logical order):

  000 0000 00 00000 RLE 11 1111 NL      | English RLE Yiddish NL
  11 11111 1 11111 PDF 00 0000 ...      | Yiddish PDF English ...

Now if the newline (NL in above) is indicated by a LS (\u2028), the
bidi state is reset between the lines.  If I now start the second line
with RLE (so as to say I'm reestablishing an embedding level), I can no
longer tell whether I have one embedded segment or two (with a 0-level
space between, where the LS is).  Could be an issue if I later reformat
(reflow) this text (as I might want to do in an editor).

As a matter of fact, if the second line (after LS) starts with a strong
R2L character and I don't reissue RLE, won't the base level be set to 1?
This would put the following English at level 2 (not intended as the
English isn't embedded in the Yiddish here, but the other way around).

These problems go away if I use any combination of CR/LF to indicate
newline.

Another question: does PS imply LS?  Or would I end a paragraph with
LS PS?  I presume it does.

Thanks in advance for any clarifications.

Pierre
lew@nortel.ca

16-May-97 19:12:00-GMT,1113;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id PAA08918
	for ; Fri, 16 May 1997 15:11:48 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA11815; Fri, 16 May 97 11:49:36 -0700
Message-Id: <9705161849.AA11815@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2627 (1997-05-16 18:49:12 GMT)
To: Multiple Recipients of 
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 11:49:10 -0700 (PDT)
Subject: Re: Line Separator Character

Pierre,

I'll let the bidi experts respond re the first part of your query.

> Another question: does PS imply LS?

Presence of a paragraph separator would imply a line break.  It does
not imply a LS character.

> Or would I end a paragraph with LS PS?

No.  You could, but it would imply presence of a blank line before the
end of the paragraph.

And keep in mind these are *separators*.  You don't end a paragraph with
anything.  You separate two paragraphs by use of a PS.
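
A minimal Python sketch of the separator semantics Ken describes, under
the assumption that a plain-text reader simply splits on the two
characters.  The function name split_paragraphs is hypothetical and the
nested-list result is only one possible representation.

```python
LS = "\u2028"  # U+2028 LINE SEPARATOR
PS = "\u2029"  # U+2029 PARAGRAPH SEPARATOR


def split_paragraphs(text):
    """Return a list of paragraphs, each a list of lines.

    Both characters are treated as separators, not terminators:
    nothing is required before the first or after the last unit.
    """
    return [paragraph.split(LS) for paragraph in text.split(PS)]


# PS alone separates two one-line paragraphs.
print(split_paragraphs("one" + PS + "two"))       # [['one'], ['two']]

# LS followed by PS leaves an empty line at the end of the first
# paragraph, which is the "blank line" mentioned in the reply above.
print(split_paragraphs("one" + LS + PS + "two"))  # [['one', ''], ['two']]
```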
--Ken

16-May-97 22:00:09-GMT,14624;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA11638
	for ; Fri, 16 May 1997 18:00:04 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA12353; Fri, 16 May 97 13:10:08 -0700
Message-Id: <9705162010.AA12353@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 2630 (1997-05-16 20:09:43 GMT)
To: Multiple Recipients of 
Reply-To: Mark Davis 
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 13:09:42 -0700 (PDT)
Subject: Re: Line Separator Character

Pierre,

Doug Felt here at Taligent was kind enough to take a pass at answering
your questions.  His comments are marked with "**".  I have added on in
a few places, marked with "@@", but haven't looked at the examples as
carefully as Doug.

Mark

===========================================

All,

I've been trying to get a clear picture of what a "plain-text unicode
file" should look like (wrt control chars, bidi markup, &c.).  By
"plain-text unicode file" I mean something that would be output by a
plain-text editor, eg. a Unicode-capable vi (Unix) or brief (DOS).  No
HTML or Web implications (altho such an editor could certainly be used
to prepare multi-lingual Web pages).

I have prepared a short text (not semantically very meaningful) with
mixed directionalities so I can ask some concrete questions.  I took the
liberty to attach the GIF to this message (about same size as the text).
Postscript and GIF versions of this text can also be seen at URL

  http://www.centrcn.umontreal.ca/~lewis/LJL/uniplain.html

Below, the text is shown in logical order (and all in English), with an
indication of the language in the postscript page (A=Arabic, E=English,
F=French, G=German, Y=Yiddish), and what I believe the levels should be.

Some examples of dates. In Yiddish, "Monday, the 24th February 1997".
1 E................................E Y............................Y 000000000000000000000000000000000000011111111111122111111111111222200 In German, "Monday, the 24th Febrary 1997". 2 E.......E G...........................G 0000000000002222222222222222222222222222200 In Arabic, "Saturday March 90\3\10" (March 10, 1990) 3 E.......E A....................A E............E 0000000000001111111111111112222222000000000000000000 "Shindler's List", so is called my favorite film. The jew has in the 4 E.............E Y...............................................Y 12222222222222221111111111111111111111111111111111111111111111111111 ring written: "All who preserve one soul of Israel the book makes up to 5 Y..........Y H......................................................H 11111111111111133333333333333333333333333333333333333333333333333333333 him as if he preserved a whole world.". 6 H..................................H 333333333333333333333333333333333333111 The guest has been in Berlin. He has said: "I am 49 years 7 Y........................................Y G...........G 111111111111111111111111111111111111111111112222222222222 old and am called Boutros". This means in Yiddish: "I am old 49 years and 8 G...............G A.....A Y...................Y Y...................Y 2222222222222222223333333111111111111111111111111111111111111221111111111 am called Boutros" (Pierre in French). 9 Y.......Y B.....B F....F Y.......Y 11111111113333333111222222111111111111 Notes: o Translations are fairly literal (and not always very accurate): just for general orientation. And there are surely imperfections in all but the French (with just my name, I'm pretty safe here). o line 3: I'm not too sure what the logical order of the date in Arabic is. Could be 10\3\90 (levels 2212122 -- three level-2 numbers separated by level-1 backslashes) or 90\3\10 (all at level 2). Not too sure of the exact translation of words either. ** The logical order is, in general, the spoken order. 
** The fields of the date
** would probably appear in the order the putative speaker would say them,
** however this is one place where writing and speaking can diverge.  Here
** it depends on the order in which the putative speaker would type them.
** My description of what follows assumes the order you present is correct,
** and the desired appearance is what you present on your web site.
**
** Now as to the levels:  This is very long, bear with me.
**
** Solidus (Slash) U+002F is a European Number Separator (ES).
** Reverse Solidus (Backslash) U+005C is Other Neutral (ON).  You use
** reverse solidus but I'm not sure if this is to represent mirroring (neither
** character is mirrored).  Either way, neither is a strong directional
** character.
**
** If the digits are Roman, by rule P0 all these numbers are treated as
** Arabic Numerals because the preceding strong directional character
** is Arabic text (the 'h' in March).  You may have intended them to be
** Arabic-Indic digits from the start.  Either way, the digits are AN.
**
** If you intended Solidus (ES) this is converted to ON by rule P3.  So
** either solidus or reverse solidus is ON.
**
** ON between AN is converted to R by rule N3(c).
**
** The quoted string on line 3 is thus "L R... AN AN R AN R AN AN L" where
** the L characters are the quote marks surrounding the text.  The
** base line direction is LTR because of the initial L (Roman 'I'), so
** the base level is 0.  In rule I1 the levels thus become
** "0 1... 2 2 1 2 1 2 2 0".  By application of rule L2 this first becomes
** "Saturday March 09\3\01" as the level 2 runs are reversed, then
** "10\3\90 hcraM yadrutaS" as the levels 1&2 run is reversed.
**
** This is not consistent with the output on your web page.  To force the
** date to be formatted left to right assuming this logical order, you'd
** need to force all date characters to L.
** This can be done either using an LRM
** before the first Roman digit, if the digits are Roman, or by surrounding
** the date with LRO..PDF, if the digits are Arabic-Indic.  Note that LRE
** won't work because the reverse solidus, being between two AN, would
** still convert to R, instead of L as desired.
**
** For example, using "Saturday March [LRE]90\3\10[PDF]",
** assuming Arabic-Indic digits, would resolve the levels to
** 01111111111111112443434420, progressively resulting in
**    "Saturday March 09\3\01" -- level 4 reversed
**    "Saturday March 10\3\90" -- levels 3 and above reversed
**    "Saturday March 09\3\01" -- levels 2 and above reversed
**    "10\3\90 hcraM yadrutaS" -- levels 1 and above reversed
** This is a direct result of the fact that the date is not a
** solid run of left-to-right text, because the solidus is still R.
**
** "Saturday March [LRO]90\3\10[PDF]" however would resolve to
** 01111111111111112222222220, progressively resulting in
**    "Saturday March 01\3\09" -- level 2 reversed
**    "90\3\10 hcraM yadrutaS" -- level 1 reversed.

 o Quotes aren't the right ones (some should be low quotes, ...).

Questions

1) Do the levels in the above make sense (plus/minus some punctuation)?
It may be that I've totally misunderstood levels.

** Generally, they make sense, see my discussion above.  Text does not
** necessarily change level simply because of a quotation, or because of
** a change in language.  So in line 2, the level wouldn't change simply
** because of a switch from English to German, since the German
** characters would be L.  Only LRE or LRO would do that.  Since you
** don't indicate strong formatting characters, I'd have to assume they
** were present to force the levels you indicate.

2) When embedding L2R in L2R (eg German in English, line 2) or R2L in
R2L (eg. Arabic in Yiddish, line 9, or Hebrew in Yiddish, line 5),
should I use LRE/PDF and RLE/PDF (even though the direction doesn't
change)?

** Generally, you wouldn't need to.
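
The reordering step that Doug walks through for line 3 (rule L2: from the
highest level down to the lowest odd level, reverse every contiguous run
at that level or higher) can be sketched in Python as below.  This is an
illustration under stated assumptions, not the reference algorithm: the
function name reorder_by_levels is hypothetical, the levels are taken as
already resolved, and mirroring, explicit formatting codes and line
breaking are ignored.

```python
def reorder_by_levels(text, levels):
    """Apply rule-L2-style reversal to one line of logically ordered text."""
    if len(text) != len(levels):
        raise ValueError("need exactly one level per character")
    odd_levels = [lv for lv in levels if lv % 2 == 1]
    if not odd_levels:
        return text  # purely left-to-right: nothing to reverse
    chars = list(text)
    for level in range(max(levels), min(odd_levels) - 1, -1):
        start = 0
        while start < len(chars):
            if levels[start] < level:
                start += 1
                continue
            # Extend the run while characters sit at this level or higher.
            end = start
            while end < len(chars) and levels[end] >= level:
                end += 1
            # Run boundaries are position ranges that stay fixed across
            # passes, so the levels list itself does not need permuting.
            chars[start:end] = chars[start:end][::-1]
            start = end
    return "".join(chars)


# Levels "0 1... 2 2 1 2 1 2 2 0" for the quoted date from Doug's reply.
line3 = '"Saturday March 90\\3\\10"'
levels3 = [0] + [1] * 15 + [2, 2, 1, 2, 1, 2, 2] + [0]
print(reorder_by_levels(line3, levels3))  # "10\3\90 hcraM yadrutaS"
```

The printed result matches the final string in the reply above; the
intermediate "Saturday March 09\3\01" is what the level 2 pass leaves
behind before the level 1 pass reverses the whole run.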

3) The second and third paragraphs are right-aligned (R2L main
direction).  How do I indicate this?  I thought of making each paragraph
a block (separating them with PS, paragraph separator), and starting
each block with a strong char of the appropriate directionality.  In the
second paragraph, this would mean starting the block with RLM (since the
first letters are English).  Ie. if base level is odd, main
directionality is R2L and the text is right aligned.  Or, other
possibility, starting a right-adjusted paragraph with RLE?  But then
what about a left-adjusted paragraph that starts with R2L text?

** Either way would work.  Alignment depends on the base line direction,
** which is determined by the first strong character in the block.  The
** explicit directional formatting codes LRE, RLE, LRO, RLO as well as
** RLM and LRM are all strong directional characters.  LTR text within
** a RLE embedding will still format LTR, but the overall run of text
** within the embedding will be RTL.

4) What should I use to separate lines?  LS or CR or LF or CR/LF?  If I
use LS, which is a block separator, doesn't that interact negatively
with bidi markup (control chars), in particular embedding markups?  Ie.
I have to reestablish the proper level at each line.  And what happens
with right alignment?  Couldn't this cause confusion?

If I have two lines (in logical order)

  000 0000 00 00000 RLE 11 1111 LS      | English RLE Yiddish LS
  11 11111 1 11111 00 0000 ...          | Yiddish English ...

and reissue an RLE at start of second, I can no longer tell whether I
have one embedded segment or two (with a 0-level space between, where
the LS is).  Could be an issue if I later reformat (reflow) this text
(as I might want to do in an editor).

As a matter of fact, if the second line (after LS) starts with a strong
R2L character and I don't reissue RLE, won't the base level be set to 1?
This would put the following English at level 2 (not intended as the
English isn't embedded in the Yiddish here, but the other way around).

(I haven't read the recent thread on LS very carefully yet, but it's not
too reassuring: lots of opinions)

@@ The standard is pretty clear.  Most of those opinions are from people
@@ who have not read it.  Think of these characters in terms of what you
@@ use in a word processor.
@@ For Microsoft Word or FrontPage, think of LS as the
@@ character that you get with shift-Return
@@ (causing no paragraph spacing or indent),
@@ and PS as what you get with Return.
@@ (on the Mac, this would be option-Return).

** This is a good observation!  We believe the current standard is in
** error and should categorize LS as whitespace instead of as a block
** separator.
**
** This would allow LS characters to be inserted wherever whitespace
** appears and not interfere with explicit formatting codes.
**
** That said, the explicit formatting codes are basically intended for static
** text interchange only.  They pose several problems for editing.  One is that it
** is easy to radically alter the text by inserting, copying, or deleting
** one of these codes.  This can reorder the text within the block and
** completely change the text on several lines.  Similarly, the default
** base line direction rule can be problematic, as changes to the text at
** the start of a block can change the base line direction.  Users might
** have difficulty editing unless the editor provides some support (such
** as assisting the user to insert/delete explicit formatting codes and
** their matching PDFs as a unit).

@@ For actual editing of text with different directions, it is far easier to have
@@ out-of-band style information with explicit embedding levels,
@@ as mentioned briefly on page 3-22.

**
** Additionally, text reordering after levels are computed is done on a
** line by line basis.
** Depending on where line breaks occur, different
** text may appear on a line, and in different orders.  This is independent
** of the issue of how to represent line breaks-- if they are represented
** external to the text (a line break table, based on wrapping to some
** width or character count, say) this still happens.  This makes rebreaking
** lines somewhat more of an issue than it is with ASCII text.
**

5) Does PS imply LS?  Or would I end a paragraph with LS PS?

** Yes, use only PS to separate paragraphs.

6) Imagine I want to start the third paragraph on a new page.  Where do
I put the FF (wrt the LS/CR/LF/ and bidi markup in the vicinity)?

** FF is higher-level formatting, you'd have to interpret it separately.

@@ In particular, you would definitely interpret it as a block separator.

7) Any specific bidi markup required around the numerals?  In the Arabic
date: if levels intended are 2212122, would I need extra markup?  I
would think I would need:

  LRO number PDF \ LRO number PDF \ LRO number PDF

(so that the \s, which are "other neutral", stay at level 1)?

** Almost, see my example above.  In your example, the separate runs
** of LTR text would occur in RTL order, reversing the year and day of
** the date from what your example shows.

8) What is the intent (as opposed to the effect which the algo surely
makes clear) of RLE and LRE?  When are they useful?  (Relates to
question 1).

** Quoted text where the text itself contains mixed directions is a common
** case.  You can see it (implicitly) in the examples for rule L2.  The quotes
** logically belong to the surrounding text, and the embedding codes are
** just inside the quotes.

@@ In the vast majority of cases, it is not necessary.  The important cases are
@@ those that Doug mentioned.
@@ RLO and LRO are even more infrequent, and are designed to allow for cases
@@ such as part numbers with mixed numbers and letters, where the character
@@ order is forced.

9) A typesetting question.
Where do quotes belong in mixed-directionality texts (eg. in line 7)?
Should they be at the same level as the text introducing the quote?  Or
at the level of the text being quoted?  On line 7, should the quote be
at the end of the line instead of where I put it (in the PS file)?
Can't say I'm comfortable with either solution.  And what style of
quotes does one use?  That of the quoting or of the quoted language?

** Quotes are at the same level as the text introducing the quote.

@@ In general, you expect the style of the quotes to be the same as the containing
@@ text, not the embedded text.  However, that is up to the user's choice.

Thanks in advance for any clarifications.

Pierre
lew@nortel.ca

16-May-97 22:09:47-GMT,4669;000000000001
Received: from unicode.org (unicode.org [192.195.185.2])
	by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA13190
	for ; Fri, 16 May 1997 18:09:45 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
	id AA12505; Fri, 16 May 97 13:19:41 -0700
Message-Id: <9705162019.AA12505@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
X-Uml-Sequence: 2632 (1997-05-16 20:19:26 GMT)
To: Multiple Recipients of 
Reply-To: "Martin J. Duerst" 
From: "Unicode Discussion" 
Date: Fri, 16 May 1997 13:19:24 -0700 (PDT)
Subject: Re: Line Separator Character

On Fri, 16 May 1997, Pierre Lewis wrote:

> Context: plain text unicode file.

There are basically two models of plain text.  The first is
line-oriented, the second is paragraph-oriented.  Email or program code
is the traditional example of line-oriented plain text.  Descriptive
text as it appears in word processors, minus formatting, is the typical
example of paragraph-oriented plain text.

In traditional encoding (using CR/LF/CRLF) and in "official" Unicode
encoding (using PS), the two models are made compatible by treating
each line in the line-oriented plain text as a paragraph.
On the other hand, the paragraph-oriented model can be reduced to the line-oriented model by splitting lines in a particular layout of the paragraph. This splitting is again done by paragraph separators (CR/LF/CRLF/PS), and not by LS. LS is only used for certain effects in the paragraph-oriented model that occur inside a paragraph. For example, I use it in some word processors to start a new line without having the last line aligned left in a justified paragraph and/or without having the new line indented like the first line of a paragraph. Its use to avoid paragraph interspacing has also been mentioned. In summary, LS is an advanced device for paragraph-oriented plain text, and not to be used for line-oriented plain text. That said, let's now look at BIDI: > Assuming we use LS to separate lines (I guess there's no answer to the > question "what should I use"), then doesn't that interact negatively > with bidi markup, in particular embedding markups? Ie. I have to > reestablish the proper embedding level at each line. > > Say I have two lines, some English with embedded Yiddish (levels shown > here, in logical order): > 000 0000 00 00000 RLE 11 1111 NL | English RLE Yiddish NL > 11 11111 1 11111 PDF 00 0000 ... | Yiddish PDF English ... > > Now if the newline (NL in above) is indicated by a LS (\u2028), the > bidi state is reset between the lines. If I now start the second line > with RLE (so as to say I'm reestablishing an embedding level), I can no > longer tell whether I have one embedded segment or two (with a 0-level > space between, where the LS is). Could be an issue if I later reformat > (reflow) this text (as I might want to do in an editor). > > As a matter of fact, if the second line (after LS) starts with a strong > R2L character and I don't reissue RLE, won't the base level be set to 1? > This would put the following English at level 2 (not intended as the > English isn't embedded in the Yiddish here, but the other way around).
LS is defined as a block separator, so you are right. When you insert an LS to split the lines, your application could insert arbitrary additional codepoints such as RLE. What it does insert (or not) is outside of the Unicode BIDI spec, which only describes static behaviour (what has to happen when the insertions are done), and not dynamic interactive behaviour (which can be a lot more complex if you want it to follow user's expectations, and given that static BIDI is already difficult, I hope you get the point :-). But when you edit BIDI text, you really should work with paragraph-oriented plain text, without additional LSs. Then everything will run more or less smoothly. Reformatting (reflow) is done automatically and correctly. In those cases where you indeed insert LSs, they will in most cases not be in the middle of text, but at some logical interruption point, without the need for frequent reflow. > These problems go away if I use any combinations of CR/LF to indicate > newline. This might be a solution for some very special cases. But in general, for BIDI you should use paragraph-oriented plain text, with CR/LF/ CRLF/PS as paragraph separators. I'm pretty sure that when Microsoft implements BIDI (or the way they already do it), they will treat CR (what they use internally) as a block separator in the BIDI algorithm. Regards, Martin. 16-May-97 22:25:22-GMT,2878;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA15567 for ; Fri, 16 May 1997 18:25:21 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA12695; Fri, 16 May 97 13:39:12 -0700 Message-Id: <9705162039.AA12695@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2634 (1997-05-16 20:38:15 GMT) To: Multiple Recipients of Reply-To: "Martin J. 
Duerst" From: "Unicode Discussion" Date: Fri, 16 May 1997 13:38:13 -0700 (PDT) Subject: Re: Line Separator character On Wed, 14 May 1997, Adrian Havill wrote: > Martin J. Duerst wrote: > > Email has very strict restrictions on this. You can't send doublebyte > > UTF-16 or UCS-2 in Email. CRLF always has to be present as a line > > separator. Unicode in Email is possible with UTF-7 (and CRLF as line > > separator) or UTF-8 + BASE64/QuotedPrintable (and CRLF...). > > Please see RFC 2045/6/7 for this. > > I'm aware of this. Allow me to clarify: encode the Unicode line and > paragraph separators in UTF-7 and transmit no CR and LFs. Some > protocols, such as SMTP, have a line limit (998 octets in the case of > SMTP). SMTP email requires that line breaks be encoded as CRLF for all things that are text (i.e. Content-Type: text/*). The user (or the user agent) is also asked to limit line length to something like 80 characters (actually 80 bytes). > However, as the behavior of CR and LF is system dependent, an e-mail > client could theoretically ignore CR LF, etc and go by the UTF-7 encoded > Unicode line and paragraph breaks, when CR and LF are system dependent, but in mail, it's always CRLF, and mail user agents do the conversion. > RFC2046 says '[i]t should not be necessary to add any line breaks to > display "text/plain" correctly....' That's because text/plain (and all of text/*) is already defined to have these as CRLF, at 'short' intervals. > So why not NOT use them and go with > the Unicode ones? Because that may (or actually will) break some mail software. I know many people don't like that (I don't either), but some things in Internet mail are braindead, and will stay braindead. Too many influential people are too used to the way things are, and too many people are afraid of some software failing to work.
Of course, what you can do is to have your local user agent change from CRLF to whatever line breaking convention you use locally, which might very well be the "true" Unicode codes. > As there are few legacy Unicode-capable e-mail clients, is it not > possible to push to get this functionality added now? The problem is not the clients. The problem is all the software that the mail passes from one client to the other. Regards, Martin. 17-May-97 21:28:56-GMT,4627;000000000011 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA05910 for ; Sat, 17 May 1997 17:28:55 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15437; Sat, 17 May 97 14:09:06 -0700 Message-Id: <9705172109.AA15437@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2642 (1997-05-17 21:08:44 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Sat, 17 May 1997 14:08:43 -0700 (PDT) Subject: Re: Line Separator Character "Martin J. Duerst" wrote: >On Fri, 16 May 1997, Pierre Lewis wrote: > > >> Context: plain text unicode file. > >There are basically two models of plain text. The first is line-oriented, >the second is paragraph-oriented. Email or programm code is the traditional >example of line-oriented plain text. Descriptive text as it appears in >word processors, minus formatting, is the typical example of paragraph- >oriented plain text. > >In traditional encoding (using CR/LF/CRLF) and in "official" Unicode >encoding (using PS), the two models are made compatible by treating >each line in the line-oriented plain text as a paragraph. On the other >hand, the paragraph-oriented model can be reduced to the line-oriented >model by splitting lines in a particular layout of the paragraph. >This splitting is again done by paragraph separators (CR/LF/CRLF/PS), >and not by LS. 
There are actually several other models for files of 7-bit or 8-bit character codes, commonly, but misleadingly, known as ASCII text files. The original model was control of a Teletype machine, where several control characters called for physical movement of the mechanism. Many of the bad habits used in text files are survivals of this model. Others, fortunately, have died out. (I am thinking of some of the uses of control characters in editors meant for hard copy terminals.) CRLF was *required* to initiate a new line, but CR by itself was sometimes used for overstriking (if BS was not available), including underlining and composition of APL characters, and also for imitating typewriter overstrikes such as c| for the cent sign and some accented letters such as u" or e`. HT and FF were very commonly used, and some others, such as SI and SO, less so, but each of these specified a mechanical action. SI and SO allowed a fairly standard way to control some dual-script devices including ASCII/Arabic, ASCII/Cyrillic, APL/ASCII, and other combinations. Many devices used ASCII control characters for new purposes, so that an ASCII character string could specify the hardware behavior needed for bold facing and so on. The actual process of printing might call for translation from a 'text file' to an ASCII command string file which would produce the same printed image by other means. For example, a printer driver for a bidirectional printer could save time by printing alternate lines in reverse order, with LF and some spacing commands between lines. We then had the glass Teletype, or dumb terminal, model, which might treat CR and LF as on mechanical devices, or might treat them both as new line characters, or might do something else. At the same time, 'text files' could still be used to control electronic printers, with varying interpretations of some of the control characters. 
Now, on computers with GUIs, we have different systems that expect CR, or LF, or CRLF, as the new line signal, and have other interpretations of other control characters. System software vendors are going off in all directions inventing new misinterpretations of Unicode characters and constructing yet other file designs. We want to have a uniform, portable definition of the meaning of a file of 16-bit character codes interpreted as Unicode, or "Unicode text file" for short. At the same time, we have several uses for such files, where different interpretations may be desired. If we want to do this right, I think we have to find the appropriate organization for defining such file formats and uses, and get down to some serious and at times difficult standard making. The Unicode character code standard does not seem to be the right place to do this. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 17-May-97 23:00:51-GMT,6375;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA21108 for ; Sat, 17 May 1997 19:00:50 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15658; Sat, 17 May 97 15:40:09 -0700 Message-Id: <9705172240.AA15658@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2643 (1997-05-17 22:39:47 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Sat, 17 May 1997 15:39:45 -0700 (PDT) Subject: Re: Line Separator Character > There are actually several other models for files of 7-bit or 8-bit > character codes, commonly, but misleadingly, known as ASCII text files. > > The original model was control of a Teletype machine, where several control > characters called for physical movement of the mechanism. 
Many of the bad > habits used in text files are survivals of this model. > I wouldn't call them bad habits necessarily. The primary bone of contention here is the distinction between LF and CR... > CRLF was *required* to initiate a new line, but CR by itself was sometimes > used for overstriking (if BS was not available), including underlining and > composition ... > Right. And LF was used by itself to go down one row. > We then had the glass Teletype, or dumb terminal, model, which might treat > CR and LF as on mechanical devices, or might treat them both as new line > characters... > Actually I think that practically all CRTs treat CR and LF just as the TTY did. CR moves the cursor to the left margin of the current row, LF moves it down one row. > Now, on computers with GUIs, we have different systems that expect CR, or > LF, or CRLF, as the new line signal, and have other interpretations of > other control characters. > Really the problem started when the UNIX designers decided that it was a good idea to have a storage model that was different than the transmission model. This allowed some space to be saved on disk, and it made text processing software a bit easier to write. However, it complicated the tty driver by requiring it to substitute CRLF for LF when displaying text files, which in turn has led to all sorts of confusion about "raw" vs "cooked" mode, etc, and the related distinction between NVT vs binary mode in the Telnet protocol. (It is a simplification to say that UNIX was the first disk operating system to store textual files differently than it transmitted them, but it may have been the first *stream-oriented* one to do so -- or at least the one we remember.) Thus CRLF has always been the line terminator in ASCII (in the broad sense of "not EBCDIC") text transmission. Systems that chose to use different internal representations have had the obligation to convert back and forth during transmission.
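The conversion obligation described above can be illustrated with a minimal Python sketch. It assumes LF-terminated local storage (the UNIX convention) and 7-bit ASCII content; the function names to_wire and from_wire are hypothetical and are not taken from any system mentioned in this thread.

```python
def to_wire(text: str) -> bytes:
    """Convert locally stored text to CRLF-terminated bytes for transmission."""
    # Normalize stray CRLF or bare CR first, so the final step never
    # produces a doubled CR (CR CR LF).
    normalized = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalized.replace("\n", "\r\n").encode("ascii")


def from_wire(data: bytes) -> str:
    """Convert CRLF-terminated wire bytes back to the local LF convention."""
    return data.decode("ascii").replace("\r\n", "\n")


if __name__ == "__main__":
    stored = "first line\nsecond line\n"
    wire = to_wire(stored)
    print(wire)
    print(from_wire(wire) == stored)
```

A system that stores CR or CRLF locally would change only the normalization step; the wire format stays CRLF either way.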
It's interesting to speculate how different the world (of computing) might be today if only a few arbitrary and perhaps whimsical decisions had been made differently decades ago: if UNIX and several other popular platforms had used CRLF rather than LF (or CR) as the line terminator; if DOS had used "forward slash" (/) rather than "backward slash" (\) as the directory separator... How many person-eons of effort have gone into addressing the consequences of these decisions... > HT and FF were very commonly used... > (And still are...) Now there's an interesting point. Unicode has addressed the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it sometimes just as necessary to specify a hard page break as it is to specify a hard line or paragraph break? I suppose there must be a boundary somewhere between "Trust your rendering engine" and "Mother, Please! I'd rather do it myself!" I don't have a copy handy, and I might be entirely wrong about this, but isn't the Holy Koran a document that must be paginated in a specific way? In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly difficult for certain kinds of people to operate in the ways to which they have become accustomed over the past decades in which plain text was "good enough" save that one could not put lots of languages into it. For example, today I can write a letter that spills over to one or more "second sheets" in plain text and print it on a plain-text printer without a second thought, using any software at all on any platform, embedding hard line, paragraph, and page breaks in it, just as most of us still do with email (except for the page breaks). No "templates", "wizards", "profiles", "preferences", or "Buzzword-1.0 Compliance" involved. I can move this letter to practically any other platform and it will still be perfectly legible and printable -- no export or import or conversion or version skew to worry about. 
I think a lot of people would be perfectly happy to do the same in a plain-text Unicode world using plain-text Unicode terminals and printers, if there were such things. But there's a bigger issue... The idea that one must embed Unicode in a higher-level wrapper (e.g. a Microsoft Word document, or even HTML) to make it useful has a certain frightening consequence: the loss of any expectancy of longevity for our new breed of documents. These higher-level systems will be overwhelmingly proprietary due to the vast amount of coding that must go into them, the voracious nature of the marketplace, etc, and so formats will become obsolete with ever-increasing frequency, and it will become ever harder to extract the plain-text characters -- the substance -- from them. That which is perceived at a critical moment in time to be worthy of preservation will be converted to the new format, the rest discarded or left for decipherment by future generations of information archaeologists. (If you don't believe this is a problem, think about what is happening to our (physical) libraries all over the world at this moment -- get ready to say goodbye forever to five millennia of history that was not worth digitizing.) (And then to do it all over again when the digital formats and media need conversion in another ten years.) (And then again five years after that, etc...)
So let's do our part and make some effort to accommodate traditional plain-text applications in Unicode, rather than discourage them :-) - Crank (Oops, I mean Frank) 18-May-97 0:13:19-GMT,2045;000000000001 Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA29722 for ; Sat, 17 May 1997 20:13:18 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15880; Sat, 17 May 97 16:56:42 -0700 Message-Id: <9705172356.AA15880@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2644 (1997-05-17 23:56:16 GMT) To: Multiple Recipients of Reply-To: Terry Allen From: "Unicode Discussion" Date: Sat, 17 May 1997 16:56:15 -0700 (PDT) Subject: Re: Line Separator Character Frank da Cruz asked: >(And still are...) Now there's an interesting point. Unicode has addressed the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it sometimes just as necessary to specify a hard page break as it is to specify a hard line or paragraph break? I suppose there must be a boundary somewhere between "Trust your rendering engine" and "Mother, Please! I'd rather do it myself!" I don't have a copy handy, and I might be entirely wrong about this, but isn't the Holy Koran a document that must be paginated in a specific way? It isn't. My Egyptian Qur'an is one continuous text flow; the heading of a surah may even occur right at the bottom of a page. But there are such documents; the example of legal documents was brought up recently wrt SGML style sheets. From an SGML point of view, I want to separate lines and paragraphs in my SGML markup. That's how I'd expect to obtain longevity for the text, not through LS and PS. CR and LF and SGML's difficulty in dealing with them (now redressed partially in XML) are bad enough. In SGML I can't see using LS or PS.
Regards (and thanks for an interesting discussion), Terry Allen Electronic Publishing Consultant tallen[at]sonic.net http://www.sonic.net/~tallen/ Davenport and DocBook: http://www.ora.com/davenport/index.html T.A. at Passage Systems: terry.allen[at]passage.com 18-May-97 8:11:08-GMT,1439;000000000011 Received: from mtshasta.snowcrest.net (mtshasta.snowcrest.net [206.245.192.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA07970 for ; Sun, 18 May 1997 04:11:06 -0400 (EDT) Received: from [206.245.192.57] (ttyD0.mtshasta.snowcrest.net [206.245.192.32]) by mtshasta.snowcrest.net (8.8.5/8.6.5) with ESMTP id BAA00515 for ; Sun, 18 May 1997 01:11:02 -0700 (PDT) X-Sender: cherlin@snowcrest.net Message-Id: In-Reply-To: References: Your message of Sat, 17 May 1997 14:08:43 -0700 (PDT) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Sat, 17 May 1997 18:52:05 -0700 To: Frank da Cruz From: Edward Cherlin Subject: Re: Line Separator Character You wrote: [snip] >So let's do our part and make some effort to accommodate traditional >plain-text applications in Unicode, rather than discourage them :-) > >- Crank (Oops, I mean Frank) As you say. So do you think my suggestion of a formal standard for Unicode text files has merit? -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. 
http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 18-May-97 15:40:32-GMT,1713;000000000001 Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id LAA21787; Sun, 18 May 1997 11:40:31 -0400 (EDT) Date: Sun, 18 May 97 11:40:30 EDT From: Frank da Cruz To: Edward Cherlin Subject: Re: Line Separator Character In-Reply-To: Your message of Sat, 17 May 1997 14:08:43 -0700 (PDT) Message-ID: Oops, never mind -- it was this: > We want to have a uniform, portable definition of the meaning of a file of > 16-bit character codes interpreted as Unicode, or "Unicode text file" for > short. At the same time, we have several uses for such files, where > different interpretations may be desired. If we want to do this right, I > think we have to find the appropriate organization for defining such file > formats and uses, and get down to some serious and at times difficult > standard making. The Unicode character code standard does not seem to be > the right place to do this. > I'm not sure what you're after. I'm mainly concerned about the continued viability of files containing only graphic characters, spaces, line breaks, paragraph breaks, and formfeeds. Plain, literal text that can contain poetry, tables, source code, you name it, and stays like it is. Pretty much what we have today with 7- and 8-bit plain text, except without the confusion over CRLF/CR/LF, etc. I think that what's really valuable about these files is their self-contained and independent expressiveness -- they don't need a rendering engine, they don't need any special transport protocol -- they contain the text and the minimal control information to be transported and understood universally. 
- Frank 19-May-97 3:06:29-GMT,1723;000000000001 Received: from orpheus.amdahl.com (orpheus.amdahl.com [129.212.11.6]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA09584 for ; Sun, 18 May 1997 23:06:28 -0400 (EDT) Received: from minerva.amdahl.com by orpheus.amdahl.com with smtp (Smail3.1.29.1 #3) id m0wTImI-0001JvC; Sun, 18 May 97 20:06 PDT Received: from juts.ccc.amdahl.com by minerva.amdahl.com with smtp (Smail3.1.29.1 #5) id m0wTIm0-0002ChC; Sun, 18 May 97 20:06 PDT Received: by juts.ccc.amdahl.com (/\../\ Smail3.1.14.4 #14.6) id ; Sun, 18 May 97 20:06 PDT Message-Id: Comments: Authenticated sender is From: "Tony Harminc" To: "Unicode Discussion" , fdc@watsun.cc.columbia.edu Date: Sun, 18 May 1997 23:04:41 -0400 MIME-Version: 1.0 Content-type: text/plain; charset=US-ASCII Content-transfer-encoding: 7BIT Subject: Re: Line Separator Character Priority: normal In-reply-to: <9705172240.AA15682@unicode.org> X-mailer: Pegasus Mail for Win32 (v2.52) On 17 May 97 at 15:39, Frank da Cruz wrote: > It's interesting to speculate how different the world (of computing) might be > today if only a few arbitrary and perhaps whimsical decisions had been made > differently decades ago: if UNIX and several other popular platforms had used > CRLF rather than LF (or CR) as the line terminator; if DOS had used "forward > slash" (/) rather than "backward slash" (\) as the directory separator... How > many person-eons of effort have gone into addressing the consequences of these > decisions... If the original IBM PC had used EBCDIC instead of ASCII... 
Tony Harminc 19-May-97 17:48:10-GMT,5906;000000000001 Return-Path: Received: from halon.sybase.com (halon.sybase.com [192.138.151.33]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA16182 for ; Mon, 19 May 1997 13:48:05 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by halon.sybase.com (8.8.4/8.8.4) with SMTP id KAA10672; Mon, 19 May 1997 10:51:14 -0700 (PDT) Received: from birdie.sybase.com by smtp1.sybase.com (4.1/SMI-4.1/SybH3.5-030896) id AA06870; Mon, 19 May 97 10:49:25 PDT Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA17679; Mon, 19 May 1997 10:47:55 -0700 Date: Mon, 19 May 1997 10:47:55 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9705191747.AA17679@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Unicode plain text (Was: Line Separator Character) Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII Crank, er... Frank, >> HT and FF were very commonly used... >> >(And still are...) Now there's an interesting point. Unicode has addressed >the CR/LF/CRLF confusion with LS and PS, but what about formfeed? Isn't it >sometimes just as necessary to specify a hard page break as it is to specify a >hard line or paragraph break? You can still use U+000C FORM FEED in Unicode plain text, and a renderer that knows about page breaks can do the "right thing", namely whatever it did with ^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to be ambiguous enough in usage (unlike CR/LF) to require any separate encoding in Unicode. > In any case, the strong Use-A-GUI thrust of Unicode will make it increasingly > difficult for certain kinds of people to operate in the ways to which they > have become accustomed over the past decades in which plain text was "good > enough" save that one could not put lots of languages into it. The goal of Unicode plain text is to recapture that portability in the encoding, but also allow you to put lots of languages into it. 
The "Use-A-GUI thrust" of Unicode acknowledges the fact that rendering of complex scripts (including the Latin script with generative use of combining marks) requires logic that is much more amenable to implementation in a GUI framework than in a terminal model. However, appropriate (and very large and useful) subsets of Unicode *can* be implemented with simple rendering models. (Cf. Windows NT until very recently. :-) ) > I can move this letter to practically any > other platform and it will still be perfectly legible and printable -- no > export or import or conversion or version skew to worry about. I think a lot > of people would be perfectly happy to do the same in a plain-text Unicode > world using plain-text Unicode terminals and printers, if there were such > things. That is exactly what Unicode plain text is all about. And, by the way, Notepad on Windows NT was pretty close to being a "plain-text Unicode terminal". > The idea that one must embed Unicode in a higher level wrapper (e.g. a > Microsoft Word document, or even HTML) to make it useful has a certain > frightening consequence: the loss of any expectancy of longevity for our new > breed of documents. There is absolutely nothing new about this. I was warning my linguistic colleagues about the longevity of their documents when they started using WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed stable enough and was widely enough implemented to retain easy transmissibility across the computer generations without the intervention of information archaeologists. Well, 16-bit Unicode plain text is aimed at no less a goal than being the universal wide-ASCII plain text of the 21st century. Grumpy aside: This goal is not helped by people who treat Unicode as a standards dumping ground for assigning numbers to everybody's favorite collection of junk vaguely related to text, or who try to infiltrate mechanisms (such as language tags) that do not belong in plain text. 
> So let's do our part and make some effort to accommodate traditional > plain-text applications in Unicode, rather than discourage them :-) I agree completely. An excellent example of the appropriate place for a Unicode plain-text editor would be a Java IDE. If someone writes a good Unicode plain-text editor for such an application, it would have wider applicability. (I know I often use the editors of C++ IDE's to create (ASCII) plain text when I don't want it all gummed up as a Word or Frame document.) Ed Cherlin commented: > We want to have a uniform, portable definition of the meaning of a file of > 16-bit character codes interpreted as Unicode, or "Unicode text file" for > short. At the same time, we have several uses for such files, where > different interpretations may be desired. If we want to do this right, I > think we have to find the appropriate organization for defining such file > formats and uses, and get down to some serious and at times difficult > standard making. The Unicode character code standard does not seem to be > the right place to do this. I disagree about the last point. A Unicode plain text file consists of a stream of Unicode characters (and nothing else), interpreted according to the Unicode standard. It should be marked with an initial U+FEFF (though technically that is optional). This much is already clear from the standard, as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal, unambiguous, plain text formatting consistent with the bidi algorithm. The situation is complicated by the two possible byte orders (which is one reason for the U+FEFF) and by the fact that the most widely implemented variant, namely that in Windows NT, chose LSB order instead of MSB order. But other than that, there is not much more to be said about a Unicode plain text file. The usefulness of the concept lies in its simplicity. 
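As a rough illustration of the file model just described, here is a minimal Python sketch that uses an initial U+FEFF to choose the byte order and then splits the text on PARAGRAPH SEPARATOR and LINE SEPARATOR. The function names are hypothetical, and treating unmarked data as MSB-first is an assumption of this sketch, not something stated in the message.

```python
import codecs

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def decode_unicode_text(data: bytes) -> str:
    """Decode 16-bit Unicode text, using an initial U+FEFF to pick byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # LSB-first, the order the message attributes to Windows NT.
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # Assumption for this sketch: unmarked data is MSB-first.
    return data.decode("utf-16-be")


def paragraphs_and_lines(text: str) -> list:
    """Split text into paragraphs on PS, then each paragraph into lines on LS."""
    return [paragraph.split(LS) for paragraph in text.split(PS)]


if __name__ == "__main__":
    sample = "line one" + LS + "line two" + PS + "second paragraph"
    raw = codecs.BOM_UTF16_LE + sample.encode("utf-16-le")
    print(paragraphs_and_lines(decode_unicode_text(raw)))
```

The sketch deliberately ignores CR, LF, and FF; a reader that must also accept legacy line breaks would need to decide how to map them onto the same structure.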
--Ken Whistler 20-May-97 20:29:52-GMT,4480;000000000011 Return-Path: Received: from mtshasta.snowcrest.net (mtshasta.snowcrest.net [206.245.192.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA02464 for ; Tue, 20 May 1997 16:29:41 -0400 (EDT) Received: from [206.245.192.36] (ttyD23.mtshasta.snowcrest.net [206.245.192.67]) by mtshasta.snowcrest.net (8.8.5/8.6.5) with ESMTP id NAA01464; Tue, 20 May 1997 13:29:30 -0700 (PDT) X-Sender: cherlin@snowcrest.net Message-Id: In-Reply-To: References: Your message of Sat, 17 May 1997 14:08:43 -0700 (PDT) Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Date: Mon, 19 May 1997 23:57:56 -0700 To: Frank da Cruz From: Edward Cherlin Subject: Unicode plain text standard? (was Re: Line Separator Character) Cc: unicode@Unicode.ORG >Oops, never mind -- it was this: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. >> >I'm not sure what you're after. I'm mainly concerned about the continued >viability of files containing only graphic characters, spaces, line breaks, >paragraph breaks, and formfeeds. Plain, literal text that can contain >poetry, tables, source code, you name it, and stays like it is. I can tell you don't know what table building in Sanskrit is like, and you don't understand BIDI direction marking. >Pretty much what we have today with 7- and 8-bit plain text, except without >the confusion over CRLF/CR/LF, etc. 
and the utter incompatibility of the extra 128 characters in the 8-bit sets between PC DOS, PC Windows, Mac, various Unix definitions, and all the other extended ASCII code sets such as PC code pages and the ISO 8859 series. Files of 8-bit characters are extremely non-portable. Having lived in Korea and Japan, and been a mathematician and APL programmer, I lost all faith in ASCII long ago. It is horribly inadequate for English, and more so for almost any other language, except for various computer programming languages and constructed languages like Lojban, which were deliberately built within the limits of ASCII, or in the old days EBCDIC. >I think that what's really valuable about >these files is their self-contained and independent expressiveness -- they >don't need a rendering engine, they don't need any special transport protocol >-- they contain the text and the minimal control information to be transported >and understood universally. >- Frank I agree on the transport protocol in principle, although today we need UTF-7, UTF-8, and other encodings, but the idea of full Unicode text without a rendering engine won't fly. That's fine for simple alphabetic scripts, and even for Chinese and Japanese. It doesn't work right for RTL scripts (Arabic and Hebrew), especially for mixtures of RTL and LTR, and for scripts that combine characters into larger groups, usually syllables. This includes Korean, all of the Indic scripts, Tibetan, and Ethiopic. Arabic script has a very large dependence on ligatures, some of them quite complex. There are also problems for rendering math expressions in plain text. Then there are various deprecated characters, the private use areas, and the surrogate character mechanism. Anyone who thought the CRLF business was bad should consider how many incompatible choices can be made in Unicode. 
Yes, it is true that the Unix file model of a sequence of uninterpreted bytes is very general, and so is a file of uninterpreted 16-bit codes, but files have to be interpreted to be useful. We gloss over the amount of interpretation we do on ASCII text files, but we cannot do that with Unicode. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 20-May-97 21:39:41-GMT,7559;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA20335 for ; Tue, 20 May 1997 17:39:38 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25440; Tue, 20 May 97 13:31:38 -0700 Message-Id: <9705202031.AA25440@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2653 (1997-05-20 20:29:36 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Tue, 20 May 1997 13:29:34 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) kenw@sybase.com (Kenneth Whistler) wrote: [snip] >You can still use U+000C FORM FEED in Unicode plain text, and a renderer that >knows about page breaks can do the "right thing", namely whatever it did with >^L for an ASCII text. FORM FEED, like HORIZONTAL TAB, was not considered to >be ambiguous enough in usage (unlike CR/LF) to require any separate encoding >in Unicode. > >> In any case, the strong Use-A-GUI thrust of Unicode will make it >>increasingly >> difficult for certain kinds of people to operate in the ways to which they >> have become accustomed over the past decades in which plain text was "good >> enough" save that one could not put lots of languages into it. 
> >The goal of Unicode plain text is to recapture that portability in the >encoding, but also allow you to put lots of languages into it. The "Use-A-GUI >thrust" of Unicode acknowledges the fact that rendering of complex scripts >(including the Latin script with generative use of combining marks) requires >logic that is much more amenable to implementation in a GUI framework than in >a terminal model. However, appropriate (and very large and useful) subsets of >Unicode *can* be implemented with simple rendering models. (Cf. Windows NT >until very recently. :-) ) > >> I can move this letter to practically any >> other platform and it will still be perfectly legible and printable -- no >> export or import or conversion or version skew to worry about. I think >>a lot >> of people would be perfectly happy to do the same in a plain-text Unicode >> world using plain-text Unicode terminals and printers, if there were such >> things. The Everson Mono fonts would suit such a product admirably, up to a point. >That is exactly what Unicode plain text is all about. And, by the way, >Notepad on Windows NT was pretty close to being a "plain-text Unicode >terminal". > >> The idea that one must embed Unicode in a higher level wrapper (e.g. a >> Microsoft Word document, or even HTML) to make it useful has a certain >> frightening consequence: the loss of any expectancy of longevity for our new >> breed of documents. > >There is absolutely nothing new about this. I was warning my linguistic >colleagues about the longevity of their documents when they started using >WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed >stable enough and was widely enough implemented to retain easy >transmissibility >across the computer generations without the intervention of information >archaeologists. Well, 16-bit Unicode plain text is aimed at no less a >goal than being the universal wide-ASCII plain text of the 21st century. 
> [snip] > >> So let's do our part and make some effort to accommodate traditional >> plain-text applications in Unicode, rather than discourage them :-) > >I agree completely. An excellent example of the appropriate place for >a Unicode plain-text editor would be a Java IDE. If someone writes >a good Unicode plain-text editor for such an application, it would >have wider applicability. (I know I often use the editors of C++ >IDE's to create (ASCII) plain text when I don't want it all gummed up >as a Word or Frame document.) > >Ed Cherlin commented: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. > >I disagree about the last point. A Unicode plain text file consists of >a stream of Unicode characters (and nothing else), interpreted according >to the Unicode standard. It should be marked with an initial U+FEFF (though >technically that is optional). This much is already clear from the standard, >as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal, >unambiguous, plain text formatting consistent with the bidi algorithm. I'm not concerned about where. If the Unicode standard is an acceptable place to do this, I'm in. >The situation is complicated by the two possible byte orders (which is one >reason for the U+FEFF) and by the fact that the most widely implemented >variant, namely that in Windows NT, chose LSB order instead of MSB order. > >But other than that, there is not much more to be said about a Unicode >plain text file. 
The usefulness of the concept lies in its simplicity.
>
>--Ken Whistler

I disagree about the simplicity of the problem. Some of the leading issues are:

  byte order in storage and transmission
  line, paragraph, and page breaks
  BIDI (Hebrew, Arabic, etc.)
  non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)
  multiply accented characters (IPA, math, several human languages)
  math
  compatibility characters
  private use characters
  control codes
  other deprecated characters
  surrogates, especially unpaired surrogate codes
  non-character values
  text processing algorithms (sorting, upper and lower case, pattern matching)

Full portability of data requires some rules. If there is no standard, users of "Unicode text files" will make every possible choice about each of these issues. CRLF will be nothing in comparison. We have begun to see programs that can handle CRLF, CR alone, and LF alone, either line-by-line or in paragraph format, reading and writing in any option. The range of choices for Unicode is far greater, and I don't want to think about how long it would take to achieve unity if we don't do it now.

The process for dealing with byte order is fairly simple in itself, and the standard gives clear conformance requirements. Most of the other issues I listed have thorns, few in some cases, and many in others.

When I was in Korea in the 1960s, telegrams were printed linearly, so Koreans can read this form of their script if they have to. Indic scripts, Ethiopic, and a few others, would require special training to read as separate elements in a straight line. Do we wish to say that users of these scripts can't have text files? Do we say we have to come up with a suitable rendering method for Unicode text files including full BIDI and full character-->glyph composition? Do we say that there should be implementation levels? None of these alternatives is quite satisfactory at present.
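
[Editorial illustration, not part of the original message.] The remark above about programs that accept CRLF, CR alone, and LF alone can be sketched in a few lines of present-day Python. This is only a minimal sketch under stated assumptions: the names LINE_BREAK, normalize_line_breaks, and split_lines are invented for illustration, U+2028 LINE SEPARATOR is added to the recognized set because it is the subject of the thread, and U+2029 PARAGRAPH SEPARATOR and U+000C FORM FEED are deliberately passed through untouched.

```python
import re

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR (left alone by the functions below)

# Line-break conventions named in the thread: CRLF, CR, LF, and U+2028.
# CRLF is listed first so that it is consumed as a single break rather
# than as a CR followed by an LF.
LINE_BREAK = re.compile("\r\n|\r|\n|\u2028")


def normalize_line_breaks(text, newline=LS):
    """Rewrite every recognized line break as `newline`.

    Assumption for this sketch: PARAGRAPH SEPARATOR and FORM FEED are not
    line breaks and are returned unchanged.
    """
    return LINE_BREAK.sub(newline, text)


def split_lines(text):
    """Split text on any recognized line break, dropping the breaks."""
    return LINE_BREAK.split(text)
```

A reader that calls split_lines on input accepts all four conventions; a writer that calls normalize_line_breaks with a single chosen terminator emits only one. Whether the chosen terminator should be LS, LF, or CRLF is exactly the question the thread leaves open.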
-- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 20-May-97 22:11:38-GMT,4132;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA25206 for ; Tue, 20 May 1997 18:11:32 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25784; Tue, 20 May 97 14:49:30 -0700 Message-Id: <9705202149.AA25784@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2655 (1997-05-20 21:49:05 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Tue, 20 May 1997 14:49:03 -0700 (PDT) Subject: Re: Unicode plain text standard? (was Re: Line Separator Character) > >I'm not sure what you're after. I'm mainly concerned about the continued > >viability of files containing only graphic characters, spaces, line breaks, > >paragraph breaks, and formfeeds. Plain, literal text that can contain > >poetry, tables, source code, you name it, and stays like it is. > > I can tell you don't know what table building in Sanskrit is like, and you > don't understand BIDI direction marking. > Not Sanskrit, certainly, but I know a little about Hebrew by virtue of having devoted some time to issues of Hebrew terminal emulation in the plain-text world, and our Kermit terminal emulators (the software we make here) are quite popular in Israel. But yes, one must go through more than a few contortions on one end or the other (or both) to handle BIDI issues in the terminal/host setting, to the extent that Hebrew is (according to my sources) hardly used at all in email. 
The contortions involve generation and interpretation of terminal-specific escape sequences for cursor positioning, reversal of writing direction, character insertion, etc, and of course character-set invocation and designation, all of which obviously add up to something more than plain text. So sure, of course I agree that plain streams of text are not adequate for writing systems that are intrinsically bidirectional (like Hebrew) or for which correct rendering is variable and context-dependent (Indic scripts, etc). (So where, you might ask, is Hebrew terminal emulation used? As far as I know, the major application by far is in library information systems like ALEPH; there are some others, like a Hebrew version of the "vi" editor and more recently, Mule (Multilingual EMACS). At one point some years ago I thought (naively) that the very same mechanisms could be used for Arabic (after all, PCs have an Arabic code page), but in practice, as far as I can tell, no speaker of Arabic would be satisfied with a character-cell representation of Arabic text, because of the way characters must change shape depending on their context (as you point out), which is evidently not an issue in Hebrew (although it might be in Yiddish).) > Having lived in Korea and Japan, and been a mathematician and APL > programmer, I lost all faith in ASCII long ago. > Right -- I wasn't suggesting we all revert to ASCII -- the ability to write text in as many languages as possible is why we're here! I am looking for the option to extend the simplicity (and success) of ASCII to Unicode -- or at least to the large subset of it (as Ken said) that can be used "like ASCII". To me this means the ability to compose a plain-text message containing a certain amount of formatting controls like line breaks, paragraph breaks, and page breaks, that are part of the same code, and without application-specific metacodes (SGML tags, Microsoft Word codes, etc). Let Unicode be able to stand on its own! 
(Of course, also let it be used in other applications -- but that's not the issue.) If additional considerations need to be applied to the world's more complex scripts in order to have a standard universal representation for plain text, to whatever extent the Unicode 2.0 standard does not already suffice, I'm all for it. Let's not repeat the confusing aspects of ASCII -- particularly CRLF/CR/LF semantics, and, as Ed suggests, let's not leave room for this kind of confusion in areas that are new to Unicode. - Frank 21-May-97 0:19:39-GMT,7895;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA14733 for ; Tue, 20 May 1997 20:19:36 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA26328; Tue, 20 May 97 17:02:20 -0700 Message-Id: <9705210002.AA26328@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2656 (1997-05-21 00:01:51 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Tue, 20 May 1997 17:01:49 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) I (Ken) commented: >But other than that, there is not much more to be said about a Unicode >plain text file. The usefulness of the concept lies in its simplicity. And Ed Cherlin responded: > > I disagree about the simplicity of the problem. And now I think I understand where we were miscommunicating. I was speaking of a Unicode plain text *file*, which I thought was the issue. And for that the issue is simple. A Unicode plain text *file* is Unicode plain text in a file (preferably marked with U+FEFF and in MSB byte order). But what Ed is addressing here is the standardization of the meaning of Unicode *plain text*--an issue which should be considered outside instantiation of that plain text in transmissible computer files. On that point I agree that there are a vast number of issues which require specification and standardization. 
And I do believe that the Unicode Standard is the correct place to address many of them. I've made the point before that one of the big differences between ISO/IEC 10646 and the Unicode Standard is that 10646 standardizes the encodings and names of the characters, but that the Unicode Standard goes way beyond that and attempts to provide enough information (some normative and some informative) to enable meaningful and transmissible implementations of Unicode plain text. Below is Ed's list of leading issues. I've interspersed my comments indicating what I think the current Unicode Standard's take is on many of them. (Others may disagree, or may feel that things which are not covered should be.) > Some of the leading issues are: > > byte order in storage and transmission Byte order is addressed by the Unicode Standard. > line, paragraph, and page breaks The Unicode Standard specifies LINE SEPARATOR and PARAGRAPH SEPARATOR, but considers page break to be out of scope. > BIDI (Hebrew, Arabic, etc.) The normative bidi algorithm is specified in great detail in the Unicode Standard. > non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.) The Unicode Standard considers specification of script behavior to be part of the desired content of the standard. It doesn't do an equally detailed accounting of all cases, mostly due to resource and information constraints. But Devanagari and Tamil script handling are provided in significant detail as a guide to Indian script behavior, and there is an extensive discussion of Arabic script shaping behavior. There is a specification of normative behavior for Hangul combining jamo. If we could get equally detailed expert contributions for each complex script, I expect the inclination of the UTC and the editors would be to include them in the standard, for everybody's benefit. 
> multiply accented characters (IPA, math, several human languages)

This is considered an integral part of the Unicode Standard, and is detailed with both normative and informative sections.

> math

There is a definite gap here, though the topic has been a continuing one for the UTC. The consensus seems to be that we would like to get a consistent model of plain text math formula construction stated, to make such information exchangeable in Unicode plain text.

> compatibility characters

These are now completely specified in the Unicode Standard names list.

> private use characters

Also specified by the standard, although the interpretation of particular usages of private use characters is, by definition, out of scope for the standard. But there has been some effort by people to make available specifications of their particular private or corporate private usage repertoires of private use characters.

> control codes

If you mean by this, U+0000..U+001F, U+0080..U+009F and the control chimera U+007F, then the Unicode Standard does provide an answer. It doesn't try to reinvent control function standards, but it says those characters should be interpreted as if they were 16-bit analogues of the 8-bit encodings of the corresponding control functions. Maybe unsatisfying, but probably the best we can expect, given existing control code usage.

> other deprecated characters

There may be room for improvement here, but the Unicode Standard has had to tread a little carefully here. There are political consequences in crying out too loudly that xyz are *deprecated* when xyz may be somebody else's favorite set they lobbied hard to get in!

> surrogates, especially unpaired surrogate codes

Surrogate usage (in general, as opposed to particular encodings for surrogate pairs, none of which exist yet) is fully specified by the Unicode Standard.

> non-character values

As opposed to unassigned character values, there are only two non-character values in Unicode: 0xFFFE and 0xFFFF.
The standard specifies that 0xFFFE is the illegal byte-swapped version of U+FEFF. The use of 0xFFFF is deliberately unspecified and is untransmissible by design.

> text processing algorithms (sorting, upper and lower case, pattern matching)

Default case mapping is provided as an informative part of the Unicode Standard. Language-specific casing is effectively also a part of the standard, since everybody knows the few instances in question: Turkish i, the debatable French accents, German ß, etc., and they are discussed in the standard. Beyond that, sorting, pattern matching, etc. are out of scope of the Unicode Standard (though some implementation guidelines are provided), and, in my opinion, appropriately belong to other standards under development.

>
> Full portability of data requires some rules. If there is no standard,
> users of "Unicode text files" will make every possible choice about each of
> these issues. CRLF will be nothing in comparison. We have begun to see
> programs that can handle CRLF, CR alone, and LF alone, either line-by-line
> or in paragraph format, reading and writing in any option. The range of
> choices for Unicode is far greater, and I don't want to think about how
> long it would take to achieve unity if we don't do it now.

Yes, but...

The goal is interchangeable plain text that is legible when interpreted and rendered in accord with the standard. The goal is not to force everyone to "spell" multilingual text exactly the same way.

The drafters of the Unicode Standard tried to place normative requirements on plain text where failure to do so would lead to complete chaos. Obvious examples are specification that combining marks must follow (not precede) their base character, and specification of the complete bidi algorithm. Failure to specify either of these would clearly have led to uninterpretable gibberish if everyone made up their own rules, and that was clearly understood by the members of the Unicode Technical Committee.
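
[Editorial illustration, not part of the original message.] The rule that combining marks follow their base character can be observed with present-day Python's standard unicodedata module. This is a hedged sketch rather than anything the thread itself describes: normalization form NFC was specified after this discussion took place, and the helper names has_leading_combining_mark and compose are invented for the example.

```python
import unicodedata

base_then_mark = "e\u0301"  # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
mark_then_base = "\u0301e"  # the same two characters in the opposite order


def has_leading_combining_mark(text):
    """Return True if the text begins with a combining mark.

    unicodedata.combining() gives the canonical combining class, which is
    nonzero for marks such as U+0301 and zero for base letters.
    """
    return bool(text) and unicodedata.combining(text[0]) != 0


def compose(text):
    """Apply NFC, which merges a base character with the marks that follow it
    into a precomposed character where one exists."""
    return unicodedata.normalize("NFC", text)
```

With the mark after the base, compose yields the single character U+00E9. With the mark first, there is no preceding base for it to attach to, so the sequence is returned unchanged and has_leading_combining_mark reports True. Software can rely on this only because the ordering is normative.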
But one draws the line somewhere. No one wants to legislate against people, for example, making cross-linguistic puns in text by spelling out Russian words with Latin letters, or any other "inappropriate" or creative usage of the characters at their disposal, once Unicode implementations become more widely available. Half the joy of having universal multilingual text implemented on computers will be seeing what creative and fantastic new inventions millions of users put it to. --Ken Whistler 21-May-97 1:32:55-GMT,2729;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA24596 for ; Tue, 20 May 1997 21:32:53 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA26556; Tue, 20 May 97 18:14:20 -0700 Message-Id: <9705210114.AA26556@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 X-Uml-Sequence: 2657 (1997-05-21 01:13:31 GMT) To: Multiple Recipients of Reply-To: clarkcb@corp.sykes.com From: "Unicode Discussion" Date: Tue, 20 May 1997 18:13:29 -0700 (PDT) Subject: Unicode Plain Text Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id VAA24596 I'm a little confused by this recent thread. I get the feeling that some people think Unicode needs additional features to be useable, whereas I think that the necessary features need to be present in Unicode-supporting applications and fonts. Maybe I'm misunderstanding, but I'll continue anyway. I think maybe the problem is that the definition of "plain text" needs some refining with respect to Unicode. To me, a Unicode plain text file would contain ANY Unicode character. It would be the writer's responsibility (together with an input editor, perhaps) to make sure the file contained the minimum necessary information to render correctly, eg. 
proper placement of directional indicators, etc., and it would in turn be the application's responsibility to render the file in a readable fashion, given the information contained in the file. Keep in mind that even 7-bit ASCII text still must be "rendered" by an editor on the screen.

Also, keep in mind that, according to the Unicode Standard, compliance does not necessarily mean full support. An application might not have bidirectional rendering capabilities, but that does not mean that a Unicode file with a mixture of English and Hebrew/Arabic with directional indicators is not a plain text file.

What makes a plain text file different from any other electronic document, in my opinion, is the lack vs. the presence of "style" information, such as font, font size, margins, etc., and additionally, in the case of SGML instances, procedural markup.

As for usage standards, such as CRLF vs. CR vs. LF vs. LS vs. PS, etc., we have two options:

1. agree on definitive standards now, and support nothing but, or
2. support everything

Now, I have done enough programming to know that supporting more means more headaches, but I still feel that the second option is the better one at this time.

Feedback?

Cary

21-May-97 19:00:12-GMT,5614;000000000001
Return-Path:
Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id OAA04479 for ; Wed, 21 May 1997 14:59:51 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M) id AA29248; Wed, 21 May 97 11:11:19 -0700
Message-Id: <9705211811.AA29248@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2661 (1997-05-21 18:10:29 GMT)
To: Multiple Recipients of
Reply-To: "Pierre Lewis"
From: "Unicode Discussion"
Date: Wed, 21 May 1997 11:10:27 -0700 (PDT)
Subject: Re: Unicode plain-text file

Doug/Mark,

Thanks a lot for your answers. They clarify a lot of things.

> ** This is not consistent with the output on your web page.
To force the
> ** date to be formatted left to right assuming this logical order, you'd
> ** need to force all date characters to L. This can be done either using an LRM
> ** before the first Roman digit, if the digits are roman, or by surrounding
> ** the date with LRO..PDF, if the digits are arabic-indic. Note that LRE
> ** won't work because the reverse solidus, being between two AN, would
> ** still convert to R, instead of L as desired.

I finally had a chance to chat with my Arab friend to whom I owe this short fragment. It is visually correct (on GIF/PS), but my logical ordering was wrong. The logical order is 10\3\90. So it seems that things should automatically fall into place with no extra markup.

It is a reverse solidus. The digits are arabic-indic (U+066x). So the reverse solidus, an ON, stays R as needed by virtue of the ANs being treated as Rs for the purpose of resolving neutrals. Not simple, but effective. That section of the standard really requires careful reading and exploring :-).

> ... So in line 2, the level wouldn't change simply
> ** because of a switch from English to German, since the German
> ** characters would be L. Only LRE or LRO would do that. Since you
> ** don't indicate strong formatting characters, I'd have to assume they
> ** were present to force the levels you indicate.

The levels as shown are what I believe(d) they should be. I didn't include the required BIDI markup, but would assume that the application that outputs the file for this text would include whatever is necessary to achieve this result. So you assumed correctly.

> @@ The standard is pretty clear. Most of those opinions are from people
> @@ who have not read it. Think of these characters in terms of what you
> @@ use in a word processor.
> @@ For Microsoft word or FrontPage, think of LS as the
> @@ character that you get with shift-Return
> @@ (causing no paragraph spacing or indent),
> @@ and PS as what you get with Return.
> @@ (on the Mac, this would be option-Return). Thinking in terms of a word processor is what I'm trying to get away from, because it's not really open. (And I live on Unix :-)) When I open up a file using vi on Unix, I can't tell if this file was created with vi, emacs, pine, ed, sed, awk or whatever. There are still issues (CR/LF/CRLF, TAB, FF placement, top 128 codes) with plain-text ASCII files, but still, it is a very useful concept. Imagine if I had to open mail from user A with vi, from user B with emacs, from user C with pine because that's what each used to write to me. It would be chaos. Unfortunately, if we can't agree on some conventions for plain-text Unicode files, we're going to get into this situation to some extent. Right now, if I want to be as flexible as possible (in an editor, say), I have to deal with 4 new-line conventions (maybe 5): CR, LF, CRLF, LS, maybe NL. I have to deal with various placements of FFs. And I may have to deal with various uses and misuses of some of the new codes. > ** This is a good observation! We believe the current standard is in > ** error and should categorize LS as whitespace instead of as a block > ** separator. I'll consider it changed. > ** That said, the explicit formatting codes are basically intended for static > ** text interchange only. They pose several problems for editing. One is that it > ** is easy to radically alter the text by inserting, copying, or deleting I wouldn't let a user directly input/modify BIDI markup! Rather I'd have him/her tell the editor what a piece of text should look like, then let the editor issue whatever markup is required to achieve this at the time the file is written out. > ** FF is higher-level formatting, you'd have to interpret it separately. > @@ In particular, you would definitely interpret it as a block separator. That's one area where I'd love more guidance from Unicode. 
FF is, I think, a reasonable requirement for plain-text files, so I would have liked Unicode to tell me more about it, or provide a PAS -- page separator. Pierre lew@nortel.ca P.S.1. I was shocked, when I visited the IUC10 Web site, to find HTML pages in Unicode, but no plain-text files. Yes, let Unicode be able to stand on its own (as fdc@watsun.cc.columbia.edu writes)! P.S.2. Btw, one thing I love about "plain-text" files is that they have the best chances of surviving. If I write stuff today that my 3-year old will want to read when he turns 33, my only choice is plain text. To write for him in French, plain-text ASCII (with the Latin1 assumption) is just fine. But if I wanted to add some notes in Greek, Russian or Yiddish, I need more than just the ASCII conventions and Latin1 codepage. P.S.3. Someone in this thread stated that LF was a paragraph separator in Unix. I see it as a line separator. 22-May-97 8:33:13-GMT,1687;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id EAA13940 for ; Thu, 22 May 1997 04:33:12 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01595; Thu, 22 May 97 01:07:55 -0700 Message-Id: <9705220807.AA01595@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2666 (1997-05-22 08:07:03 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:06:53 -0700 (PDT) Subject: Re: Unicode plain-text file >> ** FF is higher-level formatting, you'd have to interpret it separately. >> @@ In particular, you would definitely interpret it as a block separator. No, no, please, no! Whitespace, please, or some new category. FF can come in the middle of a paragraph, or a sentence, or even a word. >That's one area where I'd love more guidance from Unicode. 
FF is, I think, >a reasonable requirement for plain-text files, so I would have liked >Unicode to tell me more about it, or provide a PAS -- page separator. >P.S.3. Someone in this thread stated that LF was a paragraph separator >in Unix. I see it as a line separator. Another good example of the confusion we need to prevent. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 22-May-97 9:31:20-GMT,4440;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id FAA19614 for ; Thu, 22 May 1997 05:31:18 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01475; Thu, 22 May 97 01:04:38 -0700 Message-Id: <9705220804.AA01475@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2663 (1997-05-22 08:03:46 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:03:44 -0700 (PDT) Subject: Unicode plain text standard? (was Re: Line Separator Character) >Oops, never mind -- it was this: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. >> >I'm not sure what you're after. 
I'm mainly concerned about the continued >viability of files containing only graphic characters, spaces, line breaks, >paragraph breaks, and formfeeds. Plain, literal text that can contain >poetry, tables, source code, you name it, and stays like it is. I can tell you don't know what table building in Sanskrit is like, and you don't understand BIDI direction marking. >Pretty much what we have today with 7- and 8-bit plain text, except without >the confusion over CRLF/CR/LF, etc. and the utter incompatibility of the extra 128 characters in the 8-bit sets between PC DOS, PC Windows, Mac, various Unix definitions, and all the other extended ASCII code sets such as PC code pages and the ISO 8859 series. Files of 8-bit characters are extremely non-portable. Having lived in Korea and Japan, and been a mathematician and APL programmer, I lost all faith in ASCII long ago. It is horribly inadequate for English, and more so for almost any other language, except for various computer programming languages and constructed languages like Lojban, which were deliberately built within the limits of ASCII, or in the old days EBCDIC. >I think that what's really valuable about >these files is their self-contained and independent expressiveness -- they >don't need a rendering engine, they don't need any special transport protocol >-- they contain the text and the minimal control information to be transported >and understood universally. >- Frank I agree on the transport protocol in principle, although today we need UTF-7, UTF-8, and other encodings, but the idea of full Unicode text without a rendering engine won't fly. That's fine for simple alphabetic scripts, and even for Chinese and Japanese. It doesn't work right for RTL scripts (Arabic and Hebrew), especially for mixtures of RTL and LTR, and for scripts that combine characters into larger groups, usually syllables. This includes Korean, all of the Indic scripts, Tibetan, and Ethiopic. 
Arabic script has a very large dependence on ligatures, some of them quite complex. There are also problems for rendering math expressions in plain text. Then there are various deprecated characters, the private use areas, and the surrogate character mechanism. Anyone who thought the CRLF business was bad should consider how many incompatible choices can be made in Unicode. Yes, it is true that the Unix file model of a sequence of uninterpreted bytes is very general, and so is a file of uninterpreted 16-bit codes, but files have to be interpreted to be useful. We gloss over the amount of interpretation we do on ASCII text files, but we cannot do that with Unicode. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein Ed Cherlin cherlin@cauce.org Support the anti-Spam amendment Text at Free signature--Inquire within. 22-May-97 10:02:07-GMT,7689;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id GAA23212 for ; Thu, 22 May 1997 06:02:05 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01479; Thu, 22 May 97 01:04:41 -0700 Message-Id: <9705220804.AA01479@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2664 (1997-05-22 08:04:06 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:04:05 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) kenw@sybase.com (Kenneth Whistler) wrote: [snip] >You can still use U+000C FORM FEED in Unicode plain text, and a renderer that >knows about page breaks can do the "right thing", namely whatever it did with >^L for an ASCII text. 
FORM FEED, like HORIZONTAL TAB, was not considered to >be ambiguous enough in usage (unlike CR/LF) to require any separate encoding >in Unicode. > >> In any case, the strong Use-A-GUI thrust of Unicode will make it >>increasingly >> difficult for certain kinds of people to operate in the ways to which they >> have become accustomed over the past decades in which plain text was "good >> enough" save that one could not put lots of languages into it. > >The goal of Unicode plain text is to recapture that portability in the >encoding, but also allow you to put lots of languages into it. The "Use-A-GUI >thrust" of Unicode acknowledges the fact that rendering of complex scripts >(including the Latin script with generative use of combining marks) requires >logic that is much more amenable to implementation in a GUI framework than in >a terminal model. However, appropriate (and very large and useful) subsets of >Unicode *can* be implemented with simple rendering models. (Cf. Windows NT >until very recently. :-) ) > >> I can move this letter to practically any >> other platform and it will still be perfectly legible and printable -- no >> export or import or conversion or version skew to worry about. I think >>a lot >> of people would be perfectly happy to do the same in a plain-text Unicode >> world using plain-text Unicode terminals and printers, if there were such >> things. The Everson Mono fonts would suit such a product admirably, up to a point. >That is exactly what Unicode plain text is all about. And, by the way, >Notepad on Windows NT was pretty close to being a "plain-text Unicode >terminal". > >> The idea that one must embed Unicode in a higher level wrapper (e.g. a >> Microsoft Word document, or even HTML) to make it useful has a certain >> frightening consequence: the loss of any expectancy of longevity for our new >> breed of documents. > >There is absolutely nothing new about this. 
I was warning my linguistic >colleagues about the longevity of their documents when they started using >WordStar back around 82/83. 7-bit ASCII is the only encoding that stayed >stable enough and was widely enough implemented to retain easy >transmissibility >across the computer generations without the intervention of information >archaeologists. Well, 16-bit Unicode plain text is aimed at no less a >goal than being the universal wide-ASCII plain text of the 21st century. > [snip] > >> So let's do our part and make some effort to accommodate traditional >> plain-text applications in Unicode, rather than discourage them :-) > >I agree completely. An excellent example of the appropriate place for >a Unicode plain-text editor would be a Java IDE. If someone writes >a good Unicode plain-text editor for such an application, it would >have wider applicability. (I know I often use the editors of C++ >IDE's to create (ASCII) plain text when I don't want it all gummed up >as a Word or Frame document.) > >Ed Cherlin commented: > >> We want to have a uniform, portable definition of the meaning of a file of >> 16-bit character codes interpreted as Unicode, or "Unicode text file" for >> short. At the same time, we have several uses for such files, where >> different interpretations may be desired. If we want to do this right, I >> think we have to find the appropriate organization for defining such file >> formats and uses, and get down to some serious and at times difficult >> standard making. The Unicode character code standard does not seem to be >> the right place to do this. > >I disagree about the last point. A Unicode plain text file consists of >a stream of Unicode characters (and nothing else), interpreted according >to the Unicode standard. It should be marked with an initial U+FEFF (though >technically that is optional). 
>This much is already clear from the standard,
>as is the usage of LINE SEPARATOR and PARAGRAPH SEPARATOR for minimal,
>unambiguous, plain text formatting consistent with the bidi algorithm.

I'm not concerned about where. If the Unicode standard is an acceptable place
to do this, I'm in.

>The situation is complicated by the two possible byte orders (which is one
>reason for the U+FEFF) and by the fact that the most widely implemented
>variant, namely that in Windows NT, chose LSB order instead of MSB order.
>
>But other than that, there is not much more to be said about a Unicode
>plain text file. The usefulness of the concept lies in its simplicity.
>
>--Ken Whistler

I disagree about the simplicity of the problem. Some of the leading issues are:

    byte order in storage and transmission
    line, paragraph, and page breaks
    BIDI (Hebrew, Arabic, etc.)
    non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.)
    multiply accented characters (IPA, math, several human languages)
    math
    compatibility characters
    private use characters
    control codes
    other deprecated characters
    surrogates, especially unpaired surrogate codes
    non-character values
    text processing algorithms (sorting, upper and lower case, pattern matching)

Full portability of data requires some rules. If there is no standard,
users of "Unicode text files" will make every possible choice about each of
these issues. CRLF will be nothing in comparison. We have begun to see
programs that can handle CRLF, CR alone, and LF alone, either line-by-line
or in paragraph format, reading and writing in any option. The range of
choices for Unicode is far greater, and I don't want to think about how
long it would take to achieve unity if we don't do it now.

The process for dealing with byte order is fairly simple in itself, and the
standard gives clear conformance requirements. Most of the other issues I
listed have thorns, few in some cases, and many in others.
When I was in Korea in the 1960s, telegrams were printed linearly, so Koreans can read this form of their script if they have to. Indic scripts, Ethiopic, and a few others, would require special training to read as separate elements in a straight line. Do we wish to say that users of these scripts can't have text files? Do we say we have to come up with a suitable rendering method for Unicode text files including full BIDI and full character-->glyph composition? Do we say that there should be implementation levels? None of these alternatives is quite satisfactory at present. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein Ed Cherlin cherlin@cauce.org Support the anti-Spam amendment Text at Free signature--Inquire within. 22-May-97 10:24:51-GMT,10338;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id GAA26796 for ; Thu, 22 May 1997 06:24:49 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01571; Thu, 22 May 97 01:07:10 -0700 Message-Id: <9705220807.AA01571@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 2665 (1997-05-22 08:06:33 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Thu, 22 May 1997 01:06:32 -0700 (PDT) Subject: Re: Unicode plain text (Was: Line Separator Character) Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id GAA26796 kenw@sybase.com (Kenneth Whistler), commenting on my previous message, did an admirable job of summarizing the state of the problem of Unicode plain text in terms of what the Unicode standard does and does not cover, and the fact that a standard for use of such files 
must address many more issues. I (Ed) agree with his summary entirely. My added comments here address the issues of function of editors and renderers. >I (Ken) commented: > >>But other than that, there is not much more to be said about a Unicode >>plain text file. The usefulness of the concept lies in its simplicity. > >And Ed Cherlin responded: > >> >> I disagree about the simplicity of the problem. > >And now I think I understand where we were miscommunicating. I was >speaking of a Unicode plain text *file*, which I thought was the >issue. And for that the issue is simple. A Unicode plain text *file* >is Unicode plain text in a file (preferably marked with U+FEFF >and in MSB byte order). > >But what Ed is addressing here is the standardization of the meaning >of Unicode *plain text*--an issue which should be considered outside >instantiation of that plain text in transmissible computer files. >On that point I agree that there are a vast number of issues which >require specification and standardization. And I do believe that the >Unicode Standard is the correct place to address many of them. I've >made the point before that one of the big differences between ISO/IEC >10646 and the Unicode Standard is that 10646 standardizes the encodings >and names of the characters, but that the Unicode Standard goes way >beyond that and attempts to provide enough information (some >normative and some informative) to enable meaningful and transmissible >implementations of Unicode plain text. > >Below is Ed's list of leading issues. I've interspersed my comments >indicating what I think the current Unicode Standard's take is on >many of them. (Others may disagree, or may feel that things which >are not covered should be.) > >> Some of the leading issues are: >> byte order in storage and transmission > >Byte order is addressed by the Unicode Standard. No problem there. We might want to go further and *require* a byte order mark. 
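
The byte order mark convention discussed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the text is 16-bit Unicode as in this thread, an unmarked file is read MSB first (the order Ken describes as preferable), and the function names `sniff_byte_order` and `decode_unicode_text` are hypothetical, not taken from the standard or from any library.

```python
import codecs

def sniff_byte_order(data: bytes) -> str:
    """Return 'big', 'little', or 'unmarked' from a leading U+FEFF."""
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF: MSB first
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE: LSB first
        return "little"
    return "unmarked"

def decode_unicode_text(data: bytes, default_order: str = "big") -> str:
    """Decode 16-bit Unicode text, honouring an initial U+FEFF if present.

    Assumption: an unmarked file uses `default_order`; a reader that
    *required* the mark would raise an error here instead.
    """
    order = sniff_byte_order(data)
    if order == "unmarked":
        order = default_order
    else:
        data = data[2:]  # drop the mark itself; it is not part of the text
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data.decode(codec)
```

A reader built this way sees the byte sequence FF FE as a mark for LSB-first data, which matches the point that 0xFFFE read in the wrong order is never a valid character.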
>> line, paragraph, and page breaks > >The Unicode Standard specifies LINE SEPARATOR and PARAGRAPH SEPARATOR, >but considers page break to be out of scope. That would have to be addressed, because it will be used. >> BIDI (Hebrew, Arabic, etc.) > >The normative bidi algorithm is specified in great detail in >the Unicode Standard. So Unicode text editors should be required to implement it correctly, if they handle BIDI at all. >> non-linear scripts (Indic, Korean, Mongolian, Ethiopian, etc.) > >The Unicode Standard considers specification of script behavior to >be part of the desired content of the standard. It doesn't do an >equally detailed accounting of all cases, mostly due to resource >and information constraints. But Devanagari and Tamil script >handling are provided in significant detail as a guide to Indian >script behavior, and there is an extensive discussion of Arabic >script shaping behavior. There is a specification >of normative behavior for Hangul combining jamo. If we could get >equally detailed expert contributions for each complex script, >I expect the inclination of the UTC and the editors would be to >include them in the standard, for everybody's benefit. That would be a very great improvement. >> multiply accented characters (IPA, math, several human languages) > >This is considered an integral part of the Unicode Standard, and >is detailed with both normative and informative sections. So should it be required in all editors? I think so. >> math > >There is a definite gap here, though the topic has been a continuing >one for the UTC. The consensus seems to be that we would like to >get a consistent model of plain text math formula construction >stated, to make such information exchangeable in Unicode plain text. There has been some good work on this reported at IUC conferences. An option in an editor, for now anyway. >> compatibility characters > >These are now completely specified in the Unicode Standard names list. 
It should be possible to use them, but the user should have to choose to
activate them.

>> private use characters
>
>Also specified by the standard, although the interpretation of
>particular usages of private use characters is, by definition, out
>of scope for the standard. But there has been some effort by people
>to make available specifications of their particular private or
>corporate private usage repertoires of private use characters.

I don't know of any particular behavior that could be required of software,
other than the option of marking them all as unrecognized.

>> control codes
>
>If you mean by this, U+0000 .. U+001F, U+0080..U+009F and the
>control chimera U+007F, then the Unicode Standard does provide
>an answer. It doesn't try to reinvent control function standards,
>but it says those characters should be interpreted as if they
>were 16-bit analogues of the 8-bit encodings of the corresponding
>control functions. Maybe unsatisfying, but probably the best we
>can expect, given existing control code usage.

More precision is required, I think, at least for CR, LF, HT, and FF.

>> other deprecated characters
>
>There may be room for improvement here, but the Unicode Standard
>has had to tread a little carefully here. There are political
>consequences in crying out too loudly that xyz are *deprecated*
>when xyz may be somebody else's favorite set they lobbied hard
>to get in!

We can't just forbid them, certainly.

>> surrogates, especially unpaired surrogate codes
>
>Surrogate usage (in general, as opposed to particular encodings
>for surrogate pairs, none of which exist yet) is fully specified
>by the Unicode Standard.

OK. Unpaired surrogate codes should be marked in some way in rendering
plain text.

>> non-character values
>
>As opposed to unassigned character values, there are only two
>non-character values in Unicode: 0xFFFE and 0xFFFF. The standard
>specifies that 0xFFFE is the illegal byte-swapped version of
>U+FEFF. The use of 0xFFFF is deliberately unspecified and is
>untransmissible by design.

Why do I think someone is going to decide to use it? :(

>> text processing algorithms (sorting, upper and lower case, pattern matching)
>
>Default case mapping is provided as an informative part of the
>Unicode Standard. Language-specific casing is effectively also
>a part of the standard, since everybody knows the few instances
>in question: Turkish i, the debatable French accents, German ß, etc.,
>and they are discussed in the standard.
>
>Beyond that, sorting, pattern matching, etc. are out of scope of
>the Unicode Standard (though some implementation guidelines are
>provided), and, in my opinion, appropriately belong to other standards
>under development.

The question is to some degree whether there is or will be a standard
library of string functions, as there has been in C and C++. Of course I
recognize that there were many such libraries, and perhaps that is
unavoidable.

>> Full portability of data requires some rules. If there is no standard,
>> users of "Unicode text files" will make every possible choice about each of
>> these issues. CRLF will be nothing in comparison. We have begun to see
>> programs that can handle CRLF, CR alone, and LF alone, either line-by-line
>> or in paragraph format, reading and writing in any option. The range of
>> choices for Unicode is far greater, and I don't want to think about how
>> long it would take to achieve unity if we don't do it now.
>
>Yes, but... The goal is interchangeable plain text that is legible
>when interpreted and rendered in accord with the standard. The goal
>is not to force everyone to "spell" multilingual text exactly the
>same way. The drafters of the Unicode Standard tried to place normative
>requirements on plain text where failure to do so would lead to
>complete chaos.
Obvious examples are specification that combining >marks must follow (not precede) their base character, and specification >of the complete bidi algorithm. Failure to specify either of these >would clearly have led to uninterpretable gibberish if everyone >made up their own rules, and that was clearly understood by the >members of the Unicode Technical Committee. I think the best way to discuss this is over some sample texts. I don't know how much time I can put into this, but if I can I will go through the standard and see if I can pick out anything else that might be a problem. >But one draws the line somewhere. No one wants to legislate against >people, for example, making cross-linguistic puns in text by >spelling out Russian words with Latin letters, or any other >"inappropriate" or creative usage of the characters at >their disposal, once Unicode implementations become more widely >available. Half the joy of having universal multilingual text >implemented on computers will be seeing what creative and fantastic >new inventions millions of users put it to. > >--Ken Whistler Think of the smilies we can make. %-] -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. 
http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 22-May-97 22:24:46-GMT,1378;000000000011 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA27950 for ; Thu, 22 May 1997 18:24:43 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA05533; Thu, 22 May 97 13:37:39 -0700 Message-Id: <9705222037.AA05533@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2673 (1997-05-22 20:37:12 GMT) To: Multiple Recipients of Reply-To: "Tony Harminc" From: "Unicode Discussion" Date: Thu, 22 May 1997 13:37:11 -0700 (PDT) Subject: Re: Unicode plain text How do record oriented file systems fit into this discussion ? (Remember those file systems that ruled the world before the UNIX idea of the byte stream came along...) I imagine the short answer is "they don't", and the longer one is something about record oriented files being fine, as long as the semantics of the defined control characters are honoured. What I'm getting at, though, is whether there is anything in the definition of Unicode plain text that disallows such files. Is there a mapping between the out-of-band record markers and Unicode separators ? It seems trivially obvious to map to/from . Or is this something that no one thinks should even be addressed ? 
Tony Harminc

22-May-97 22:26:48-GMT,2034;000000000001
Return-Path:
Received: from unicode.org (unicode.org [192.195.185.2])
    by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA28229
    for ; Thu, 22 May 1997 18:26:46 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
    id AA05800; Thu, 22 May 97 14:34:52 -0700
Message-Id: <9705222134.AA05800@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2674 (1997-05-22 21:34:21 GMT)
To: Multiple Recipients of
Reply-To: Timothy Partridge
From: "Unicode Discussion"
Date: Thu, 22 May 1997 14:34:17 -0700 (PDT)
Subject: Re: Unicode plain-text file

In message <9705220812.AA01704@unicode.org> you recently said:

>
> >> ** FF is higher-level formatting, you'd have to interpret it separately.
> >> @@ In particular, you would definitely interpret it as a block separator.
>
> No, no, please, no! Whitespace, please, or some new category. FF can come
> in the middle of a paragraph, or a sentence, or even a word.

I'm not sure I understand your reasoning. During rendering a page break can
occur anywhere in the same way that a new line may be started anywhere as a
line becomes too full. (I'm using anywhere rather loosely.) Wasn't the
question about *forcing* a page break - surely this wouldn't normally be done
within a paragraph or smaller part. (Or were you thinking of text streams
that have already been formatted by some other process but are now plain
text with line breaks etc. added by where the formatting process felt they
ought to be.)

I feel that adding FF may be part of a slippery slope to pretty text. What
about starting a new column or keeping text together?

Someone else suggested that New Line should just be white space not a block
separator. I don't agree - surely a paragraph is (usually) a new line with
some extra white space added - this implies the semantics should be similar.

Tim

--
Tim Partridge.
Any opinions expressed are mine only and not those of my employer 22-May-97 23:00:17-GMT,2344;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id TAA04612; Thu, 22 May 1997 19:00:12 -0400 (EDT) Date: Thu, 22 May 97 19:00:11 EDT From: Frank da Cruz To: "Tony Harminc" Cc: Multiple Recipients of Subject: Re: Unicode plain text In-Reply-To: Your message of Thu, 22 May 1997 13:37:11 -0700 (PDT) Message-ID: > How do record oriented file systems fit into this discussion ? > (Remember those file systems that ruled the world before the UNIX > idea of the byte stream came along...) > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name two, are still widespread. But VM/CMS and other IBM mainframe and midrange operating systems use EBCDIC text encoding and I am not aware of any movement to support Unicode in this setting, at least not internally. In VMS, most text files are record oriented -- usually variable length records, with end of line *implied* for each record, but not recorded in any particular format. This is actually quite a sensible approach, given the wide variety of text-stream formats that abound for no good reason. In principle, it should be just as possible to fill records with Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. The VMS file system also supports the notion of "carriage control", of which there are many types (like the once-familiar Fortran Hollerith style, in which the first character specified whether the line was to overprint the previous line, appear on the next line, appear 2 lines down, etc, or start on a new page). The carriage control information, again, is separate from the file's data. So again, in principle, there should be no clash with Unicode. In fact, I think a VMS implementation of Unicode text might be an interesting exercise. 
But this too begs the question of how to map Unicode plain text into this
environment, which in turn calls for a Unicode plain-text standard for such
things as page breaks.

And no, I don't think this brings us anywhere near any slippery slopes.
Page breaks have been an integral part of plain text since the 1950s
when we were programming IBM 409 Electric Accounting Machines by
sticking little wires into plugboards.

- Frank

22-May-97 23:59:19-GMT,1434;000000000001
Return-Path:
Received: from unicode.org (unicode.org [192.195.185.2])
    by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA13839
    for ; Thu, 22 May 1997 19:59:18 -0400 (EDT)
Received: by unicode.org (NX5.67g/NX3.0M)
    id AA06394; Thu, 22 May 97 15:59:25 -0700
Message-Id: <9705222259.AA06394@unicode.org>
Errors-To: uni-bounce@unicode.org
X-Uml-Sequence: 2676 (1997-05-22 22:59:12 GMT)
To: Multiple Recipients of
Reply-To: kenw@sybase.com (Kenneth Whistler)
From: "Unicode Discussion"
Date: Thu, 22 May 1997 15:59:10 -0700 (PDT)
Subject: Re: Unicode plain-text file

Tim Partridge wrote:

> Someone else suggested that New Line should just be white space not a block
> separator. I don't agree - surely a paragraph is (usually) a new line with
> some extra white space added - this implies the semantics should be similar.

Please be extra careful here. The suggestion specifically was that
U+2028 LINE SEPARATOR (not NL nor LF functioning as newline) should
be considered WS (a technical category of the bidi algorithm, not
white space as processed, for example in a C preprocessor, or white
space meaning unprinted area on a text page) rather than BS (another
technical category of the bidi algorithm which is used to determine
the boundaries of directional blocks). Cf. pages 3-15 and 3-17 of the
Unicode Standard.
--Ken Whistler 23-May-97 1:27:28-GMT,3553;000000000011 Return-Path: Received: from mail2.microsoft.com (mail2.microsoft.com [131.107.3.42]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA26026 for ; Thu, 22 May 1997 21:27:28 -0400 (EDT) Received: by INET-02-IMC with Internet Mail Service (5.0.1458.30) id ; Thu, 22 May 1997 18:27:29 -0700 Message-ID: <61CDD2C9A961CF11B6A000805FD40AA90368E0AC@RED-84-MSG.dns.microsoft.com> From: Murray Sargent To: "'Frank da Cruz'" Cc: "'unicode@unicode.org'" Subject: RE: Unicode plain text Date: Thu, 22 May 1997 18:27:26 -0700 X-Priority: 3 X-Mailer: Internet Mail Service (5.0.1458.30) I think page breaks given by (0xC) belong in the block separator category and imply an end of paragraph. Page breaks that come in the middle of a paragraph or word should be called _soft_ page breaks much as we have soft line breaks. We could talk about adding an optional page-break analogous to the optional hyphen (0xAD), but computer folklore of the years clearly indicates that shouldn't be overloaded for this purpose. (Off hand, I don't think an optional pagebreak would be a useful code to have, since you'd really like to have the semantic "eject if within n lines of the page bottom." Such a semantic requires the number n, which doesn't fit into a single code position.) Murray > -----Original Message----- > From: Unicode Discussion [SMTP:unicode@unicode.org] > Sent: Thursday, May 22, 1997 4:00 PM > To: Multiple Recipients of > Subject: Re: Unicode plain text > > > How do record oriented file systems fit into this discussion ? > > (Remember those file systems that ruled the world before the UNIX > > idea of the byte stream came along...) > > > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name > two, are still widespread. But VM/CMS and other IBM mainframe > and midrange operating systems use EBCDIC text encoding and I am > not aware of any movement to support Unicode in this setting, > at least not internally. 
> > In VMS, most text files are record oriented -- usually variable > length records, with end of line *implied* for each record, but > not recorded in any particular format. This is actually quite a > sensible approach, given the wide variety of text-stream formats > that abound for no good reason. > > In principle, it should be just as possible to fill records with > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. > > The VMS file system also supports the notion of "carriage control", > of which there are many types (like the once-familiar Fortran > Hollerith style, in which the first character specified whether the > line was to overprint the previous line, appear on the next line, > appear 2 lines down, etc, or start on a new page). The carriage > control information, again, is separate from the file's data. So > again, in principle, there should be no clash with Unicode. > > In fact, I think a VMS implementation of Unicode text might be an > interesting exercise. But this too begs the question of how to > map Unicode plain text into this environment, which in turn calls > for a Unicode plain-text standard for such things as page breaks. > > And no, I don't think this brings us anywhere near any slippery > slopes. > Page breaks have been an integral part of plain text since the 1950s > when we were programming IBM 409 Electric Accounting Machines by > sticking little wires into plugboards. 
> > - Frank 23-May-97 1:28:50-GMT,4054;000000000001 Return-Path: Received: from halon.sybase.com (halon.sybase.com [192.138.151.33]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA26150 for ; Thu, 22 May 1997 21:28:49 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by halon.sybase.com (8.8.4/8.8.4) with SMTP id SAA03968; Thu, 22 May 1997 18:32:06 -0700 (PDT) Received: from birdie.sybase.com by smtp1.sybase.com (4.1/SMI-4.1/SybH3.5-030896) id AA28055; Thu, 22 May 97 18:30:19 PDT Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA23641; Thu, 22 May 1997 18:28:46 -0700 Date: Thu, 22 May 1997 18:28:46 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9705230128.AA23641@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Unicode plain text Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > > > How do record oriented file systems fit into this discussion ? > > (Remember those file systems that ruled the world before the UNIX > > idea of the byte stream came along...) > > [snip] > > In principle, it should be just as possible to fill records with > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. And in practice. The portable Unicode backend library I have written merrily reads and writes Unicode plain text into MVS and VMS filing systems through standard C file interfaces. No problem. I just don't depend on MVS or VMS to provide any specific interpretations of *anything* in those files, nor would I want to, to stay portable. > > The VMS file system also supports the notion of "carriage control", > of which there are many types (like the once-familiar Fortran > Hollerith style, in which the first character specified whether the > line was to overprint the previous line, appear on the next line, > appear 2 lines down, etc, or start on a new page). The carriage > control information, again, is separate from the file's data. 
So > again, in principle, there should be no clash with Unicode. > > In fact, I think a VMS implementation of Unicode text might be an > interesting exercise. Only *interesting* in the sense you mean if you depended on VMS for anything other than basic system services underneath a C library. To be portable, everything else would be built on layers of support libraries independent of VMS. > But this too begs the question of how to > map Unicode plain text into this environment, which in turn calls > for a Unicode plain-text standard for such things as page breaks. I agree with Tim that page breaks are on the slippery slope to pretty text. Pagination is not necessary for legibility of plain text in the same sense that line breaking (forced in some instances) or paragraph breaking (required among other things for bidi directional control) are. Furthermore, since pagination assumes much more about actual rendering devices, forced pagination is as often a source of illegibility. (Think of all those preformatted documents you've seen at one time or another that on your device display or print with one or two lines spilled over to the next page for each forced page.) I suspect that the device dependency of pagination is one of the reasons why HTML doesn't use a built-in concept of page-break on display or FF. > > And no, I don't think this brings us anywhere near any slippery slopes. > Page breaks have been an integral part of plain text since the 1950s > when we were programming IBM 409 Electric Accounting Machines by > sticking little wires into plugboards. Again, think device dependency here. FF used to literally be the electronic control for the "Form Feed" on a particular device. It moved a mechanical device that shoved paper out and new paper in. In modern Page Description Languages such as PostScript, an operator such as showpage is a high-level operation that dumps a frame buffer to a smart raster device. 
Trying to control such operations by embedding an FF control character in plain text is pretty klutzy. --Ken > > - Frank > 23-May-97 4:12:51-GMT,4953;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id AAA17192 for ; Fri, 23 May 1997 00:12:50 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA07465; Thu, 22 May 97 20:50:55 -0700 Message-Id: <9705230350.AA07465@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2682 (1997-05-23 03:50:06 GMT) To: Multiple Recipients of Reply-To: Murray Sargent From: "Unicode Discussion" Date: Thu, 22 May 1997 20:50:05 -0700 (PDT) Subject: RE: Unicode plain text But back in the '60s and early '70s we had line printers (with fixed-width characters) and would ship "plain-text" documents to them preformatted with the desired line and page breaks. Such breaks consisted of hard CRLFs and FFs to control the line printer, and they could appear in the middle of a paragraph or word. Similarly these codes create such breaks on most modern printers. So in this sense, an FF can come in the middle of a paragraph or even a word. But this should be something down at the printer device-driver level. It would be a bad choice for file storage (unless it's a printer file). To date, Unicode has avoided defining control characters except for the TAB and NULL, precisely because there were multiple uses for these characters. The Unicode Standard states that "the others may be interpreted according to ISO/IEC 6429". Nevertheless, Frank's recommendation that Unicode fill in some of the other control-character semantics seems compelling, if only on a recommendation basis. We could, for example, enumerate the most common usages of the control characters CR, LF, VT, and FF in contemporary software. 
Murray > -----Original Message----- > From: Unicode Discussion [SMTP:unicode@unicode.org] > Sent: Thursday, May 22, 1997 6:27 PM > To: Multiple Recipients of > Subject: RE: Unicode plain text > > I think page breaks given by <FF> (0xC) belong in the block separator > category and imply an end of paragraph. Page breaks that come in the > middle of a paragraph or word should be called _soft_ page breaks much > as we have soft line breaks. We could talk about adding an optional > page-break analogous to the optional hyphen (0xAD), but computer > folklore of the years clearly indicates that <FF> shouldn't be > overloaded for this purpose. (Off hand, I don't think an optional > pagebreak would be a useful code to have, since you'd really like to > have the semantic "eject if within n lines of the page bottom." Such a > semantic requires the number n, which doesn't fit into a single code > position.) > > Murray > > > -----Original Message----- > > From: Unicode Discussion [SMTP:unicode@unicode.org] > > Sent: Thursday, May 22, 1997 4:00 PM > > To: Multiple Recipients of > > Subject: Re: Unicode plain text > > > > > How do record oriented file systems fit into this discussion ? > > > (Remember those file systems that ruled the world before the UNIX > > > idea of the byte stream came along...) > > > > > They are far from dead; IBM VM/CMS and Digital (Open)VMS, to name > > two, are still widespread. But VM/CMS and other IBM mainframe > > and midrange operating systems use EBCDIC text encoding and I am > > not aware of any movement to support Unicode in this setting, > > at least not internally. > > > > In VMS, most text files are record oriented -- usually variable > > length records, with end of line *implied* for each record, but > > not recorded in any particular format. This is actually quite a > > sensible approach, given the wide variety of text-stream formats > > that abound for no good reason.
> > > > In principle, it should be just as possible to fill records with > > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. > > > > The VMS file system also supports the notion of "carriage control", > > of which there are many types (like the once-familiar Fortran > > Hollerith style, in which the first character specified whether the > > line was to overprint the previous line, appear on the next line, > > appear 2 lines down, etc, or start on a new page). The carriage > > control information, again, is separate from the file's data. So > > again, in principle, there should be no clash with Unicode. > > > > In fact, I think a VMS implementation of Unicode text might be an > > interesting exercise. But this too begs the question of how to > > map Unicode plain text into this environment, which in turn calls > > for a Unicode plain-text standard for such things as page breaks. > > > > And no, I don't think this brings us anywhere near any slippery > > slopes. > > Page breaks have been an integral part of plain text since the 1950s > > when we were programming IBM 409 Electric Accounting Machines by > > sticking little wires into plugboards. > > > > - Frank 23-May-97 14:50:25-GMT,5993;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id KAA08237; Fri, 23 May 1997 10:50:06 -0400 (EDT) Date: Fri, 23 May 97 10:50:06 EDT From: Frank da Cruz To: Murray Sargent Cc: "'unicode@unicode.org'" Subject: RE: Unicode plain text In-Reply-To: Your message of Thu, 22 May 1997 18:27:26 -0700 Message-ID: Murray Sargent wrote: > I think page breaks given by <FF> (0xC) belong in the block separator > category and imply an end of paragraph. Page breaks that come in the > middle of a paragraph or word should be called _soft_ page breaks much > as we have soft line breaks. ... > This is GUI thinking. Think "plain text", no rendering engines. <FF> is a hard, unconditional page break.
Think of running off monthly paychecks on your lineprinter, or addressing envelopes (and spelling peoples' names correctly in hundreds of languages -- imagine that!). kenw@sybase.com (Kenneth Whistler) wrote: > > In principle, it should be just as possible to fill records with > > Unicode as it is to fill them with ASCII, Latin-1, or JIS X 0208. > > And in practice. The portable Unicode backend library I have > written merrily reads and writes Unicode plain text into MVS and > VMS filing systems through standard C file interfaces. No problem. > I just don't depend on MVS or VMS to provide any specific interpretations > of *anything* in those files, nor would I want to, to stay portable. > It's funny how the pendulum swings. Back in the old days we didn't even have file systems, just boxes of cards. Then we developed complex file systems based on punched-card ideas (look at your old OS/360 JCL manual). Then we reacted against all of that complexity and said "a file is just a stream of bytes" with imbedded control information. Now the simplicity of the stream approach is coming back to bite us because of all the differing interpretations of the imbedded controls, since no standard was ever set for their use in files. Now we see that there is something to be said for keeping the control information out of band -- it makes it really simple to change coding systems. But anybody who has ever done VMS Record Management System programming knows that the price is complexity and loss of portability. You can't just "copy" a VMS file to DOS or UNIX, you have to "export" it from the file system and convert its record information to the appropriate stream format. Nor can you run an RMS program on a non-VMS system. If we had it all to do over again -- and we do -- we could retain the simplicity of the stream model without the confusion by precisely defining a set of controls that may be imbedded, as we have done for LS and PS. 
This will allow for both portable data AND portable software. > I agree with Tim that page breaks are on the slippery slope to pretty > text. Pagination is not necessary for legibility of plain text in > the same sense that line breaking (forced in some instances) or > paragraph breaking (required among other things for bidi directional > control) are. Furthermore, since pagination assumes much more > about actual rendering devices, forced pagination is as often a > source of illegibility. (Think of all those preformatted documents > you've seen at one time or another that on your device display or print > with one or two lines spilled over to the next page for each forced > page.) I suspect that the device dependency of pagination is one > of the reasons why HTML doesn't use a built-in concept of page-break > on display or FF. > This is all true, but that does not mean there should be no such thing as a forced page break. Paychecks. Envelopes. Like any tool, a hard page break can be used for good or evil. It's not the tool's fault. > Again, think device dependency here. FF used to literally be the > electronic control for the "Form Feed" on a particular device. It > moved a mechanical device that shoved paper out and new paper in. > Yes, we still do these things. Murray Sargent said: > > But back in the '60s and early '70s we had line printers (with > fixed-width characters) and would ship "plain-text" documents to them > preformatted with the desired line and page breaks. Such breaks > consisted of hard CRLFs and FFs to control the line printer, and they > could appear in the middle of a paragraph or word. Similarly these > codes create such breaks on most modern printers. So in this sense, an > FF can come in the middle of a paragraph or even a word. But this > should be something down at the printer device-driver level. It would > be a bad choice for file storage (unless it's a printer file). 
> Again, printer files are common practice, and they are not sent only to printers. They are also viewed on terminals, "straight no chaser" or in a text editor, and they are shipped around among diverse platforms. There is no reason to try to stamp out this practice. It has its legitimate uses. > To date, Unicode has avoided defining control characters except for the > TAB and NULL, precisely because there were multiple uses for these > characters. The Unicode Standard states that "the others may be > interpreted according to ISO/IEC 6429". > I agree that ASCII and ISO 6429 control characters are a mess, and that is why it is important to precisely define a minimal set for use in Unicode plain text. This might be done by defining semantics for the existing C0 and C1 control characters, or by adding new ones. This will not only make Unicode able to stand on its own, but it will allow export and import of fancy text between incompatible GUI applications. And it will provide a Common Intermediate Representation for plain text that can last for decades, while the corporations slug it out in the marketplace over their three-letter acronyms du jour. - Frank 24-May-97 0:29:40-GMT,1048;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA16623 for ; Fri, 23 May 1997 20:29:39 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA10499; Fri, 23 May 97 16:59:08 -0700 Message-Id: <9705232359.AA10499@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2686 (1997-05-23 23:58:54 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Fri, 23 May 1997 16:58:53 -0700 (PDT) Subject: Re: Unicode plain text In message "Re: Unicode plain text", 'fdc@watsun.cc.columbia.edu' writes: > And no, I don't think this brings us anywhere near any slippery slopes.
> Page breaks have been an integral part of plain text since the 1950s > when we were programming IBM 409 Electric Accounting Machines by > sticking little wires into plugboards. I have to agree. Don't RFCs all come with FFs in them? Pierre 25-May-97 7:08:25-GMT,2860;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id DAA02704 for ; Sun, 25 May 1997 03:08:24 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA13201; Sat, 24 May 97 23:43:12 -0700 Message-Id: <9705250643.AA13201@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 2689 (1997-05-25 06:42:40 GMT) To: Multiple Recipients of Reply-To: Edward Cherlin From: "Unicode Discussion" Date: Sat, 24 May 1997 23:42:37 -0700 (PDT) Subject: Re: Unicode plain text Timothy Partridge wrote: >We seem to have two different requirements for plain text here. >Now my assumption was that we would mostly want to use one type, whereas >there seems to be a strong demand for another. At the risk of teaching >you all to suck eggs I will contrast and compare them at some length. >I hope you will find a useful point or two. This is exactly what I was trying to get at in earlier messages. I would say that there are other requirements in other cases, and it would be worth our while to make a stab at enumerating them so we have some idea of what we are talking about. Here are some of the common uses of "plain text", each having a different purpose and different constraints: E-mail Printer command files--ASCII, PostScript Source code--programming, SGML, HTML, TeX Encoded binaries--UUencode, UTF-7 Transfer formats--RTF, APL Workspace Interchange Archiving Portability Database Application file formats Constraints on line length vary widely. 
I have seen database files with lines of nearly 1000 characters, and of course there is the theorem that any computable function can be expressed in one line of APL. :-) Other constraints will also vary widely. We must allow for this variation, and only specify what we have to. >First the type I had assumed as the default. >I would call this logical formatting. [snip] > The second type I would call physical formatting. [snip] The snipped analysis was quite good, although a few points might be argued. One of the best points is that we can require a certain competence from a Unicode renderer. The implementor can decide which character ranges to support, but having done that must support certain features in the way specified in the standard. This mechanism can be extended to cover some of the requirements of various text file usages. -- Edward Cherlin Help outlaw Spam Everything should be made Vice President http://www.cauce.org as simple as possible, NewbieNet, Inc. 1000 members and counting __but no simpler__. http://www.newbie.net/ 17 May 97 Attributed to Albert Einstein 25-May-97 15:45:55-GMT,4499;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id LAA23257 for ; Sun, 25 May 1997 11:45:54 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA13742; Sun, 25 May 97 08:01:25 -0700 Message-Id: <9705251501.AA13742@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2691 (1997-05-25 15:01:09 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Sun, 25 May 1997 08:01:07 -0700 (PDT) Subject: Re: Unicode plain text In message "Re: Unicode plain text", 'timpart@perdix.demon.co.uk' writes: > We seem to have two different requirements for plain text here. > ... > > First the type I had assumed as the default. > I would call this logical formatting. > ... 
This first type (usually the result of "save as text" from some WP) always causes me trouble and I usually have to reformat it before I can do anything with it (such as printing it). > The second type I would call physical formatting. > The text has already been formatted by the author into lines and > paragraphs... I think the second type is by far the most common and is what I consider to be plain text: o It's the format of all RFCs, perhaps the most widely-read plain-text files around, o It's the format of the vast majority of email and Usenet posts I read (but I do see some type 1 stuff), o It's the format of much e-documentation that comes with many S/W (eg. linux, TeX (at least installation), X.11, ...), o It's the natural format of all a2ps (ascii-to-postscript) converters I've come across, and (last but not least) o It's the format chosen by project Gutenberg, the wonderful collection of English texts. I have a dream here, of a multi-lingual project Gutenberg with classics in various languages, and, of course, in plain-text Unicode.... (URL: ftp://uiarchive.cso.uiuc.edu/pub/etext/ ) I'd be really curious to see how one would express RFC2070, on "Internationalization of the Hypertext Markup Language", as a type 1 plain-text file (for those looking for a challenge: type 2 plain-text file of this RFC is at: http://ds.internic.net/rfc/rfc2070.txt). Of course, type 2 means some assumptions. > * The author knows exactly how many characters fit on a line. (Often > there is also the assumption that each character is fixed width.) True enough, and that may break down somewhat with ideograms (surely one can't fit 80 of those on a line). But, in general, staying under 80 chars will give a plain-text file that most can print. I rarely have trouble printing a plain-text file of this second type. And I think this will work with a lot of scripts, eg. Russian, Greek, Hebrew, Arabic. > * The author knows exactly how many lines fit on a page. 
Most plain-text files have no FFs, but when they do (as RFCs do), it's not too difficult to be conservative so that again most folks can print them with no problem. I don't see FFs as being on the slippery slope to pretty text. Besides their use in RFCs (so the TOC can be paginated), they're also often used to separate "chapters". For example, I'll save all the posts on the current threads, and I'll probably put an FF between each one so that, if/when I print the whole thing, I'll get each post to start on a new page. > * The author knows in which sequence the characters in a line will > be printed. (Usually assumes left to right without any reordering.) That's where it gets interesting (and why I had a few questions a few days ago). The only ordering possible within the plain-text Unicode file is of course logical. So that means a bit more intelligence in the a2ps conversion or in the display engines. Or, in despair, such a file could be put thru a filter that would reorder it into visual ordering for local consumption. In summary, notwithstanding some difficulties, I still think a plain-text Unicode file of the second type above makes perfect sense and would be very useful. I'm still not too sure how exactly I would encode it (wrt controls), but this thread has been quite helpful. Btw, this type 1 vs type 2 is a very useful distinction, and I think therein lies the source of much confusion in the current threads. Pierre lew@nortel.ca P.S. It's probable that my view of things is somewhat colored by my Unix bigotry. But still... 
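The two habits described in the message above, keeping lines under 80 characters and putting an FF between saved posts so that each one prints on its own page, can be sketched in a few lines of Python. This is only an illustration, not something posted to the list: the function names `join_posts` and `overlong_lines` and the default limit of 79 are assumptions, and `len()` counts code points rather than display columns, so ideographs and combining marks would need more care than this gives them.

```python
FF = "\f"  # U+000C FORM FEED, treated here as a hard page break


def join_posts(posts):
    """Concatenate posts with an FF between them (hypothetical helper).

    Each post is normalised to end in exactly one LF, so the FF always
    sits at the start of a line.
    """
    return FF.join(post.rstrip("\n") + "\n" for post in posts)


def overlong_lines(text, limit=79):
    """Return (page, line, length) for every line longer than `limit`.

    Pages are delimited by FF and lines by LF. str.split is used instead
    of str.splitlines because splitlines also breaks on FF, U+2028 and
    U+2029, which would blur the page and line counts.
    """
    hits = []
    for page_no, page in enumerate(text.split(FF), start=1):
        for line_no, line in enumerate(page.split("\n"), start=1):
            if len(line) > limit:
                hits.append((page_no, line_no, len(line)))
    return hits


if __name__ == "__main__":
    saved = join_posts(["First post.\n", "Second post, no trailing newline"])
    print(repr(saved))
    print(overlong_lines(saved))
```

A check like `overlong_lines` reports problems without rewrapping anything, which keeps the author's hard line breaks untouched.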
25-May-97 23:42:24-GMT,3079;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA17171 for ; Sun, 25 May 1997 19:42:23 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA14407; Sun, 25 May 97 16:23:13 -0700 Message-Id: <9705252323.AA14407@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2692 (1997-05-25 23:22:41 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Sun, 25 May 1997 16:22:39 -0700 (PDT) Subject: RE: Unicode plain text In message "RE: Unicode plain text", Murray writes: > The preformatted plain text works OK as long as you have no plans to > modify it. If you want to edit it, then you have to worry about > reflowing the lines ... Most decent plain-text editors have facilities for that. > ... But even much older software was adept at formatting text. > E.g., troff and TeX have been around for years and do beautiful jobs of > formatting text. Of course, so does HTML today. But none of that is plain text, troff, TeX and HTML require some processing intelligence that may no longer be around in 30 years. That may not be available everywhere. Is there a specification somewhere that tells me how type 1 plain text (using Tim's terminology again for a moment) will be formatted for display and printing? Will things such as the following be dealt with properly? This is a recursive bulleted list. o Bullet one, a very long line..... that folds: - subbullet one a, another long line.... that folds; - a second subbullet o Bullet two. Can I rely on this intelligence to always yield something that reflects my intentions? With recursive bullet lists? With tables. Etc. Ah, maybe that's what some folks mean when they ask for a standard for plain text in Unicode?! 
Or am I not more likely to see things such as what your email software did to my original post: > > o It's the format of all RFCs, perhaps the most widely-read > > plain-text > > files around, The middle line got folded, but the software didn't realize it was a bulleted list :-) > Within the Microsoft email system, we use rich text ... Well I hope you won't send me such, as I won't know what to do with it. Is it HTML-like markup? Of course rich text can be nice, but only if everyone has it. The nice thing about plain text *is* that everyone has it by default. But I think that applies only to type 2, ie. plain text with hard line breaks, ie. preformatted. The big advantage I see of the type 2 plain text (with hard line breaks) is that it requires *no* intelligence to render correctly. Well Unicode requires BIDI I guess (and let's hope that won't change in the next 30 years). But otherwise, just adjust to line length convention (by choosing a decent point size) and you're in business. No reliance on some S/W to do some undefined reformatting and hope it won't misrepresent your intentions. Pierre 26-May-97 12:40:12-GMT,2068;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id IAA10853 for ; Mon, 26 May 1997 08:40:11 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA15892; Mon, 26 May 97 05:16:43 -0700 Message-Id: <9705261216.AA15892@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2696 (1997-05-26 12:16:14 GMT) To: Multiple Recipients of Reply-To: "Martin J. Duerst" From: "Unicode Discussion" Date: Mon, 26 May 1997 05:16:12 -0700 (PDT) Subject: Re: Unicode plain text On Mon, 26 May 1997, Otto Stolz wrote: > On May 24, 11:04, Timothy Partridge wrote: > > We seem to have two different requirements for plain text here. > ... > > The text has already been formatted by the author into lines and > > paragraphs.
(Just as I have done with this e-mail. [...] > > Since NL usually does not denote any logical division in the text > > it is extremely annoying if the BiDi algorithm treats it as a new > > block. > > On the contrary, it is annoying if it doesn't -- see below. The example you give doesn't apply. Independently of whether LS is a block separator or treated as whitespace, there will never be any text part B a line higher than a text part A when logically, text part A is before text part B. This is the very basic principle of the BIDI algorithm. What is affected by the decision whether LS is a block separator or treated as whitespace is whether bidirectional embedding and override codes are terminated (at the block boundary) or not. As long as you don't have any of these, the only effect may be that in the absence of any other convention, the first character of a block defines the block's base directionality. Thus if LS is a block separator, you risk that the second part of the paragraph has a different base directionality than the first. Regards, Martin. 26-May-97 15:26:43-GMT,4060;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id LAA01862; Mon, 26 May 1997 11:26:38 -0400 (EDT) Date: Mon, 26 May 97 11:26:38 EDT From: Frank da Cruz To: Timothy Partridge Cc: Multiple Recipients of Subject: Re: Unicode plain text In-Reply-To: Your message of Sat, 24 May 1997 11:04:00 -0700 (PDT) Message-ID: > We seem to have two different requirements for plain text here. > Now my assumption was that we would mostly want to use one type, whereas > there seems to be a strong demand for another. > ... > First the type I had assumed as the default. > I would call this logical formatting. > > Paragraph Separator is most commonly used. Text usually runs on without > any control characters until a new paragraph is needed.
Since this > is logical formatting the author does not know or care whether a > paragraph is indicated by a completely blank line or a new line is > started with an indent or some other convention. > I suppose this is, indeed, a form of plain text, but I would call it "input for a text formatter", not text to be used and viewed on its own as it stands. It is a degenerate case of a larger class, e.g. input for TeX, Scribe, Troff, IPFC, SGML, or HTML (for text formatting). It is only in the last few years that I began to receive "long-line" text in email, and I can only suppose that it was generated by some sort of editor that does its own word wrapping during input, but does not send the line breaks on the mistaken assumption that every email client in the world is (or should be) also a text formatter. [The second type of plain text...] > The assumptions behind this explicit approach include: > * The text will go straight to a printer that is not very bright. > * The author knows exactly how many characters fit on a line. (Often > there is also the assumption that each character is fixed width.) > * The author knows exactly how many lines fit on a page. > * The author knows in which sequence the characters in a line will > be printed. (Usually assumes left to right without any reordering.) > Right -- this is the kind people have been using for more decades than many of us have been alive. It does not deserve the bad rap. Of course we all find it irritating when the composer of such text assumes wider or longer pages than we have, but that is not a reason to abolish this, the most common form of plain text -- in fact, it is all the more reason to set standards for its use. "Standard lines are so wide; standard pages are so long", etc. Such standards tend to be set of their own volition, e.g. among e-mail and netnews users, where recipients of badly formatted messages tend to take it on themselves to educate the senders as to common practice.
Ideally, preformatted plain text can also be fed into your favorite rendering engine to produce the effect that most pleases your eye, and indeed we have been doing this sort of thing for decades with many formatters. I grant that automatic recognition of nested bullet lists or meticulously formatted tables might be a stretch, but it is certainly not difficult to treat blank lines as paragraph separators, and otherwise to ignore line breaks when reformatting prose such as this. But once any kind of markup ("this is a table", "this is a bullet list", "this is a section of preformatted text") is introduced, our plain text becomes "input for a text formatter". Incidentally, another form of plain text is "output from a text formatter", which often has been hyphenated. Such text is an end result, not intended for further processing. I think that living in a world of email has demonstrated the value of plain text, at least to most people. The lesson is that this is the only text form that can be sent without prior prearrangement with any reasonable expectation that it will be readable at its destination. - Frank 26-May-97 15:48:20-GMT,2862;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id LAA06491 for ; Mon, 26 May 1997 11:48:19 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA16506; Mon, 26 May 97 07:38:18 -0700 Message-Id: <9705261438.AA16506@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 X-Uml-Sequence: 2698 (1997-05-26 14:37:52 GMT) To: Multiple Recipients of Reply-To: Otto Stolz From: "Unicode Discussion" Date: Mon, 26 May 1997 07:37:50 -0700 (PDT) Subject: Rare Writing Directions Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id LAA06491 Some scripts are neither left-to-right, nor right-to-left. 1. 
Mongolian is written top-to-bottom; Japanese and Chinese used to be written this way, the lines were stacked right-to-left. Recently, somebody (sorry, I haven't kept that note) has said that mixing Latin with Japanese was impossible, hence modern Japanese is written left-to-right. However, there is a way to mix top-to-bottom with horizontally written scripts: about twenty years ago I have seen a book in Japanese, written top-to-bottom, with German proper, and place, names imbedded. These were also written top-to-bottom, with the glyphs rotated by 90 degrees; so you could turn the book counter- clockwise to read these names, in the usual way. This embedding method would also work with left-to-right phrases in Mongolian text. For right-to-left scripts, you would have to turn the glyphs the other way. I think, it would be useful to have this method described in a forthcoming Unicode standard. 2. Some old scripts (Greek, Latin, Hittite, Runes) were used to write boustropheda. A boustrophedon runs back and forth like a ploughing ox (thence the name), i.e. the lines are written, alternatingly, left-to-right and right-to-left. As Unicode will adopt the Runes alphabet (or rather: fuþark), it would probably be useful to have boustrophedon-markers akin to the existing LEFT-TO-RIGHT MARK and its siblings, U+200E .. U+200F and U+202A .. U+202E. These markers could be used to mark plain, logically formatted, Unicode text. (To mark physically formatted text, you could probably use the OVERRIDE characters, U+202D and U+202E.) Also a normative boustrophedon algorithm, akin to the existing bidi algorithm would probably be nice to have. I guess, this algorithm could be much simpler than the bidi algorithm, as the boustrophedon feature will apply only to whole paragraphs (it is more like a layout style, which does not have to allow for intrinsic character features). Opinions? Am I wrong, again?
Best wishes, Otto Stolz 26-May-97 16:18:23-GMT,1285;000000000011 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id MAA10524 for ; Mon, 26 May 1997 12:18:22 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA16649; Mon, 26 May 97 08:21:01 -0700 Message-Id: <9705261521.AA16649@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2699 (1997-05-26 15:20:37 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Mon, 26 May 1997 08:20:35 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In message "Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'Otto.Stolz@uni-konstanz.de' writes: > You'll find the German project Gutenberg (in German, of course), under > . The format > is currently HTML, in ISO 8859-1 encoding. Thanks for the pointer, I don't think I had it. Well done (just had a look at Max and Moritz). HTML certainly is an interesting alternative to plain text because it is so universal (and, hopefully, with a stable foundation). And it allows to include illustrations, annotations, &c. Pierre 26-May-97 16:43:54-GMT,2364;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id MAA15283; Mon, 26 May 1997 12:42:51 -0400 (EDT) Date: Mon, 26 May 97 12:42:51 EDT From: Frank da Cruz To: "Pierre Lewis" Cc: Multiple Recipients of Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In-Reply-To: Your message of Mon, 26 May 1997 08:20:35 -0700 (PDT) Message-ID: > HTML certainly is an interesting alternative to plain text because it > is so universal (and, hopefully, with a stable foundation). And it > allows to include illustrations, annotations, &c. > There is an infinite number of alternatives to plain text. Anybody, anywhere can make up whatever such alternatives they like -- and they do. 
HTML is controlled by Netscape and Microsoft, and changes every five minutes as each attempts to outdo and undercut the other. Plain text is an interesting alternative to HTML because nobody controls it but "just us chickens", and it alone stands a chance of surviving year after year, decade after decade, as the corporate giants pull the rug out from each other (and us) on a weekly basis, with their proclamations of ever more complex proprietary "standards" with which we all must "comply". This is not to say that a simple and stable form of HTML -- say 1.0, but augmented by some minimally adequate method of coping with character sets -- is not a suitable method for publishing literary classics on the Web -- after all, this is the sort of thing the Web was originally designed for, lest we forget... But this is not to say that even a stable form of HTML could be thought of as a replacement for plain text. My printer does not render HTML; my email client is not a Web browser. My text editor is not an HTML authoring system. My C compiler does not compile HTML. My Telnet client does not interpret HTML. And perhaps most important, the incomprehensibly enormous corpus of existing plain-text information does not need to be converted to HTML or anything else (except perhaps Unicode plain text), especially since any such requirement would leave most of it behind, and even that which was deemed worthy of conversion would become obsolete as soon as HTML is replaced by the next thing. 
- Frank 26-May-97 18:29:49-GMT,2166;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id OAA01077 for ; Mon, 26 May 1997 14:29:48 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA17491; Mon, 26 May 97 10:23:48 -0700 Message-Id: <9705261723.AA17491@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2706 (1997-05-26 17:23:31 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Mon, 26 May 1997 10:23:30 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'fdc@watsun.cc.columbia.edu' writes: > ... HTML is > controlled by Netscape and Microsoft, and changes every five minutes as each > attempts to outdo and undercut the other. I thought at least some baseline HTML came from more neutral bodies than these two corporations?! Of course, HTML is an acceptable alternative to plain text *only* if it is corporation-neutral, widespread, and reasonably stable. I certainly wouldn't agree to any MSIE-or NN-specific extensions being used in the texts offered by these projects, but this specific site is quite legible with lynx, so I assume it doesn't use too many fancy features. > Plain text is an interesting alternative to HTML because nobody controls it > but "just us chickens", and it alone stands a chance of surviving year after > year, decade after decade, ... Well put. > This is not to say that a simple and stable form of HTML -- say 1.0, but > augmented by some minimally adequate method of coping with character sets -- Since the German Gutenberg project uses latin 1 (the HTML default), they don't even need any extensions over HTML 1.0. > ... My printer does not render HTML; my email > client is not a Web browser. ... Same here. Still, browsers are getting pretty common, so for a project Gutenberg, it's probably a reasonable choice. 
Pierre 26-May-97 19:25:43-GMT,4512;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id PAA10016 for ; Mon, 26 May 1997 15:25:41 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA17805; Mon, 26 May 97 11:35:09 -0700 Message-Id: <9705261835.AA17805@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2707 (1997-05-26 18:34:53 GMT) To: Multiple Recipients of Reply-To: Timothy Partridge From: "Unicode Discussion" Date: Mon, 26 May 1997 11:34:50 -0700 (PDT) Subject: Re: Unicode plain text Pierre Lewis recently said: > This first type (usually the result of "save as text" from some WP) > always causes me trouble and I usually have to reformat it before I can > do anything with it (such as printing it). In my opinion a Unicode renderer should cope with this automatically and divide paragraphs up into lines for you. This is mostly because of the intelligence of the BiDi algorithm. What you won't get is page headers and footers and page numbers since there is no way to specify them in Unicode plain text. Is there general agreement that text that is only split into paragraphs should be rendered properly by a Unicode engine? I.e. it is acceptable as plain text. > I think the second type is by far the most common and is what I > consider to be plain text: > > o It's the format of all RFCs, perhaps the most widely-read plain-text > files around, [snip] > o It's the format chosen by project Gutenberg, the wonderful collection > of English texts. I have a dream here, of a multi-lingual project > Gutenberg with classics in various languages, and, of course, in > plain-text Unicode.... 
> > (URL: ftp://uiarchive.cso.uiuc.edu/pub/etext/ ) > > I'd be really curious to see how one would express RFC2070, on > "Internationalization of the Hypertext Markup Language", as a type 1 > plain-text file (for those looking for a challenge: type 2 plain-text > file of this RFC is at: http://ds.internic.net/rfc/rfc2070.txt). Can I have the original source please! I suspect that documents like this have been prepared in some markup language and sent through something like troff. > Of course, type 2 means some assumptions. > > > * The author knows exactly how many characters fit on a line. (Often > > there is also the assumption that each character is fixed width.) > > True enough, and that may break down somewhat with ideograms (surely > one can't fit 80 of those on a line). But, in general, staying under 80 > chars will give a plain-text file that most can print. I rarely have > trouble printing a plain-text file of this second type. And I think this > will work with a lot of scripts, eg. Russian, Greek, Hebrew, Arabic. I'm not so sure that fixed width Arabic will look good but the general point holds. But should I need to fiddle with point sizes if Unicode renderers will accept type 1 text. Type 2 text is very common. And it is the published form. In some cases the original marked up text will have been lost. Where it hasn't a Unicode type 1 style plain text file could be produced from the original. I dug out some troff documentation and it says that the plain text output is a representation that is an approximation to the printed page. I suggest that much of the type 2 text is in this form, i.e. Formatting *including* BiDi has already been carried out. Does anyone have examples of mixed direction text in RFC style format that could confirm this? I think that for type 2 physical format files Unicode rendering is *too* intelligent and would scramble the preformatted lines if they contained BiDi text. 
(As well as getting horribly confused by the NLs which presumably have been converted to Line Separator.) I would propose a new control code - Disable BiDirectional Processing which would switch off BiDi altogether. It could be used with physical format files so that they come out as intended. (There needs to be an Enable code as well.) I'll also allow you a Page Separator. This would be treated as a block separator by BiDi and would cause a new page to be started. The introduction of a new control code would mean that existing text that uses the current standard would work in the same way, but additional control could be given to text that needs it. Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer 27-May-97 14:26:28-GMT,1209;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id KAA28296 for ; Tue, 27 May 1997 10:26:27 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA21401; Tue, 27 May 97 06:34:30 -0700 Message-Id: <9705271334.AA21401@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2718 (1997-05-27 13:33:37 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Tue, 27 May 1997 06:33:36 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) With waivering faith I wrote: :-) > HTML certainly is an interesting alternative to plain text because it > is so universal (and, hopefully, with a stable foundation). And it > allows to include illustrations, annotations, &c. Coincidently, I was reading last nite (ironically, in "iX", a German magazine) about XML (eXtensible Markup Language) which, says the article, could replace (in the mid term) HTML as the lingua franca of the Web. So much for that idea... Es lebe plain text! 
(long live ~) Pierre 27-May-97 17:30:54-GMT,2490;000000000011 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id NAA03642 for ; Tue, 27 May 1997 13:30:52 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA22032; Tue, 27 May 97 09:17:26 -0700 Message-Id: <9705271617.AA22032@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2720 (1997-05-27 16:16:36 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 09:16:34 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) On Tue, 27 May 1997, Pierre Lewis > With waivering faith I wrote: > :-) > > > HTML certainly is an interesting alternative to plain text because it > > is so universal (and, hopefully, with a stable foundation). And it > > allows to include illustrations, annotations, &c. > > Coincidently, I was reading last nite (ironically, in "iX", a German > magazine) about XML (eXtensible Markup Language) which, says the > article, could replace (in the mid term) HTML as the lingua franca of > the Web. So much for that idea... Both HTML and XML rest on a very stable foundation: SGML. The Unicode standard defers quite a number of things to "higher level protocols". SGML is just such a protocol; XML represents a profile of the SGML standard that makes writing processing applications a lot easier. If you invest a lot of energy building a document system around HTML, you will be SOL when HTML falls out of fashion. If you spend the same energy building a document system on the SGML foundation, you can automatically deal with HTML and all its variants, XML, or whatever the next fad is. Real SGML tools are polymorphic. > Es lebe plain text! (long live ~) I find this a tragic position. Before Unicode, the common denominator for cross-platform data transfer was 7 bit ASCII.
Unicode charged ahead to raise the common denominator but statements like this essentially say that the common denominator should go no further. This is counter to the spirit that inspired Unicode and counter to the standard itself which explicitly defers a number of important dimensions of text processing to higher level protocols. Plain text is simply not an option for most anyone serious about their documents. -john 27-May-97 18:42:14-GMT,2820;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id OAA23125; Tue, 27 May 1997 14:40:39 -0400 (EDT) Date: Tue, 27 May 97 14:40:38 EDT From: Frank da Cruz To: John Fieber Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In-Reply-To: Your message of Tue, 27 May 1997 09:16:34 -0700 (PDT) Message-ID: > > Es lebe plain text! (long live ~) > > I find this a tragic position. Before unicode, the common > denominator for cross-platform data transfer was 7 bit ASCII. > Unicode charged ahead to raise the common denominator but > statements like this essentially say that the common denominator > should go no further. This is counter to the spirit that > inspired Unicode and counter to the standard itself which > explicitly defers a number of important dimensions of text > processing to higher level protocols. > But that is to say that Unicode is useless except in combination with a higher level protocol over which it has no control. I have absolutely no faith in any higher level protocol. They come into fashion and then exit ignominiously with astounding speed. So perhaps the need for plain text is "tragic" (so too would be the fact that many citizens of earth do not possess high-end bit-mapped rendering engines, let alone sufficient food to eat), but it is nonetheless real. I think a lot of Unicoders have little idea what the real world is like. They know it is populated by people who speak many languages written in diverse writing systems, which is a step forward. 
But they don't pay much attention to the "low tech" computer-related components of everyday life -- not only in the less "developed" countries, but even in the rich ones. They seem to believe that the only use for computers any more is Web browsing and composition of glossy (multilingual) sales brochures. Try to remember all the real work that computers are doing every day in hidden places: medical and laboratory equipment, manufacturing equipment, telecommunications equipment, traffic control, POS, EDI, etc. Case in point: the imbedded microprocessors and microcontrollers whose interface to the outside world is a lowly serial port, and which have only a few K available for their control program. Countless millions of them, chosen precisely for their low cost. Now, isn't it our goal for Unicode to become, eventually, the world's one-and-only character set? Good! Then let's not lock out the low end. Let's see if we can't separate the concept of character set from the *necessity* for higher (and lower) level protocols and the need for a high-end rendering engine. (Sure, use them if you want, but that's a totally separate issue.) - Frank 27-May-97 21:05:57-GMT,2979;000000000001 Return-Path: Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA00789 for ; Tue, 27 May 1997 17:05:54 -0400 (EDT) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id QAA01376 for ; Tue, 27 May 1997 16:05:45 -0500 (EST) Date: Tue, 27 May 1997 16:05:44 -0500 (EST) From: John Fieber To: Frank da Cruz Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Tue, 27 May 1997, Frank da Cruz wrote: > > > Es lebe plain text! (long live ~) > > > > I find this a tragic position. 
Before unicode, the common > > denominator for cross-platform data transfer was 7 bit ASCII. > > Unicode charged ahead to raise the common denominator but > > statements like this essentially say that the common denominator > > should go no further. This is counter to the spirit that > > inspired Unicode and counter to the standard itself which > > explicitly defers a number of important dimensions of text > > processing to higher level protocols. > > > But that is to say that Unicode is useless except in combination > with a higher level protocol over which it has no control. I never said and most certainly did not mean to imply that Unicode is "useless" without higher level protocols. That proposition is absurd. > I have absolutely no faith in any higher level protocol. They come > into fashion and then exit ignominiously with astounding speed. Your opinion does not change the fact that a great many applications would be impossible without higher level protocols, transient or otherwise. (I'd hardly describe SGML as transient though--it dates back to the 1960s and has continuously gained in popularity ever since with no sign of fading in the future.) [statements about the real world] > Now, isn't it our goal for Unicode to become, eventually, the world's > one-and-only character set? Good! Then let's not lock out the low > end. Let's see if we can't separate the concept of character set > from the *necessity* for higher (and lower) level protocols and the > need for a high-end rendering engine. ...but I never said anything about a monolithic standard including low and high level protocols! I'd be the first to say it would be a Bad Idea for exactly the reasons you cite. I would also add that separation is critical because different applications may need different high level protocols. SGML works great for publishing-type applications, but it certainly is not an answer to every text processing application.
-john 27-May-97 21:19:35-GMT,3042;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA03062 for ; Tue, 27 May 1997 17:19:31 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA23523; Tue, 27 May 97 13:04:30 -0700 Message-Id: <9705272004.AA23523@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2726 (1997-05-27 20:04:01 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 13:03:59 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) On Tue, 27 May 1997, Marion Gunn > The general consensus there was that > incipient XML was being very heavily pushed as an alternative to html by > SUN and MICROSOFT in collaboration Sun has been actively involved in the development of XML, so their position is no surprise. Lately Microsoft has been jumping on the "standards" bandwagon (witness the ditching of WINS for DNS, adoption of Kerberos, etc.) and a move to XML in particular represents taking a distinctly different direction than Netscape, whose founder has publicly stated that SGML is stupid--a position I firmly believe will only hasten Netscape's death if it persists. > (as an alternative which would eliminate > markup language altogether from the actual text to be transferred). This is nonsensical. In the world of HTML, you have a fixed set of tags you can use in your documents, and you must assume that the browser knows how to do something sensible with them (not always safe). With XML, or SGML for that matter, your document gets marked up using tags appropriate for the data being marked up. The document gets sent to the browser along with a style sheet so that the browser can do something sensible when it encounters the markup. 
This allows for (a) more concise and precise markup of the document and (b) more precise control over the ultimate rendering by the browser. The push for XML represents a "back to the roots" movement. The basic premise of SGML is that it is impossible to define a markup language that is both general and precise. Thus, SGML is a meta-language; a language for defining markup languages. At a technical level, SGML standardizes parsing--how to distinguish markup from data. HTML is just a single markup language defined in terms of SGML. However, the promotion of HTML as a universal exchange format is fundamentally at odds with the spirit of SGML. A problem with using SGML in a web environment is the complexity of the software required to implement the parsing rules. Enter XML. XML basically does away with numerous non-essential features of SGML that complicate parsing, things like tag omission and minimization, shortrefs and the like. XML also raises the compliance bar on character encoding from 7 bit ASCII to Unicode. -john 27-May-97 21:23:11-GMT,2405;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id RAA03727 for ; Tue, 27 May 1997 17:23:10 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA23596; Tue, 27 May 97 13:08:43 -0700 Message-Id: <9705272008.AA23596@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2727 (1997-05-27 20:08:24 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Tue, 27 May 1997 13:08:22 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'jfieber@indiana.edu' writes: > > Es lebe plain text! (long live ~) > > I find this a tragic position. Before unicode, the common > denominator for cross-platform data transfer was 7 bit ASCII. First, don't take anything I write too literally. 
I make available most of my project documentation in HTML. So I'm not religious about these things. The above is not an exclusive statement. HTML serves a most useful purpose and I'm not saying to ban it! Second, Unicode is something more or less orthogonal to the notion of plain text. So I don't really understand your comment above. Plain text does not mean 7-bit ASCII. It could just as well mean UTF-8 Unicode. Third, for all the great things that can be said for SGML, HTML, XML, and ML, it still remains that plain text is the most portable format, the simplest to deal with (on all platforms), and the only one that is likely to be legible in 30 years. For some things, it's still the best solution. > Plain text is simply not an option for most anyone serious about > their documents. That depends on the purpose. For example, I'm writing some biographical notes on myself (how pretentious can one get :-)?) so my son will know a bit about me should I leave early. I can't think of a better medium for that than plain text (Latin 1 here). Surely not some WP that will be so badly out of style by the time he gets to read the stuff (he doesn't talk yet)... And look at a typical novel. Plain text is all that's required to capture it. Marketing glossies are another matter of course. And so is most technical documentation. Anyway, getting off topic again! 
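The point that plain text need not mean 7-bit ASCII can be made concrete with a small Python sketch. The helper name below is hypothetical and the sample strings are only illustrations. It shows that pure ASCII text is byte-for-byte identical when encoded as UTF-8, while a Latin 1 character such as "ü" takes one byte in Latin 1 and two bytes in UTF-8.

```python
def encoded_sizes(text):
    """Return the encoded length in bytes of text in Latin 1 and in UTF-8."""
    return {name: len(text.encode(name)) for name in ("latin-1", "utf-8")}


if __name__ == "__main__":
    ascii_text = "Es lebe plain text!"
    # ASCII-only text yields the same bytes under ASCII and UTF-8.
    print(ascii_text.encode("ascii") == ascii_text.encode("utf-8"))
    # A non-ASCII Latin 1 letter needs two bytes in UTF-8.
    print(encoded_sizes("für"))
```

Either encoding remains plain text in the sense discussed here; only the byte representation of the non-ASCII characters differs.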
Pierre 27-May-97 22:15:20-GMT,2216;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id SAA20653 for ; Tue, 27 May 1997 18:15:18 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA24108; Tue, 27 May 97 14:24:35 -0700 Message-Id: <9705272124.AA24108@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2729 (1997-05-27 21:24:18 GMT) To: Multiple Recipients of Reply-To: Timothy Partridge From: "Unicode Discussion" Date: Tue, 27 May 1997 14:24:16 -0700 (PDT) Subject: Re: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) Pierre Lewis recently said: > With waivering faith I wrote: > :-) > > > HTML certainly is an interesting alternative to plain text because it > > is so universal (and, hopefully, with a stable foundation). And it > > allows to include illustrations, annotations, &c. > > Coincidently, I was reading last nite (ironically, in "iX", a German > magazine) about XML (eXtensible Markup Language) which, says the > article, could replace (in the mid term) HTML as the lingua franca of > the Web. So much for that idea... > > Es lebe plain text! (long live ~) And what about the Standard Generalised Markup Language (SGML)? This has been around for ages. It lets you define a set of markup tags and then use them. HTML is a particular set of SGML tags and the SGML definition of HTML (the DTD) is available from W3. If you are writing text in HTML I would strongly recommend that you put a DTD version declaration at the top. e.g. which is English with HTML 3.2 markup. Then syntax check the HTML with a SGML parser to make sure it conforms. Finally keep a copy of the DTD somewhere safe along with a copy of the matching HTML standard so that future generations can always understand your text. (The copy of 3.2 that I have is about 12K in size.) You might want a copy of the SGML standard too - I don't know where to get a machine readable copy from. 
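A minimal sketch of the first step of that advice, in Python: checking whether a file carries a DTD version declaration at all. It uses the standard library's html.parser, which is a lenient HTML parser and not an SGML parser, so it only finds the declaration and does not validate the document against the DTD. The class and function names are hypothetical, and the HTML 3.2 public identifier in the usage example is shown as an illustration, not as the declaration quoted in the message above.

```python
from html.parser import HTMLParser


class DoctypeFinder(HTMLParser):
    """Remember the first DOCTYPE declaration seen in a document."""

    def __init__(self):
        super().__init__()
        self.doctype = None

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">".
        if self.doctype is None and decl.lower().startswith("doctype"):
            self.doctype = decl


def find_doctype(html_text):
    """Return the DOCTYPE declaration text, or None if there is none."""
    finder = DoctypeFinder()
    finder.feed(html_text)
    finder.close()
    return finder.doctype


if __name__ == "__main__":
    sample = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">'
              "<html><body><p>Hello</p></body></html>")
    print(find_doctype(sample))
```

Checking conformance to the DTD itself would still need a real SGML parser, as the message recommends.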
Tim -- Tim Partridge. Any opinions expressed are mine only and not those of my employer 28-May-97 0:12:27-GMT,2766;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id UAA10920 for ; Tue, 27 May 1997 20:12:26 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA24725; Tue, 27 May 97 16:51:28 -0700 Message-Id: <9705272351.AA24725@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2731 (1997-05-27 23:51:00 GMT) To: Multiple Recipients of Reply-To: kenw@sybase.com (Kenneth Whistler) From: "Unicode Discussion" Date: Tue, 27 May 1997 16:50:58 -0700 (PDT) Subject: Unstable foundations and wavering faith > With waivering faith I wrote: > :-) > > > HTML certainly is an interesting alternative to plain text because it > > is so universal (and, hopefully, with a stable foundation). > > Es lebe plain text! (long live ~) It is no accident that Silicon Valley thrives in Earthquake country. But while everything seems to be in constant turmoil, and yesterday's hot new item is today's trash -- try to take the long view. 1. The Information Technology industry is still in its adolescent phase (no longer its infancy, certainly), but maturing rapidly. As industrial technology matures, it tends to stabilize into well- understood, efficient patterns, with competition for innovations just fizzing around the edges. Handling of multilingual text as part of the general problem of automated information technology is still in ferment, but we can see the beginnings of the crystallizations of well-understood, accepted ways of dealing with the issues on computers. 2. Unicode is laying the (firm, we hope) foundation for plain text representation through the next century--perhaps longer. In any case, like ASCII, it should last long enough to gain the lustrous, comfortable patina of trusted age. 
Just as my nieces now find it hard to conceive of a political age before Ronald Reagan, people just being introduced to computer science and programming in Java will find it hard to conceive of character sets before Unicode. --Ken (Color me rosy) Whistler P.S. For those who, like me, worry that all electronic data not in plain text (and ASCII plain text at that) is in constant danger of disappearing into the enormous historical bit bucket of undecipherable formats using undecipherable encodings on obsolete media, consider the following: Perhaps the greatest source of information loss in the longrun was the shift by the publishing industry to use of cheap high-acid papers early in this century. Ask librarians about the conditions of their pre-War collections (my nieces just asked, "The Gulf war?") of books. Or how about all the nitrate movie film stock collapsing into dust? 28-May-97 1:36:52-GMT,2584;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id VAA24050 for ; Tue, 27 May 1997 21:36:49 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25007; Tue, 27 May 97 18:18:07 -0700 Message-Id: <9705280118.AA25007@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2733 (1997-05-28 01:17:40 GMT) To: Multiple Recipients of Reply-To: Giles S Martin From: "Unicode Discussion" Date: Tue, 27 May 1997 18:17:35 -0700 (PDT) Subject: Re: Unstable foundations and wavering faith It's getting a little off-topic, but ... . Arguably the single event causing the greatest information loss was the destruction of the library at Alexandria, which broke countless links in chains of transmission of unique manuscripts. 
Acid paper and nitrate film have destroyed lots of copies, but most information of any significance produced in this era has been reproduced in lots of copies, and procographically recopied at a trivial cost compared to the cost of copying a manuscript by hand (which is why there were so many unique copies in Alexandria).

Giles

Giles Martin
Quality Control Section
University of Newcastle Libraries
New South Wales, Australia
E-mail: ulgsm@dewey.newcastle.edu.au
Phone: +61 49 215 828 (International)
Fax: +61 49 215 833 (International)

The web of our life is of a mingled yarn, good and ill together
-- All's Well That Ends Well, IV.iii.98-99

On Tue, 27 May 1997, Kenneth Whistler wrote:

> P.S. For those who, like me, worry that all electronic data
> not in plain text (and ASCII plain text at that) is in constant
> danger of disappearing into the enormous historical bit bucket
> of undecipherable formats using undecipherable encodings on
> obsolete media, consider the following: Perhaps the greatest source
> of information loss in the longrun was the shift by the publishing
> industry to use of cheap high-acid papers early in this century.
> Ask librarians about the conditions of their pre-War collections
> (my nieces just asked, "The Gulf war?") of books. Or how about
> all the nitrate movie film stock collapsing into dust?
28-May-97 2:32:39-GMT,3210;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id WAA01024 for ; Tue, 27 May 1997 22:32:38 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25197; Tue, 27 May 97 18:53:00 -0700 Message-Id: <9705280153.AA25197@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2734 (1997-05-28 01:52:43 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 18:52:41 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) On Tue, 27 May 1997, Pierre Lewis > In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", > 'jfieber@indiana.edu' writes: > > > > Es lebe plain text! (long live ~) > > > > I find this a tragic position. Before unicode, the common > > denominator for cross-platform data transfer was 7 bit ASCII. > > Second, Unicode is something more or less orthogonal to the notion of > plain text. So I don't really understand your comment above. Plain text > does not mean 7-bit ASCII. It could just as well mean UTF-8 Unicode. From other replies I've received I guess I wasn't clear about my point. Within the domain of "plain text" Unicode is doing a lot to raise the common denominator. This is great, but a sentiment has been expressed in this thread that higher level protocols are a hopeless mess and if you want portability, stick with plain text. In the near term that may be a reality but Unicode was born out of frustration with the existing mess of character encoding standards and a determination to make things better. I was simply making the observation that swearing off high level protocols because they are messy now seems very out of character with the spirit of Unicode.
To clarify another posting, I did not say or mean to imply that higher level protocols should be addressed by the Unicode standard. That would be a Bad Thing for numerous reasons I'm sure you can all figure out. > Third, for all the great things that can be said for SGML, HTML, XML, > and ML, it still remains that plain text is the most portable > format, the simplest to deal with (on all platforms), and the only one > that is likely to be legible in 30 years. For some things, it's still > the best solution. Explain to me how SGML is less portable than plain text? If you don't have something that understand the tags, any reasonable text editor can strip them out leaving you with plain text. You don't need anything fancier than a text editor to create and view SGML documents. You are no *worse* off using SGML than you would be using plain text, but chances are good that you will be better off. In 30 years, SGML will still be legible because, unlike other markup schemes, it is a public standard not bound to a particular transient software product. This is why you find SGML in places like the aircraft industry where documents have active lifespans longer than most software companies. -john 28-May-97 3:24:59-GMT,2803;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA06051 for ; Tue, 27 May 1997 23:24:58 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25358; Tue, 27 May 97 19:22:40 -0700 Message-Id: <9705280222.AA25358@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 2736 (1997-05-28 02:21:41 GMT) To: Multiple Recipients of Reply-To: John Fieber From: "Unicode Discussion" Date: Tue, 27 May 1997 19:21:39 -0700 (PDT) Subject: Re: Unstable foundations and wavering faith On Tue, 27 May 1997, Unicode Discussion wrote: > --Ken (Color me rosy) Whistler > > P.S. 
For those who, like me, worry that all electronic data > not in plain text (and ASCII plain text at that) is in constant > danger of disappearing into the enormous historical bit bucket > of undecipherable formats using undecipherable encodings on > obsolete media, consider the following: Perhaps the greatest source > of information loss in the long run was the shift by the publishing > industry to use of cheap high-acid papers early in this century. > Ask librarians about the conditions of their pre-War collections No need to worry about electronic data disappearing in the future; it has been disappearing for quite some time now thanks to being stored on flaky or obsolete media, or in undocumented data formats of long-extinct software. In a former life as a librarian, I spent quite a bit of time dealing with electronic data sneaking into the library inside the back covers of books and in other ways. Librarians have been fretting over digital data for some time now. Unlike computer scientists, we have been through the preservation thing many times. It is true, a book published in the 1700s is as good as new (okay, I exaggerate a bit...) while relatively recent publications turn to dust thanks to cheap paper. Most of the computer science literature has been published after the "acid incident" so as a discipline, they tend to be blissfully ignorant of the event. The problem is not really that data isn't in plain text format, although that is sometimes helpful, but that the formats are (a) not documented and (b) there are way too many of them. Even if they were documented, condition (b) makes it too expensive to deal with unless it is *really* important data. SGML makes a serious attack on both problems. I just hope the marriage of SGML and Unicode in the form of XML is successful in bringing portable, durable documents to the masses. Then continue ironing out the storage media quirks and librarians will be happy.
:) -john 28-May-97 3:40:48-GMT,4183;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id XAA09453 for ; Tue, 27 May 1997 23:40:48 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA25292; Tue, 27 May 97 19:17:48 -0700 Message-Id: <9705280217.AA25292@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="ISO-8859-1" X-Uml-Sequence: 2735 (1997-05-28 02:17:31 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Tue, 27 May 1997 19:17:30 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id XAA09453 In message "re:Multi-Lingual Project Gutenberg (was: Unicode plain text)", 'jfieber@indiana.edu' writes: > I was simply making the observation that swearing off high level > protocols because they are messy now seems very out of character > with the spirit of Unicode. I don't see them as messy, just as short-lived. I don't perceive HTML as messy, quite the opposite (notwithstanding frequent abuse by authors such as using tags to get bold/bigger), but I don't expect to still use it in 30 years. For my part, I'm not swearing off high level protocols, but I think a very good point can be made for plain text, and I had a few questions I wished clarified wrt Unicode. That's all. > Explain to me how SGML is less portable than plain text? If you > don't have something that understand the tags, any reasonable > text editor can strip them out leaving you with plain text. I don't know SGML, but let's try the exercise with an HTML page I wrote (chosen randomly amongst the ones I can show outside): HTML source

Connecting an HP LaserJet 5M at home

By Pierre Lewis (aka téléLew).

This short page provides some notes on using an HP LaserJet 5M connected to a home setup. If you have comments or encounter problems, don't hesitate to call me (x8207).

The description is specific to the HP LaserJet 5M. Some useful information can also be found on the page about connecting a LaserWriter II NTX to a home NCD.

Basic connectivity

  • The normal way to connect the LJ5M to your home setup is via the Ethernet port. This requires some kind of hub to interconnect the Gandalf box, the NCD

Same with tags stripped (almost illegible: headings, bullets gone)

Connecting an HP LaserJet 5M at home By Pierre Lewis (aka téléLew). This short page provides some notes on using an HP LaserJet 5M connected to a home setup. If you have comments or encounter problems, don't hesitate to call me (x8207). The description is specific to the HP LaserJet 5M. Some useful information can also be found on the page about connecting a LaserWriter II NTX to a home NCD. Basic connectivity The normal way to connect the LJ5M to your home setup is via the Ethernet port. This requires some kind of hub to interconnect the Gandalf box, the NCD

Same as a decent plain text file (formatted by lynx -- Tim's type 2)

Connecting an HP LaserJet 5M at home

_By Pierre Lewis (aka téléLew)._

This short page provides some notes on using an HP LaserJet 5M connected to a home setup. If you have comments or encounter problems, don't hesitate to call me (x8207).

The description is specific to the HP LaserJet 5M. Some useful information can also be found on the page about [1]connecting a LaserWriter II NTX to a home NCD.

Basic connectivity

  * The normal way to connect the LJ5M to your home setup is via the Ethernet port. This requires some kind of hub to interconnect the Gandalf box, the NCD ...

References

  1. file://localhost/tmp/lw2ntx.html

Wonder what the SGML version of above would look like.
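[Editorial sketch, not part of the original message.] The tag-stripping exercise above can be reproduced roughly as follows. This is a minimal sketch using Python's standard html.parser module; the sample fragment and the names TagStripper and strip_tags are made up for illustration and are not the page quoted in the message.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only character data, discarding every tag."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    # Join the pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())


# Hypothetical fragment: a heading followed by a one-item bullet list.
sample = "<h1>Basic connectivity</h1><ul><li>Use the Ethernet port.</li></ul>"
print(strip_tags(sample))
```

The heading and the bullet run together in the output, which is the loss of structure the message complains about.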
Pierre 28-May-97 13:18:29-GMT,1533;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id JAA13596 for ; Wed, 28 May 1997 09:18:28 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA26845; Wed, 28 May 97 05:56:52 -0700 Message-Id: <9705281256.AA26845@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2737 (1997-05-28 12:56:14 GMT) To: Multiple Recipients of Reply-To: Kent Karlsson From: "Unicode Discussion" Date: Wed, 28 May 1997 05:56:12 -0700 (PDT) Subject: SGML (Was: Re: Multi-Lingual Project Gutenberg (was: Unicode plain text)) Hi! Sorry for asking a maybe trivial question (and for getting a bit off-track): > > which is English with HTML 3.2 markup What "in English"? English markup or English "proper text"? I could imagine (though there is none now) HTML 3.2 markup in, say, Swedish. But are you saying that if the "proper text" of the document is in, say, Swedish, I should write at the top, even if the markup is "in English"? (I thought that the "EN" meant that the **markup** is based on English words.) And language attributes are to become a part of HTML, suitable also for multilingual "proper texts"... (Sorry, I don't know SGML.) 
/kent k 28-May-97 15:48:53-GMT,3968;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id LAA15268 for ; Wed, 28 May 1997 11:48:47 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA27338; Wed, 28 May 97 07:27:59 -0700 Message-Id: <9705281427.AA27338@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2741 (1997-05-28 14:27:34 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Wed, 28 May 1997 07:27:32 -0700 (PDT) Subject: re:Multi-Lingual Project Gutenberg (was: Unicode plain text) > From other replies I've received I guess I wasn't clear about my > point. Within the domain of "plain text" Unicode is doing a lot > to raise the common denominator. This is great, but a sentiment > has been expressed in this thread that higher level protocols are > a hopeless mess and if you want portability, stick with plain > text. In the near term that may be a reality but Unicode was > born out of frustration with the existing mess of character > encoding standards and a determination to make things better. > > I was simply making the observation that swearing off high level > protocols because they are messy now seems very out of character > with the spirit of Unicode. > Nobody advocates stamping out higher level protocols, even if that were possible. We all use them all the time. I, for one, use them with my eyes open -- i.e. with full knowledge that all the work I put into creating a "rich" document will need to be done again at some point when the current "standard" for richness has been replaced by a new one if I want the document to survive. And again. And again. I remember the excitement when it first became possible to produce typeset-quality documents with Troff, R, DSR, Scribe, TeX, and their relatives. 
But I also continued to produce plain-text "documents" on a daily basis: email; netnews; computer programs in assembly language, Sail, Simula, C, Fortran, Pascal, PL/I, etc; online documentation that had to be portable to hundreds of platforms; plain-text record-oriented databases -- mailing lists for example. There is no reason for most of this sort of information to be "rich", and no reason that this type of work should not continue in Unicode. What is needed is emphatic allowance and support for Unicode plain text in the Unicode standard, i.e. a precise and thorough definition of what constitutes a self-contained preformatted plain-text document. This is primarily a matter of adopting a small but complete set of control codes needed for line breaks, paragraph breaks, page breaks, and direction control (most of these are already there), and a clear statement of the role of the "traditional" control characters at U+0000 - U+001F, U+007F, and U+0080 - U+009F. And outside the scope of the Unicode standard is the problem of properly tagging files in the file system. This has never been done right, on any operating system. The use of the "extension" (the part of the name after the dot, e.g. "DOC") is just plain silly, especially now that GUI-based operating systems are using this to associate applications with files -- click on a data file, launch the associated application on that file. What's silly about it is that anybody can name a file any way they please and there is no registration authority for extensions; conflicts inevitably arise -- sometimes with disastrous consequences. Even sillier is the idea that each file must belong to one and only one application. Plain text files can be used by many applications, but how do we mark them as being written in Unicode? Or Latin-1? Or JIS X 0208, etc. Ideally there should be information in the directory entry to specify the file type and encoding. That's an issue for each OS maker, but one whose resolution is long overdue.
- Frank 28-May-97 23:08:37-GMT,7018;000000000001 Return-Path: Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA12580 for ; Wed, 28 May 1997 19:08:34 -0400 (EDT) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id QAA04102; Wed, 28 May 1997 16:14:11 -0500 (EST) Date: Wed, 28 May 1997 16:14:10 -0500 (EST) From: John Fieber Reply-To: John Fieber To: Frank da Cruz cc: Multiple Recipients of Subject: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Wed, 28 May 1997, Frank da Cruz wrote: > Nobody advocates stamping out higher level protocols, even if that were > possible. We all use them all the time. I, for one, use them with my > eyes open -- i.e. with full knowledge that all the work I put into > creating a "rich" document will need to be done again at some point when > the current "standard" for richness has been replaced by a new one if I > want the document to survive. And again. And again. > > I remember the excitement when it first became possible to produce > typeset-quality documents with Troff, R, DSR, Scribe, TeX, and their > relatives. The transient nature of these markup languages is not a trait of markup languages, but a product of having a one-to-one relationship between the markup language and a specific piece of application software. TeX files go with TeX, troff files go with troff, Scribe files go with Scribe, WordPerfect files go with WordPerfect, MS-Word files go with MS-Word. If the application falls out of favor, it takes its markup language and data with it. Exactly the same thing happens if you depend on software that uses its own unique character encoding, or the glyph encoding of some oddball font. 
It is precisely this fatal one-to-one markup/application relationship that SGML is targeted at. SGML is a very different beast and it is a mistake to throw it in with the rest. Claiming that SGML is just another transient markup language that doesn't address document portability is similar to saying that Unicode is just another transient character encoding scheme that doesn't address multilingual computing. Absurd? Of course. > But I also continued to produce plain-text "documents" on a > daily basis: email; netnews; computer programs in assembly language, > Sail, Simula, C, Fortran, Pascal, PL/I, etc; I think we differ on the notion of "plain text" and "markup". Let's see. In email for example, what is the difference between this markup: From: jfieber@indiana.edu To: Whoever@somewhere Subject: la de da blah blah blah blah... and this markup: jfieber@indiana.edu Whoever@somewhere la de da blah blah blah blah... Semantically identical. Furthermore, the correct delivery of mail depends critically on markup, as does netnews. However you delimit it, it is still markup. Same for the computer languages. What are braces, semicolons, parentheses, and comment delimiters in C if not markup to guide the compiler in parsing the program? Incidentally, most computer languages could be expressed in SGML markup (although the utility would be dubious). Unlike other markup languages, SGML makes no assumptions about the processing application. SGML merely provides a standard way for an application to distinguish markup from data. This allows SGML to be used as a foundation for a much broader range of applications and helps ensure a long life. On the other hand, as you may guess, SGML is not a complete solution--if typesetting is your domain, for example, you will still need some software to do the layout of your data (TeX works quite well)--but SGML serves to protect your data from dependencies on specific applications. That protection facilitates exchange between applications.
In one case you feed your document to a typesetter, in another case you feed it to a database, in a third case, an on-line document viewer. Portability between applications extrapolates to portability across time. HTML may be out of fashion in 20 years, but any SGML compliant application can still process it even if the designers never heard of HTML. (You might have to make up a style sheet, but that is orders of magnitude easier than the digital archaeology required to re-invent, say troff, from a couple of sample documents. SGML documents come with their own Rosetta stone--the DTD, or document type definition.) In an SGML world, the data drives the application, not the other way around as is the status quo currently. That is the fundamental shift that sets SGML apart from the other markup languages cited here as examples of why markup languages are to be avoided when document portability is a concern. > What is needed is emphatic allowance and support for Unicode plain text > in the Unicode standard, i.e. a precise and thorough definition of what > constitutes a self-contained preformatted plain-text document. This is > primarily a matter of adopting a small but complete set of control codes > needed for line breaks, paragraph breaks, page breaks, and direction > control (most of these are already there), and a clear statement of the > role of the "traditional" control characters at U+0000 - U+001F, U+007F, > and U+0080 - U+009F. I think the notion of "plain text" is a little muddy as these sorts of codes represent markup that is conceptually no different than, say, SGML. I fully agree, however, that there is room and a historical precedent for a small set of control (markup) codes in Unicode, but getting people to agree on what constitutes "complete" is another matter. :) I would propose that "complete" be defined as a minimal set of markup codes necessary to make a document understandable by a human without resorting to anything outside the Unicode standard.
Machine processing, beyond doing the Right Thing with whitespace should not be a criterion. Except for directional control, most of the necessary markup should be covered by addressing compatibility with ASCII, although clarification would be helpful. > Plain text files can be used by many applications, but how do we mark > them as being written in Unicode? Or Latin-1? Or JIS X 0208, etc. SGML offers some options here by hiding the file system (or any storage mechanism) behind an entity manager which provides for such tagging. The details are not currently covered by the standard (which treats the entity manager pretty much as a black box), but the entity manager in James Clark's SP system offers a good example of how it might be done. -john 28-May-97 23:24:54-GMT,4953;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id TAA14930 for ; Wed, 28 May 1997 19:24:53 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA29376; Wed, 28 May 97 15:30:28 -0700 Message-Id: <9705282230.AA29376@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2747 (1997-05-28 22:29:45 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Wed, 28 May 1997 15:29:43 -0700 (PDT) Subject: Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) > It is precisely this fatal one-to-one markup/application > relationship that SGML is targeted at. SGML is a very different > beast and it is a mistake to throw it in with the rest. Claiming > that SGML is just another transient markup language that doesn't > address document portability ... > I don't think anybody did that. But this does not mean SGML can be used for everything. > Unlike other markup languages, SGML makes no assumptions about > the processing application. > Except that it can parse SGML.
I'm not arguing against SGML -- quite the opposite: I'm heavily in favor of (almost) anything that has survived the international standards process AND sees use in the real world, as opposed to schemes that companies make up and unilaterally proclaim to be standards.

But SGML is to mark up text for later formatting to fit the requirements of some output device or application that understands this kind of markup. That is distinct from plain text as we have known it since the 1960s, in which a repertoire of graphic characters is mixed with a small number of control codes (call them markup if you wish) for simple actions like line breaks and so on, in order to achieve the *final* result, not (necessarily) to be input for a higher-level reformatter.

> I would propose that "complete" be defined as a minimal set of
> markup codes necessary to make a document understandable by a
> human without resorting to anything outside the Unicode standard.
> Machine processing, beyond doing the Right Thing with whitespace
> should not be a criterion. Except for directional control, most of
> the necessary markup should be covered by addressing
> compatibility with ASCII, although clarification would be
> helpful.
>
Right. Something like the following (ignoring BIDI for the moment):

 . LS is a hard line break. The next graphic character appears at the left margin of the following line. Equivalent to CR and LF on a Teletype.
 . Two LSs result in a blank line.
 . Three LSs result in two blank lines, and so on.
 . PS is a hard paragraph break (more about this below).
 . FS (form separator), whatever its instantiation (a new Unicode character, or ASCII Formfeed with a well-defined use in Unicode), starts a new page. The next graphic character appears on the top line, leftmost position of the new page.
 . Two FSs result in a blank page, and so on.

Plus whatever is needed for specifying writing direction, including expanding on what is meant by "left", "top", etc, in the preceding items.
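[Editorial sketch, not part of the original message.] The LS and FS rules listed above can be sketched as a small splitter. This is only an illustration under stated assumptions: FS is instantiated as ASCII Form Feed (U+000C), which is just one of the two options the message mentions, PS and BIDI are left out entirely, and the name render_pages is hypothetical.

```python
LS = "\u2028"  # LINE SEPARATOR
FS = "\u000c"  # assumption: "form separator" instantiated as ASCII Form Feed


def render_pages(text):
    """Split text into pages (on FS), and each page into lines (on LS).

    Returns a list of pages, each page being a list of line strings.
    Empty strings stand for blank lines; a page holding a single empty
    line stands for a blank page.
    """
    return [page.split(LS) for page in text.split(FS)]


# Two LSs in a row produce one blank line between "a" and "b".
print(render_pages("a" + LS + LS + "b"))
# Two FSs in a row produce one blank page between "a" and "c".
print(render_pages("a" + FS + FS + "c"))
```

Because every separator is taken literally, n consecutive LSs always yield n-1 blank lines, which matches the "and so on" in the list.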
That should do it. Personally, I find text to be most portable when it is displayed in fixed-width font, and spaces are used to line things up, rather than tabs (because tabs require external agreement about the tab settings). I don't think Vertical Tab or other obscure formatting controls (such as Line Feed taken literally) are of any use; in my experience they have always been treated as "synonyms" for the controls listed above. Then what to do about ASCII controls in Unicode text? I'd say that since ASCII (and Latin-x, etc) must be converted to Unicode, then it is the responsibility of the conversion agent to understand the local conventions for line breaks (etc) in the source text, and to convert to the well-defined Unicode controls. About Paragraph Separator... It seems to me that this one was designed with the "export from word processor" type of file in mind (those files we were discussing earlier in which each paragraph is a long line, terminated by a "paragraph separator" such as CR). I would not call this type of file plain text -- I would call it "input for a text formatter"; it needs further processing to be readable. (For example, if I print such a file on the local Laserwriter, the long lines are truncated -- thus I only see the first 80 characters of each paragraph.) Clearly we can become increasingly epistemological about what constitutes plain text (yes, C source code is input for a C compiler, but it is also text to be read, understood, and edited by people, sent by email without being reformatted, etc). And obviously some details still need working out: treatment of soft hyphens and such. But I think we're on the right track. 
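[Editorial sketch, not part of the original message.] The "conversion agent" idea discussed in this message might look roughly like this. The convention names ("dos", "unix", "mac") and the function name to_line_separators are hypothetical; the point is only that the caller must state the local convention, because it cannot be inferred reliably from the text.

```python
LS = "\u2028"  # LINE SEPARATOR

# Local line-break conventions the conversion agent is assumed to know about.
LOCAL_LINE_BREAKS = {
    "dos": "\r\n",  # CR LF
    "unix": "\n",   # LF
    "mac": "\r",    # CR (classic Macintosh)
}


def to_line_separators(text, convention):
    """Replace the platform's line break with U+2028, given the source convention."""
    return text.replace(LOCAL_LINE_BREAKS[convention], LS)


print(repr(to_line_separators("roses are red\r\nviolets are blue", "dos")))
```

A stray CR in a "unix" file is left alone here, since under that convention it is not a line break.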
- Frank 29-May-97 4:14:06-GMT,3937;000000000001 Return-Path: Received: from fallout.campusview.indiana.edu (fallout.campusview.indiana.edu [149.159.1.1]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id AAA24371 for ; Thu, 29 May 1997 00:14:05 -0400 (EDT) Received: from localhost (jfieber@localhost) by fallout.campusview.indiana.edu (8.8.5/8.8.5) with SMTP id XAA06831; Wed, 28 May 1997 23:14:04 -0500 (EST) Date: Wed, 28 May 1997 23:14:03 -0500 (EST) From: John Fieber To: Frank da Cruz cc: Multiple Recipients of Subject: Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Wed, 28 May 1997, Frank da Cruz wrote: > > It is precisely this fatal one-to-one markup/application > > relationship that SGML is targeted at. SGML is a very different > > beast and it is a mistake to throw it in with the rest. Claiming > > that SGML is just another transient markup language that doesn't > > address document portability ... > I don't think anybody did that. But this does not mean SGML can > be used for everything. No, but its useful range of applications is quite a bit wider than any other markup scheme I know of. That helps a lot in building a solid foundation that won't fade away. > But SGML is to mark up text for later formatting to fit the > requirements of some output device or application that understands > this kind of markup. SGML is explicitly *not* about text formatting. It is about marking up documents describing what content *is*, not what to do with it. If markup represents typesetting instructions, that markup is good for little else. If your markup describes what the content is, you have far more options. For example, the introduction of a new term in a technical manual may be rendered in italics.
You could mark it up like: new term which would be fine if the end target is a typesetter, but if you mark it up with: new term, you can still render it as italic, but you can also automatically add it to the index as the defining location of the term, or in an on-line environment if you encounter an unfamiliar term, the search engine can seek out the defining occurrence if it exists. But back to your point: > As distinguished from plain text as we have ... > so on, in order to achieve the *final* result, not (necessarily) to > be input for a higher-level reformatter. Yes, though I would argue at length why SGML markup is well worth the extra effort, I'll also agree that this minimalist approach to document portability deserves support. > Then what to do about ASCII controls in Unicode text? I'd say > that since ASCII (and Latin-x, etc) must be converted to Unicode, > then it is the responsibility of the conversion agent to > understand the local conventions for line breaks (etc) in the > source text, and to convert to the well-defined Unicode controls. The only hitch for 7-bit ASCII is utf-8, which can be seen as a convenient way to avoid the explicit conversion process of legacy data. If your external storage is utf-8, how can you reliably tell what has been converted and what has not? > Clearly we can become increasingly epistemological about what > constitutes plain text (yes, C source code is input for a C > compiler, but it is also text to be read, understood, and edited > by people, sent by email without being reformatted, etc). After pondering it for a while, I cut that section out of my last post. :) One sentence summary: some markup schemes cater to human processing, others to machine processing, and yet others, most notably programming languages, work hard to satisfy both needs.
-john 29-May-97 14:32:08-GMT,1718;000000000001 Return-Path: Received: from unicode.org (unicode.org [192.195.185.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id KAA14517 for ; Thu, 29 May 1997 10:32:07 -0400 (EDT) Received: by unicode.org (NX5.67g/NX3.0M) id AA01680; Thu, 29 May 97 06:53:02 -0700 Message-Id: <9705291353.AA01680@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2753 (1997-05-29 13:52:26 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Thu, 29 May 1997 06:52:24 -0700 (PDT) Subject: Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg) In message "Re: Plain text vs. markup (was: re:Multi-Lingual Project Gutenberg)", 'fdc@watsun.cc.columbia.edu' writes: > Right. Something like the following (ignoring BIDI for the moment): > ... (details removed) BIDI is what I think makes it difficult. Without BIDI, I would be tempted to stick to local Unix/MAC/DOS conventions for C0 chars, add maybe BOM and ISS (or whatever). But BIDI works in blocks. Currently both LS and PS are block separators. It's been said here that probably LS shouldn't be a BIDI block separator. That leaves PS. And I have to use it (in particular if I have both right- and left-aligned sections). So can I mix PS with LS (or LF) and FF? Looks funny. Maybe it is an error to have PS function as both a paragraph separator (whatever that is -- I too feel it probably comes from WP context) *and* a BIDI block separator. Maybe it would have been better to have a BIDI block separator as a separate Unicode control char, independent of any formatting intents.
Just a thought, Pierre 6-Jun-97 2:39:06-GMT,3185;000000000001 Return-Path: Received: from mail-out1.apple.com (A17-254-0-52.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id WAA10937 for ; Thu, 5 Jun 1997 22:39:05 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id TAA14624; Thu, 5 Jun 1997 19:25:14 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA00269; Thu, 5 Jun 97 19:21:45 -0700 Message-Id: <9706060221.AA00269@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-2022-jp Content-Transfer-Encoding: 7bit X-Uml-Sequence: 2832 (1997-06-06 02:21:05 GMT) To: Multiple Recipients of Reply-To: Adrian Havill From: "Unicode Discussion" Date: Thu, 5 Jun 1997 19:21:03 -0700 (PDT) Subject: Re: Comments on ? Tim Partridge wrote: > I agree with his point of view that the tags > should be at the character level and not just > in the UTF-8 format. > > How about using Escape sequences? Ugh. The relatively few escape sequences at the character level is what makes Unicode so ATTRACTIVE, esp. to those that currently use escape sequence based character sets. (Tools to repair broken escape codes in JIS are almost standard equipment with most Japanese computer systems) Not to mention the complexity they add to simple and elegant string manipulation functions... processing escape codes can sometimes bump the algorithm efficiency up by one O() level. Put in escape codes at the character level, and Unicode begins to lose the simplicity factor, and becomes just another mammoth character set that nobody can or will implement--there are plenty out there. If I wanted escape sequences, I could choose from a lot of other character sets that are already out there. If you want a complicated character system that does tags and everything, there are plenty to choose from-- Unicode basher Prof. Ken Sakamura (U. of Tokyo) and Co. 
would be more than happy to tout the virtues of TRON, which is loaded with escape sequences galore. The TRON project has made a religion out of bad-mouthing Unicode, much like the computer industry has made a religion out of bad-mouthing a certain software firm in Redmond, Washington (who make a darn fine Unicode based OS, I might add). They have to-- they have to justify that the years of blood, sweat, tears (and most importantly, money) they've used making -their- worldwide standard character set has not repeated work that's already here and in use and better. (see and ) Granted, Unicode is complicated. It will get more complicated. This is a fact of life as representing languages is complicated. But I'd hope the character level stays as simple as possible, for those that need simplicity. I do NOT agree that tags should be at the character level. -- Adrian Havill Engineering Division, System Planning & Production Section 6-Jun-97 14:37:44-GMT,2109;000000000011 Return-Path: Received: from mail-out1.apple.com (A17-254-0-52.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA18800 for ; Fri, 6 Jun 1997 10:37:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id HAA12466; Fri, 6 Jun 1997 07:22:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02593; Fri, 6 Jun 97 07:16:28 -0700 Message-Id: <9706061416.AA02593@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2847 (1997-06-06 14:14:18 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Fri, 6 Jun 1997 07:14:17 -0700 (PDT) Subject: Re: Comments on ? In message "Re: Comments on ?", 'glenn@spyglass.com' writes: > I'd like to briefly summarize some of the positions taken on various > sides in this discussion. Thanks, very useful (esp. for one who didn't have the time to read all the posts carefully). 
I haven't read the MLSF yet (will do this weekend), but I'm sure I still won't agree with putting this tagging in UTF-8. UTF-8 is nothing more than one of many possible transformation formats, and it must always be possible to move between it and UCS-2 and other UTFs. Filters surely will (and almost certainly already do) exist to transform between these various CESs. What would they do with language tagging? > My personal position on the above is that an alternative non-UCD (i.e., > standard code assignment) approach is preferred. Its only negatives are > (a) opposition from (1) above and (b) the time required to make actual > code assignments. Sounds to me like the only possible approach, assuming language tagging is needed at the plain-text level (I don't have the knowledge to comment on that). Pierre P.S. What happened to the "unicode plain-text file" thread? Seems it died very suddenly (with no closure)! Maybe it was displaced by this new thread :-). 6-Jun-97 15:15:46-GMT,1789;000000000001 Return-Path: Received: from mail-out1.apple.com (A17-254-0-52.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA26420 for ; Fri, 6 Jun 1997 11:15:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id IAA11222; Fri, 6 Jun 1997 08:03:32 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02791; Fri, 6 Jun 97 07:57:13 -0700 Message-Id: <9706061457.AA02791@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2848 (1997-06-06 14:56:16 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Fri, 6 Jun 1997 07:56:15 -0700 (PDT) Subject: Re: Comments on ? > P.S. What happened to the "unicode plain-text file" thread? Seems it > died very suddenly (with no closure)! Maybe it was displaced by this > new thread :-). > It seems as if this is trying to become a plain-text issue. I hope not. 
Plain text is supposed to be a simple sequence of *characters* and minimal formatting information (hard spaces, line breaks, page breaks, and in the case of Unicode, directionality indicators), irrespective of language, containing no mysterious metacodes. (Let's agree that hard line and page breaks are not mysterious metacodes.) In view of the temperature surrounding the language-tagging issue, the solution is not going to be simple or stable or soon to come, and therefore I believe it falls outside the scope of plain text, which by definition should be simple and stable and long-lasting. Language tags will be constantly changing and surrounded by politics and emotion. - Frank 7-Jun-97 16:38:53-GMT,1467;000000000001 Return-Path: Received: from mail-out2.apple.com (A17-254-0-51.apple.com [17.254.0.51]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA02679 for ; Sat, 7 Jun 1997 12:38:52 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out2.apple.com (8.8.5/8.8.5) with SMTP id JAA07384; Sat, 7 Jun 1997 09:27:30 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA07340; Sat, 7 Jun 97 09:24:48 -0700 Message-Id: <9706071624.AA07340@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2857 (1997-06-07 16:24:32 GMT) To: Multiple Recipients of Reply-To: Frank da Cruz From: "Unicode Discussion" Date: Sat, 7 Jun 1997 09:24:30 -0700 (PDT) Subject: Re: Plane 14 codes for language tagging? > > > My personal preference is for number 2. I kind of like Martin's proposal > > > for introducing a plain-text language tag using a control code, and I > > > think the existing control codes are fine. > > Good idea. Indeed the C1 area is not used in the Internet as far as I know. > There are still such things as terminals that use C1 control codes such as CSI, APC, OSC, etc (primarily VT220 and higher, which are the predominant types used by emulators such Kermit, Xterm, DECterm, etc). 
Do we intend that Unicode and terminal-to-host communication will become mutually exclusive concepts? - Frank 7-Jun-97 17:14:31-GMT,1996;000000000011 Return-Path: Received: from josef.ifi.unizh.ch (josef.ifi.unizh.ch [130.60.48.10]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with SMTP id NAA07882 for ; Sat, 7 Jun 1997 13:14:30 -0400 (EDT) Received: from ifi.unizh.ch by josef.ifi.unizh.ch with SMTP (PP) id <18036-0@josef.ifi.unizh.ch>; Sat, 7 Jun 1997 19:14:30 +0200 Date: Sat, 7 Jun 1997 19:14:28 +0200 (MET DST) From: "Martin J. Duerst" Sender: mduerst@enoshima To: Frank da Cruz cc: Multiple Recipients of , MLSF discussion -- IETF Languages , Multiple Recipients of Subject: Re: Plane 14 codes for language tagging? In-Reply-To: Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII On Sat, 7 Jun 1997, Frank da Cruz wrote: > > > > My personal preference is for number 2. I kind of like Martin's proposal > > > > for introducing a plain-text language tag using a control code, and I > > > > think the existing control codes are fine. > > > > Good idea. Indeed the C1 area is not used in the Internet as far as I know. > > > There are still such things as terminals that use C1 control codes such as > CSI, APC, OSC, etc (primarily VT220 and higher, which are the predominant > types used by emulators such Kermit, Xterm, DECterm, etc). Do we intend that > Unicode and terminal-to-host communication will become mutually exclusive > concepts? Frank - I understand your concerns. But one way of looking at what we need is some tagging format possibly used in ACAP and IMAP, which MUST not leak to other places. And what you probably worry about is the C1 area in terms of octets (which is already gone with UTF-8) and not the C1 character space in Unicode, which turns up as two bytes in UTF-8. Regards, Martin. 
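As an illustration of Martin's last point (not part of the original message), here is a minimal Python sketch showing that the C1 control characters, U+0080 through U+009F, each become two octets in UTF-8, so the single octet 0x9B that an 8-bit terminal reads as CSI does not appear on its own in a UTF-8 stream. The names CSI, utf8_hex and c1_lengths are made up for this example.

```python
# C1 control characters occupy code points U+0080..U+009F.
# CSI (CONTROL SEQUENCE INTRODUCER) is U+009B; on an 8-bit terminal line
# it travels as the single octet 0x9B.
CSI = "\u009b"


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of one character as a hex string."""
    return ch.encode("utf-8").hex()


# Collect the encoded length of every C1 code point.
c1_lengths = {len(chr(cp).encode("utf-8")) for cp in range(0x80, 0xA0)}

print(utf8_hex(CSI))   # hex of the two UTF-8 octets for CSI
print(c1_lengths)      # set of encoded lengths seen across the C1 range
```

Whether a given terminal or emulator accepts the two-octet form as CSI is a separate question that this sketch does not address.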
8-Jun-97 8:27:08-GMT,2730;000000000001 Return-Path: Received: from mail-out1.apple.com (mail-out1.apple.com [17.254.0.52]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA06491 for ; Sun, 8 Jun 1997 04:27:07 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out1.apple.com (8.8.5/8.8.5) with SMTP id BAA08438; Sun, 8 Jun 1997 01:14:24 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09568; Sun, 8 Jun 97 01:11:13 -0700 Message-Id: <9706080811.AA09568@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 2866 (1997-06-08 08:10:50 GMT) To: Multiple Recipients of Reply-To: "Pierre Lewis" From: "Unicode Discussion" Date: Sun, 8 Jun 1997 01:10:49 -0700 (PDT) Subject: Re: Comments on Finally got around to reading the MLSF Internet Draft. Couple of comments: 1) One thing really made me jump: the first sentence in the Abstract. "While UTF-8 solves most internationalization (I18N) problems, ..." That makes as much sense to me as saying that QuotedPrintable solves most I18N problems for Western Europe. It's not QP which does that, it's ISO 8859-1. QP is just one way to encode 8859-1 text so it can pass most mail relays without corruption. But Base64 is another way to do the same thing (which can make statistical sense for some languages). Similarly, it's not UTF-8 which solves the wider problem of world-wide I18N, it's Unicode (and/or ISO 10646). The canonical representation of Unicode is 16-bit quantities (UCS-2). UTF-8 is nothing more than one of many possible transformations (UTF-7 is another that's already defined: RFC 2152). If I understood right, UTF-8 was created mainly to make Unicode coexist reasonably well with existing OSs that use 8-bit characters, for example Unix.
Not that I agree with the proposal, but the MLSF Internet Draft should make clear what the implications are of trying to put language tags into UTF-8 (for example, assumption that UTF-8 becomes the canonical representation of Unicode, loss of tagging when converting to other CESs). I guess the pros and cons have been discussed at length here. 2) It would have been nice to put a few examples of actual UTF-8 strings with language tags (in hex of course) in the document. As to the fundamental issue of whether language tagging belongs in plain-text Unicode, I must say I'm pretty neutral at this point. I think they could be useful. But, as Frank was saying, if it's going to take 10 years to converge to an acceptable solution, then it doesn't belong in plain text, but at a higher level. Pierre 9-Jun-97 3:10:12-GMT,1193;000000000001 Return-Path: Received: from cam.spyglass.com (sapir.cam.spyglass.com [208.203.148.66]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA24496 for ; Sun, 8 Jun 1997 23:10:11 -0400 (EDT) Received: from mykhe.cam.spyglass.com (shivacam-1.cam.spyglass.com [208.203.149.181]) by cam.spyglass.com (8.7.5/8.7.3) with SMTP id XAA00525 for ; Sun, 8 Jun 1997 23:10:22 -0400 (EDT) Message-Id: <3.0.32.19970608224316.006e9e50@mailhost.cam.spyglass.com> X-Sender: glenn@mailhost.cam.spyglass.com X-Mailer: Windows Eudora Pro Version 3.0 (32) Date: Sun, 08 Jun 1997 22:57:16 -0400 To: Frank da Cruz From: Glenn Adams Subject: Re: Plane 14 codes for language tagging? Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" At 10:32 AM 6/7/97 -0700, you wrote: >and escape sequences would take in a "Unicode terminal"? Would it use >octets or hextets? The Unicode standard is clear that escape sequences and controls in canonical Unicode are encoded using 16-bit codes. Of course another encoding system which employs Unicode may choose a different tack. G. 
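To make the "transformation format" argument concrete (an illustration, not part of the original messages), the following Python sketch moves the same text between UTF-8 and 16-bit big-endian code units and back. It assumes Python's built-in "utf-8" and "utf-16-be" codecs as stand-ins for the filters Pierre mentions; transcode and sample are hypothetical names. Only the character codes survive such a filter, which is the concern raised about any tagging that lives in one byte-level encoding alone.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Decode bytes from one encoding scheme and re-encode them in another."""
    return data.decode(src).encode(dst)


# Sample text containing LINE SEPARATOR (U+2028) and PARAGRAPH SEPARATOR (U+2029).
sample = "line one\u2028line two\u2029"

utf8 = sample.encode("utf-8")
ucs2 = transcode(utf8, "utf-8", "utf-16-be")    # one 16-bit unit per BMP character
back = transcode(ucs2, "utf-16-be", "utf-8")

# The round trip is lossless at the character level.
assert back == utf8

# A control such as ESC (U+001B) is a full 16-bit code in the 16-bit form.
esc_16 = "\x1b".encode("utf-16-be")
print(esc_16.hex())
```

The sample stays within the Basic Multilingual Plane, so each character is a single 16-bit unit; characters outside it would need surrogate pairs in UTF-16, which plain UCS-2 does not provide.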
4-Jul-97 0:38:37-GMT,4502;000000000001 Return-Path: Received: from mail-out2.apple.com (mail-out2.apple.com [17.254.0.51]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA01503 for ; Thu, 3 Jul 1997 20:38:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by mail-out2.apple.com (8.8.5/8.8.5) with SMTP id RAA37606; Thu, 3 Jul 1997 17:27:11 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA11841; Thu, 3 Jul 97 17:22:24 -0700 Message-Id: <9707040022.AA11841@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 3064 (1997-07-04 00:22:02 GMT) To: Multiple Recipients of Reply-To: Randy Presuhn From: "Unicode Discussion" Date: Thu, 3 Jul 1997 17:22:01 -0700 (PDT) Subject: UTF-8 in SNMPv3 Hi - The SNMPv3 working group of the IETF is hoping to make use of UTF-8 for some human-readable information in the MIBs used to manage SNMPv3. The convention currently used for this kind of information is described on page 4 of RFC 1903. (For easy reference, I've appended the text to the end of this message.) We would like to define a new convention formulated in terms of UTF-8 for use in new MIBs. What we've not yet reached agreement on is the question of "non-printable stuff". Some believe that NVT ASCII's control characters are somehow less problematic than those of 10646, others find the problems equivalent. The questions that come to my mind are: 1) Is there any merit to the argument that the "non-printable stuff" in 10646 is any better or worse than the NVT ASCII definition? 2) Can we use standard character properties to identify a "printable" subset that would not break for any language? (The folks that want these also want to have CRLF...) Background information: In the SNMP protocol notions of equality and ordering have no "locale" component. There is no notion of character equivalence. It is very much a "bits is bits" environment.
The concerns of working group members appear to be arising from:

  1) what does it mean to "support 10646"
  2) how to display "weird stuff"
  3) how to input "weird stuff"
  4) the old CR/LF problem

Is there a nice, concise, convincing answer I can take back to the working group?

========== Excerpt from RFC 1903, DisplayString Textual convention ==========

"Represents textual information taken from the NVT ASCII character set, as defined in pages 4, 10-11 of RFC 854. To summarize RFC 854, the NVT ASCII repertoire specifies:

  - the use of character codes 0-127 (decimal)
  - the graphics characters (32-126) are interpreted as US ASCII
  - NUL, LF, CR, BEL, BS, HT, VT and FF have the special meanings specified in RFC 854
  - the other 25 codes have no standard interpretation
  - the sequence 'CR LF' means newline
  - the sequence 'CR NUL' means carriage-return
  - an 'LF' not preceded by a 'CR' means moving to the same column on the next line.
  - the sequence 'CR x' for any x other than LF or NUL is illegal. (Note that this also means that a string may end with either 'CR LF' or 'CR NUL', but not with CR.)

Any object defined using this syntax may not exceed 255 characters in length."

========== End Excerpt ===============

--------------------------------------------------------------------- Randy Presuhn BMC Software, Inc.
(Silicon Valley Division) Voice: +1 408 556-0720 (Formerly PEER Networks) http://www.bmc.com Fax: +1 408 556-0735 1190 Saratoga Avenue, Suite 130 Email: rpresuhn@bmc.com San Jose, California 95129-3433 USA --------------------------------------------------------------------- In accordance with the BMC Communications Systems Use and Security Policy memo dated December 10, 1996, page 2, item (g) (the first of two), I explicitly state that although my affiliation with BMC may be apparent, implied, or provided, my opinions are not necessarily those of BMC Software and that all external representations on behalf of BMC must first be cleared with a member of "the top management team." --------------------------------------------------------------------- 30-Jun-99 19:29:47-GMT,1992;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA19372 for ; Wed, 30 Jun 1999 15:29:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA342738 ; Wed, 30 Jun 1999 12:18:25 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA07842; Wed, 30 Jun 99 12:01:45 -0700 Message-Id: <9906301901.AA07842@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8249 (1999-06-30 19:01:34 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Wed, 30 Jun 1999 12:01:33 -0700 (PDT) Subject: Re: Unicode selections for X11 (cont'd) Juliusz Chroboczek wrote: > I've got a question about the C0 and C1 control character ranges. > I call them `legacy control characters'. Do people object to this > terminology? > I hope so! The word "legacy" is emotionally toned and value-laden. It denigrates 30+ years of computing practice and standards activities, and it implies that plain text is a relic of the past to be discarded with all possible haste, and those who haven't done so yet have some sort of "character" defect. 
In fact, plain text is the only immutable format in computing. GUI and WYSIWYG formats change faster than anybody can keep up with them, and information encoded in these formats rapidly becomes inaccessible (or accessible only by utilities (like UNIX "strings") that extract the plain text from them, if there is any). > Does anyone have a better name? > C0 and C1 control characters. These are ISO standard character sets and ISO-standard terminology is available to refer to them. Finally, please remember that Unicode is a plain-text standard. The control characters are there for a reason: you need them in plain text. - Frank 30-Jun-99 19:54:27-GMT,2968;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA29133 for ; Wed, 30 Jun 1999 15:54:26 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA188082 ; Wed, 30 Jun 1999 12:50:57 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08427; Wed, 30 Jun 99 12:36:54 -0700 Message-Id: <9906301936.AA08427@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 (generated by tm-edit 7.104) Content-Type: text/plain; charset=US-ASCII X-Uml-Sequence: 8252 (1999-06-30 19:36:20 GMT) From: Juliusz Chroboczek To: Unicode List Date: Wed, 30 Jun 1999 12:36:18 -0700 (PDT) Subject: Re: Unicode selections for X11 (cont'd) >> I've got a question about the C0 and C1 control character ranges. >> I call them `legacy control characters'. Do people object to this >> terminology? Frank da Cruz : FdC> I hope so! The word "legacy" is emotionally toned and FdC> value-laden. 
It denigrates 30+ years of computing practice and
FdC> standards activities, and it implies that plain text is a relic
FdC> of the past to be discarded with all possible haste,

It cannot be said that the C0 and C1 control characters are the greatest achievement of these ``30+ years etc.''

FdC> In fact, plain text is the only immutable format in computing.

Agreed. And the only reason it is not portable is the poor standardisation of the C0 and C1 control characters. I've seen the following forms of plain text:

  NL is a line break, there's no paragraphs: Unix
  NL is a line break, NL NL is a paragraph separator: Unix
  NL is a paragraph separator, line breaks are implicit: ports of MS-DOS applications to Unix.
  CR LF is a line break: MS-DOS
  CR LF is a paragraph separator, line breaks are implicit: MS-DOS.
  CR LF is a paragraph separator, CR (or was it LF?) is a line break: MS-DOS.
  CR is a line break: MacOS.
  CR is a paragraph separator: MacOS.

without counting, of course, systems on which record information is kept out-of-band (such as VMS).

>> Does anyone have a better name?

FdC> C0 and C1 control characters. These are ISO standard character
FdC> sets and ISO-standard terminology is available to refer to them.

Okay. Changed.

FdC> Finally, please remember that Unicode is a plain-text standard.
FdC> The control characters are there for a reason: you need them in
FdC> plain text.

You need a paragraph separator and possibly a line break (and perhaps a page break). Unicode defines well-standardised codepoints for those. If you use other control characters, such as SO/SI for controlling boldface or italics, or BS (or CR) for overstriking, or terminal control sequences, it ain't plain text no more.

J.
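For illustration only (not part of the original message), here is a minimal Python sketch of what converting these platform conventions to the Unicode separators could look like. It assumes just one of the readings listed above, namely that a single break is a line break and a blank line is a paragraph break, and it treats CR LF, bare CR and bare LF alike. The function name to_unicode_separators is hypothetical, and a file that follows one of the other conventions (for example, each CR LF ending a paragraph) would need a different rule.

```python
import re

LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR

# CR LF must be tried before bare CR so the pair counts as one break.
_NEWLINE = re.compile(r"\r\n|\r|\n")


def to_unicode_separators(text: str) -> str:
    """Map platform line breaks to U+2028 and blank lines to U+2029."""
    # Step 1: fold CR LF, bare CR and bare LF into a single LF.
    unified = _NEWLINE.sub("\n", text)
    # Step 2: a run of two or more breaks (a blank line) becomes one PS.
    unified = re.sub(r"\n{2,}", PS, unified)
    # Step 3: every remaining single break becomes LS.
    return unified.replace("\n", LS)


converted = to_unicode_separators("a\r\nb\r\n\r\nc\rd\n")
print([hex(ord(ch)) for ch in converted if ch in (LS, PS)])
```

The ambiguity in the list is exactly why the convention has to be chosen by the caller: the same bytes produce different separators under a different reading.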
30-Jun-99 20:08:23-GMT,4025;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA03904 for ; Wed, 30 Jun 1999 16:08:22 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA200106 ; Wed, 30 Jun 1999 12:52:24 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08371; Wed, 30 Jun 99 12:35:22 -0700 Message-Id: <9906301935.AA08371@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8250 (1999-06-30 19:35:08 GMT) From: Asmus Freytag To: Unicode List Date: Wed, 30 Jun 1999 12:35:07 -0700 (PDT) Subject: Re: Superscript asterisk Being able to do "plain text" math is one of the goals of the Unicode Technical Committee now. Since the publication of Unicode 2.0, three years ago, we have had a lot of expert input on what plain text math capabilities are needed, and also, where our existing repertoire of math operators is insufficient. (We are, incidentally, also interested in evaluating and improving our other technical symbol collections, but so far have not had the long and sustained input from experts in other fields, as we had for mathematics). Full layout of mathematical expressions will need some form of markup, although many formulas that do not need the full generality can be laid out correctly if the mathematical operator characters in Unicode are interpreted semantically. Semantics for formatting that one needs to distinguish e.g. between summation sign and sigma. They look the same, but summation sign can take limit expressions etc. Another aspect of semantics is the mathematical semantics. Here it's necessary to make enough distinctions so that, if a small and large form of an operator can occur in the same text, that they can be distinguished by their character code without recourse to font information. 
Doing so allows plain-text searches for math formulas. Caveat: If and where mathematicians have used 'operator overloading', to borrow a C++ term, and deliberately used the same operator with different mathematical meaning in another sub-discipline, we would not sub-divide the character, as the larger context would be enough to determine its meaning. Our foremost goal has therefore been to complete our repertoire and where necessary introduce additional distinctions for the two reasons I mentioned. In the case of ASTERISK, the analysis that is needed, and that, as far as I have seen, has not been made, is to present evidence that cases exist (or are easily conceivable) where *both* the ASCII asterisk and yet another asterisk are needed in the same text, and with consistent distinction in use or formatting. Ricardo has said that one could use the proposed asterisk in conjunction with the ASCII asterisk to denote a regular expression of zero or more asterisks. This is the one example that cannot serve, since by extension, it would require an infinite series of asterisks (suppose I wanted to define a regular expression consisting of zero or more instances of the proposed asterisk!). Typographically, asterisk may indeed show a variation between full-size and superscript forms. For standard text fonts, the full-size form of asterisk occurs only occasionally. In the vast majority of fonts on my system, as well as in the Unicode Standard, and ISO/IEC10646-1, ASTERISK is clearly depicted as a superscripted symbol (i.e. it's 1/2 height and extends upwards from the centerline of the font, which is just slightly below the x height). The asterisk and superscript 2 have the same location and dimension. Therefore, unless Ricardo is proposing a character that has the same dimension as a *superscripted* SUPERSCRIPT TWO, my conclusion would be that we already _have_ the character he wants, and that he is using a poor font for his purpose.
A./ 30-Jun-99 20:24:18-GMT,2893;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA08593 for ; Wed, 30 Jun 1999 16:24:17 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA188518 ; Wed, 30 Jun 1999 13:10:34 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08959; Wed, 30 Jun 99 13:00:48 -0700 Message-Id: <9906302000.AA08959@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8253 (1999-06-30 20:00:25 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Wed, 30 Jun 1999 13:00:24 -0700 (PDT) Subject: Re: Unicode selections for X11 (cont'd) > It cannot be said that the C0 and C1 control characters are the > greatest achievement of these ``30+ years etc.'' > Actually they served us all rather well considering how few of them there are and how long they lasted (and continue to last). We've covered this ground before... But (to cite only one example) do you know how many terminals and terminal emulators are "still" in use? I would venture to say the number has not declined significantly since the 1980s. It might well have increased. It's just that they are no longer the *only* form of online access, and they work well, so we ignore them. > FdC> In fact, plain text is the only immutable format in computing. > > Agreed. And the only reason it is not portable is the poor > standardisation of the C0 and C1 control characters. > The CR/LF/CRLF confusion is annoying of course, but we've lived with it all these years, and continue to live with it. But you're talking about file formats. The use of control characters in data communications is fairly well standardized, pretty much along the lines of a Teletype: CR moves the print head to the left margin, LF moves it down one line, and ESC introduces a device-dependent escape or control sequence, etc. 
> FdC> Finally, please remember that Unicode is a plain-text standard. > FdC> The control characters are there for a reason: you need them in > FdC> plain text. > > You need a paragraph separator and possibly a line break (and perhaps > a page break). Unicode defines well-standardised codepoints for > those. If you use other control characters, such as SO/SI for > controlling boldface or italics, or BS (or CR) for overstriking, or > terminal control sequences, it ain't plain text no more. > But Unicode and the terminal access model are not mutually exclusive. There can be (and are) Unicode-based terminal emulators, capable of handling (e.g.) UTF-8 on the wire. And when you have terminal communications, you have control characters. (When you emulate, say, a VT320, you have LOTS of control characters :-) - Frank 30-Jun-99 21:45:11-GMT,2978;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA04617 for ; Wed, 30 Jun 1999 17:45:10 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id OAA339770 ; Wed, 30 Jun 1999 14:33:51 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09734; Wed, 30 Jun 99 14:17:35 -0700 Message-Id: <9906302117.AA09734@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8255 (1999-06-30 21:17:25 GMT) From: Markus Kuhn To: Unicode List Date: Wed, 30 Jun 1999 14:17:23 -0700 (PDT) Subject: Re: Plain Text Juliusz Chroboczek wrote on 1999-06-30 19:36 UTC: > You need a paragraph separator and possibly a line break (and perhaps > a page break). Unicode defines well-standardised codepoints for > those. If you use other control characters, such as SO/SI for > controlling boldface or italics, or BS (or CR) for overstriking, or > terminal control sequences, it ain't plain text no more.
The only thing that is clear about "plain text" is that it is not well defined at all. There is certainly no ISO standard that gives you any indication of what "plain text" is. The Unix community feels somewhat confident about the notion of plain text, just because they have editors such as ed, vi, emacs, etc. that agree on a common text format that is so simple that it has become customary to refer to it as plaintext. Many aspects of "plain text" are ill-defined these days: a) how do you terminate lines and paragraphs b) is there a terminator after the last line/paragraph c) is the line formatting the task of the sending or the receiving process? For Unix the answers used to be a) LF and no paragraph concept b) yes c) the sender has to insert line breaks but thanks to the heterogeneity of the Internet, these strict rules have for some years been weakened significantly in common practice. Some aspects of the classical Unix plaintext definition (which came originally from tty output hardware interfaces) do not make sense any more. For example, the insertion of LFs in the middle of paragraphs causes these LFs to move around whenever a few words are changed, which seriously disrupts revision control systems (e.g., diff and RCS) and it is not adequate anymore at all today with reformatting web browsers now being a dominating output device and not 1960s ttys. I think the Unix community should slowly get used to the idea of abandoning LFs in the middle of paragraphs in plain text documents and let the editor and display tool perform the reformatting at display time. Markus -- Markus G.
Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 30-Jun-99 22:46:24-GMT,2237;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA22071 for ; Wed, 30 Jun 1999 18:46:24 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA187464 ; Wed, 30 Jun 1999 15:36:20 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10018; Wed, 30 Jun 99 15:25:38 -0700 Message-Id: <9906302225.AA10018@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8256 (1999-06-30 22:25:27 GMT) From: John Cowan To: Unicode List Date: Wed, 30 Jun 1999 15:25:26 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Markus Kuhn scripsit: > The only thing that is clear about "plain text" is that it is not well > defined at all. There is certainly no ISO standard that gives you any > indication of what "plain text" is. What a pity. Perhaps there should be one (no :-)). > The Unix community feels somewhat > confident about the notion of plain text, just because they have editors > such as ed, vi, emacs, etc. that agree on a common text format that is > so simple that it has become customary to refer to it as plaintext. The notion of plain text long predates Unix: it was exactly the same, for example, on the PDP-8, which is where I first learned computing. (Terminator was CR/LF, and the character code was 7-bit-ASCII-with-8th-bit- set, for uniformity with Model 33 Teletypes). > I think the Unix community should slowly get used to the idea of > abandoning LFs in the middle of paragraphs in plain text documents and > let the editor and display tool perform the reformatting at display > time. 
AFAIK, the "reformatting web browsers" you refer to do not reformat plain text at all, which means that infinite-line-length alleged plain text can be read only with difficulty and much scrolling, and printing is impossible. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 30-Jun-99 22:54:43-GMT,3347;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA23739 for ; Wed, 30 Jun 1999 18:54:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA57670 ; Wed, 30 Jun 1999 15:46:45 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10105; Wed, 30 Jun 99 15:33:04 -0700 Message-Id: <9906302233.AA10105@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8257 (1999-06-30 22:32:56 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Wed, 30 Jun 1999 15:32:55 -0700 (PDT) Subject: Re: Plain Text > The only thing that is clear about "plain text" is that it is not well > defined at all. > Actually, it tends to be well-defined for each platform. And then the interchange methods among platforms tend to converge on a few simple conventions: ASCII (or the appropriate ISO character set, or now UTF-8 or other form of Unicode), as opposed to EBCDIC (or Baudot, or Sixbit); CRLFs separating lines, and paragraphs separated by blank lines. Somewhat less well defined, but nevertheless in common use, are bare Carriage Return or Backspace for overstriking, Formfeed for "new page", and Tab for tabbing (with several different conventions about tabstops). Lines are terminated at somewhere between 72 and 80 characters by convention, because that's how wide terminal screens are, and before them the Teletype carriage, and before that the most common kind of punchcard. 
Or for that matter, typewriters and sheets of paper (A4 or US, take your pick :-) To this day, we follow these conventions in newsgroups and email, although now it might be more a matter of "netiquette" than necessity (as in the BITNET days, when e-mail was, quite literally, 80-column card images). These simple conventions let us format our text exactly the way we want to. We can indent or not, we can put line breaks where we want them, we can have columns of numbers or other tabular presentations, mathematical expressions, and idiosyncratic forms of emphasis. Many people want their text to stay the way they wrote it. And many people also are not fond of receiving email in every kind of bizarre format that any application developer can dream up when it contains, in fact, nothing but words (but I stray). > I think the Unix community should slowly get used to the idea of > abandoning LFs in the middle of paragraphs in plain text documents and > let the editor and display tool perform the reformatting at display > time. > But what IS plain text? Maybe some people might like to have their email reformatted, but I don't think they want their C or Fortran or PostScript programs to receive the same treatment. Nor, for that matter, poetry or any other forms of text where line breaks, indentation, and blank lines serve a purpose. As in, for example, the preceding paragraph. No more plain-text bashing! No more "legacy" saying! Our focus should be not on stamping out plain text, but on promoting international multilingual communication through a universal character set that does not impose a particular modus vivendi upon its users.
- Frank 30-Jun-99 23:19:45-GMT,1376;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA26033 for ; Wed, 30 Jun 1999 19:19:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA258932 ; Wed, 30 Jun 1999 16:07:28 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10518; Wed, 30 Jun 99 15:53:44 -0700 Message-Id: <9906302253.AA10518@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8258 (1999-06-30 22:53:34 GMT) From: John Cowan To: Unicode List Date: Wed, 30 Jun 1999 15:53:33 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz scripsit: > No more plain-text bashing! No more "legacy" saying! Our focus should be > not on stamping out plain text, but on promoting international multilingual > communication through a universal character set that does not impose a > a particular modus vivendi upon its users. Hear, hear! Unicode (n.): The *last* legacy character set. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 1-Jul-99 20:12:49-GMT,3132;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id QAA02796; Thu, 1 Jul 1999 16:11:21 -0400 (EDT) Date: Thu, 1 Jul 99 16:11:21 EDT From: Frank da Cruz To: Otto Stolz cc: unicode@unicode.org Subject: Re: Plain Text In-Reply-To: Your message of Thu, 1 Jul 1999 03:57:30 -0700 (PDT) Message-ID: > Am 1999-06-30 um 14:17 h PDT hat Markus Kuhn geschrieben: > > The only thing that is clear about "plain text" is that it is not well > > defined at all. > > Am 1999-06-30 um 15:32 h PDT hat Frank da Cruz geschrieben: > > Actually, it tends to be well-defined for each platform. 
>
> In MS-DOS (or PC-DOS and other DOS variants) on the PC, it is not
> well defined, at all:
>
Not to prolong this discussion, which took place once before, at great length, in May to July 1997...

> - '0D0A'x (CR+LF) means either line-break or paragraph separator,
>
When/if it means paragraph separator it's not plain text. Plain text is what you TYPE at the DOS prompt. In such files (e.g. a READ.ME file) CRLF means Carriage Return (move the cursor to the left margin) and Line Feed (move the cursor down one row).

> - '09'x (HT) means either a tabulator (and nobody knows where the
> tab positions are supposed to be) or a line-break,
>
In DOS, when you TYPE a file at the DOS prompt, a Tab character is expanded to enough blanks to bring us to the next tab stop; the tab stops are set according to the most common convention: 1, 9, 17, ... (1-based).

> - '1A'x (SUB, aka Ctrl-Z) either means end of text, or a
> right-pointing arrow; when it is used as an end-of-text marker,
> the remainder of the storage block may contain arbitrary characters
> with some programs and must contain '00'x with other programs (nice
> feature when one of the former writes a file one of the latter is
> supposed to read).
>
That's not a plain-text issue, it's a character encoding and file format issue. Ctrl-Z as an EOF indicator is a relic of CP/M, carried forward into DOS for compatibility, used by some apps and ignored by others.

Two years ago I suggested that we come up with a standard for Unicode plain text that can be used as a baseline when converting files from DOS, UNIX, the Macintosh, etc, to Unicode, and that says what control characters (C0, C1, as well as Line Separator, Paragraph Separator, etc) mean in a plain-text file or data stream. We made some good progress but eventually the discussion fizzled out. If I can summarize it briefly: .
Yes, but plain text in this sense is inadequate for representing (list of writing systems that need higher-level formatting assistance, rendering engines, etc.) . Fine, but they need that anyway. For many other languages, plain text is possible, and there should be no reason not to settle on a standard representation for it in those cases where it can be used. If anybody would like to revisit that discussion, I've uploaded it to: ftp://kermit.columbia.edu/kermit/e/plain.txt (about 300K of plain text :-) - Frank 2-Jul-99 7:37:25-GMT,6658;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id DAA21905 for ; Fri, 2 Jul 1999 03:37:24 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id AAA206792 ; Fri, 2 Jul 1999 00:29:42 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA18813; Fri, 2 Jul 99 00:08:06 -0700 Message-Id: <9907020708.AA18813@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8285 (1999-07-02 07:07:55 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 00:07:54 -0700 (PDT) Subject: Re: Plain Text At 15:32 -0700 6/30/1999, Frank da Cruz wrote: >> The only thing that is clear about "plain text" is that it is not well >> defined at all. My experience is that ASCII plain text is sufficiently well defined but has been incredibly badly implemented, due in part to the requirement in the 1960s and 1970s for keeping programs as small as possible, and in part to the rarity of cross-platform file transfer until the 1990s. The original definition, as John Cowan has pointed out, was anything a Teletype could reliably render, including overstrikes. Thinking of ASCII as printer commands rather than text makes it easier to understand the origins of its problems. 
(I have used printing terminals and video terminals that permitted overstrikes, designed for APL in particular and for what you will in general. Overstriking used to be taught in typing textbooks for creating signs like cent, c BS /. The problems we have with ASCII plain text come mainly from a small set of common variant practices. Using CR, LF, or CR/LF as a line or paragraph end Different tab spacings Optional line wrap Formfeed codes vs. computed page breaks BS = DEL or BS-overstrike In the past, editors on one platform, or written for one purpose, ignored all other practices. I use two text editors, Alpha for Macintosh and Notespad (note extra 's') for Windows, which can handle all of these variations according to my preferences, including the ability to read and write text files with Mac, Windows, or Unix line break codes. Notespad even maintains an extensible list of file types where line breaking is never to be changed by the editor (mostly programming language source code). Alpha asks whether to wrap paragraphs when opening files. >Actually, it tends to be well-defined for each platform. And then the >interchange methods among platforms tend to converge on a few simple >conventions: ASCII (or the appropriate ISO character set, or now UTF-8 or >other form of Unicode), as opposed to EBCDIC (or Baudot, or Sixbit); CRLFs >separating lines, and paragraphs separated by blank lines. Somewhat less >well defined, but nevertheless in common use, are bare Carriage Return or >Backspace for overstriking, Formfeed for "new page", and Tab for tabbing >(with several different conventions about tabstops). That is, we agree on everything except our variant usages. >Lines are terminated at somewhere between 72 and 80 characters by >convention, because that's how wide terminal screens are, and before them >the Teletype carriage, and before that the most common kind of punchcard. 
>Or for that matter, typewriters and sheets of paper (A4 or US, take your >pick :-) > >To this day, we follow these conventions in newsgroups and email, although >now it might be more a matter of "netiquette" than necessity (as in the >BITNET days, when e-mail was, quite literally, 80-column card images). As long as e-mail readers cannot correctly reformat messages with bad line breaks (like this), it will be a matter of real necessity. >These simple conventions let us format our text exactly the way we want to. >We can indent or not, we can put line breaks where we want them, we can have >columns of numbers or other tabular presentations, mathematical expressions, which actually require several hundred non-ASCII characters, unless you mean, as so many do, arithmetic expressions. >and idiosyncratic forms of emphasis. Many people want their text to stay >the way they wrote it. And many people also are not fond of receiving email >in every kind of bizarre format than any application developer can dream up >when it contains, in fact, nothing but words (but I stray). When I want my text to stay as I wrote it, I put it into a PDF, not a text file. Others prefer TeX for this purpose, or PostScript. >> I think the Unix community should slowly get used to the idea of >> abandoning LFs in the middle of paragraphs in plain text documents and >> let the editor and display tool perform the reformatting at display >> time. >> >But what IS plain text? Maybe some people might like to have their email >reformatted, but I don't think they want their C or Fortran or PostScript >programs to receive the same treatment. Nor, for that matter poetry or any >other forms of text where line breaks, indentation, and blank lines serve a >purpose. As in, for example, the preceding paragraph. Yes, it's that old Devil cross-cultural ignorance again. It wouldn't surprise me if some people here had never even read a Fortran program. >No more plain-text bashing! No more "legacy" saying! 
Our focus should be
>not on stamping out plain text, but on promoting international multilingual
>communication through a universal character set that does not impose a
>particular modus vivendi upon its users.
>
>- Frank

We raised the question of defining a Unicode plain text format about two years ago, but nothing seemed to come of it. We also discussed the possibility of actually *using* Unicode text in this discussion, but nothing came of that either.

Does anyone else here feel excessively constrained by our lack of glyphs for the characters we talk about? Would anyone else like to get UTF-8-capable mailers and extensive sets of Unicode fonts and see what effect they have on our deliberations?

I have made the suggestion before, but here goes again--Alis Technologies offers a 30-day free trial period of its Tango Browser with Tango E-mail, downloadable from http://www.alis.com/internet_products/try_form.html. It runs on Windows 95, 98, and NT. Would anyone care to try it with me?

--
Edward Cherlin
President
Coalition Against Unsolicited Commercial E-mail
Help outlaw Spam. Talk to us at

2-Jul-99 16:04:55-GMT,11158;000000000001
Return-Path:
Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id MAA17085; Fri, 2 Jul 1999 12:02:27 -0400 (EDT)
Date: Fri, 2 Jul 99 12:02:27 EDT
From: Frank da Cruz
To: Edward Cherlin
Subject: Re: Plain Text
In-Reply-To: Your message of Fri, 2 Jul 1999 00:07:54 -0700 (PDT)
Cc: unicode@unicode.org
Message-ID:

> The problems we have with ASCII plain text come mainly from a small set of
> common variant practices.
>
> Using CR, LF, or CR/LF as a line or paragraph end
> Different tab spacings
> Optional line wrap
> Formfeed codes vs. computed page breaks
> BS = DEL or BS-overstrike
>
We all have dealt with these annoyances throughout our careers. They are indeed annoying, but not impassable impediments.

Also, let's not mix up: . File storage format . Interchange format .
Data entry format

> Using CR, LF, or CR/LF as a line or paragraph end
>
As a line end: This is a file storage issue. As a paragraph end: There is no such thing as a paragraph end or paragraph separator in traditional plain text.

Here I am sitting at my VT100 terminal, which is plugged in to my UNIX computer. I type:

  This is a line

Then I push the Return key (sometimes marked Enter), which sends a Carriage Return. I would enter a line in exactly the same way no matter what computer was on the far end of the wire. Now:

 . The UNIX terminal driver turns the CR into a LF before giving it to the application. If the application is storing the line into a file, the file gets "This is a line<LF>". Ditto for some other operating systems, like AOS/VS.

 . If I had OS-9 on the far end, it would store "This is a line<CR>".

 . If I had TOPS-10, TOPS-20, RT-11, etc, on the far end, it would store "This is a line<CR><LF>".

 . If I had VMS, VOS, VM/CMS, MVS/TSO or other complex file system on the far end, who knows how the line would be stored -- it depends on the chosen file organization and record format.

The point is, it doesn't matter. Each platform has its own format for internal use, but a standardized interface to the outside world. To further demonstrate this fact, if I then tell the computer on the far end to "type" or "cat" the file, it will, invariably, send:

  This is a line<CR><LF>

So who cares what the file format is -- except of course when we want to transfer the file to another platform. In that case, it is the responsibility of each file-transfer agent to convert between its peculiar local format and the common one. And that is exactly what they do, just as is done at the terminal/terminal-driver/data-entry level. FTP and Kermit are two examples that show it is not that hard to convert plain-text file record formats from one platform to another. (And in Kermit's case, the character set too.)
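
The conversion described above can be sketched roughly as follows. This is only a minimal illustration in Python, not how FTP or Kermit are actually implemented: the function names and the platform table are hypothetical, CRLF is assumed as the common interchange form, and record-oriented file systems (VMS, VM/CMS, etc.) are left out entirely.

```python
# Hypothetical helpers for illustration; not taken from FTP, Kermit,
# or any other real file-transfer program.

# Line terminator each platform stores in its local text files.
LOCAL_TERMINATORS = {
    "unix": "\n",      # LF
    "os9": "\r",       # CR
    "tops20": "\r\n",  # CRLF
}

CANONICAL = "\r\n"  # assumed common interchange form: CRLF


def to_canonical(text: str, platform: str) -> str:
    """Convert text in a platform's local form to the interchange form."""
    terminator = LOCAL_TERMINATORS[platform]
    if terminator == CANONICAL:
        return text  # already in interchange form
    return text.replace(terminator, CANONICAL)


def from_canonical(text: str, platform: str) -> str:
    """Convert text in the interchange form to a platform's local form."""
    return text.replace(CANONICAL, LOCAL_TERMINATORS[platform])


if __name__ == "__main__":
    stored_on_unix = "This is a line\n"
    on_the_wire = to_canonical(stored_on_unix, "unix")
    print(repr(on_the_wire))
    print(repr(from_canonical(on_the_wire, "os9")))
```

Each side only needs to know its own local form and the common one, so no platform has to know about any other platform's conventions.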
Of course life would have been simpler if there had been only ONE standard text-file format used on all platforms. But the early days of computing were a time of "Let the Hundred Flowers Bloom", and they did. Now, however, we are in a position to start over, and it is an opportunity we are not likely to have again.

> Different tab spacings
>
I used to say this too, but the last platform I know about that did not assume tabstops at 1,9,17,25,... was MULTICS. Of course tabs are variable in word processors, etc, but that is not plain text.

> Optional line wrap
>
This is a feature of the terminal or the application, not of "plain text". A file that does not contain line breaks and must rely on some form of postprocessing to insert line breaks at appropriate points is not really plain text; it is "input for a text formatter". Prior to the advent of word processors, the idea of "long line as paragraph" never came up.

> Formfeed codes vs. computed page breaks
>
Page breaks are an issue worth discussing, and we discussed them at some length two years ago. Basically, you can let your "rendering engine" or printer driver insert them for you, or you can insert them yourself. One should be allowed the choice. (Why would anybody want "hard" page breaks? Because they are printing paychecks, invoices, envelopes, etc.)

> BS = DEL or BS-overstrike
>
This is a data entry issue, unless you mean including BS in a file for overstriking. But in that case, there is never any confusion between BS and DEL, since DEL is never used for that purpose. In other words, the only confusion is at data entry, and this is entirely irrelevant to the definition of plain text.

> >Lines are terminated at somewhere between 72 and 80 characters by
> >convention, because that's how wide terminal screens are, and before them
> >the Teletype carriage, and before that the most common kind of punchcard.
> >Or for that matter, typewriters and sheets of paper (A4 or US, take your > >pick :-) > > > >To this day, we follow these conventions in newsgroups and email, although > >now it might be more a matter of "netiquette" than necessity (as in the > >BITNET days, when e-mail was, quite literally, 80-column card images). > > As long as e-mail readers cannot correctly reformat messages with bad > line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this), it will be a matter of real necessity. > What does "correctly reformat messages" mean? How can your mail client read my mind? How does it know that the message I sent you was not already formatted exactly the way I wanted it? Notice that to illustrate my point, I need your original formatting (above) preserved, with the "> " quote indicators added at the left margin, and with my emphasis added under the appropriate words. What is a "correct" mail client supposed to do with this? Something like this?: > As long as e-mail readers cannot correctly reformat messages with bad > line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this), it will be a matter of real necessity. No, a correct email client will leave it alone. Whether I want my email reformatted by your client should be my choice, since only I know what my intentions are in sending it. Granted, plain text requires some minimal level of agreement, for example that your screen is 72 (or 76, or 79) columns wide. I maintain that this convention is universal, except for Kanji, etc, which are displayed in two character cells each. People who use email, netnews, and other forms of open, interplatform communication have learned these conventions. We use them ourselves on this mailing list. Those of us who do not are often excoriated for our antisocial behavior. Especially when we send email or netnews in some application-specific format, assuming that everybody else uses the same platform and applications we do. 
> >These simple conventions let us format our text exactly the way we want > >to. We can indent or not, we can put line breaks where we want them, we > >can have columns of numbers or other tabular presentations, mathematical > >expressions, > > which actually require several hundred non-ASCII characters, unless you > mean, as so many do, arithmetic expressions. > Yes, that's what I meant, thanks. (All of us here recognize the shortcomings of ASCII -- that's why we're here! But let's not forget that ASCII can be used to write, say, Fortran programs that can handle far more in the way of mathematics than the repertoire of ASCII might suggest, and that people send Fortran-like expressions back and forth in email, etc, which could easily lose their meaning when reformatted.) > When I want my text to stay as I wrote it, I put it into a PDF, not a text > file. Others prefer TeX for this purpose, or PostScript. > My point exactly. And how do I read your PDF if I don't have a PDF reader? (Don't say "get one" -- I'm reading your mail on a DOS PC or a PDP-11, or a Cray supercomputer.) How do I read TeX if I don't have the software? How do I read PostScript if I don't have a PostScript printer or rendering engine. But the crucial point is: How will I read your PDF file 200 years from now, when PDF itself has been consigned to the "legacy" trashheap for the past 195 years? > We raised the question of defining a Unicode plain text format about two > years ago, but nothing seemed to come of it. > Then let's try again. Let me get the ball rolling with the following simple suggestion for Unicode Plain-Text File and Interchange Format: A monospaced character-cell display device is assumed for the purposes of line breaking. Characters that are too wide for a character cell (such as Kanjis) occupy a double-width cell. 
Of course, Unicode Plain Text can also be displayed on any other kind of device, in any font, monospaced or not, in which case "all bets are off", just as they are now with traditional plain text when displayed in a proportional font.

Conversely, it is recognized that a monospaced (or duospaced) character-cell device might be inadequate for display of certain writing systems, such as Arabic or Indic scripts, and in this case intelligent rendering engines might very well be required. This should, nevertheless, be possible with plain text, without the aid of any particular markup scheme.

Plain text is composed only of Unicode characters, with no meta-level of formatting information, presentation hints, etc, except:

 1. Spaces, such as U+0020 and U+00A0, which are "kept" (e.g. adjacent spaces are not collapsed).

 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab stops shall be assumed every 8 columns, starting at the first. (This provision is primarily to facilitate conversion of ASCII and 8-bit text to Unicode. Alternatively, it would be OK to force all horizontal alignment to be accomplished by spaces.)

 3. Line breaks are indicated by Line Separator, U+2028. Preformatted text must break lines at column 79 or less to avoid unwanted reformatting. Column numbers are 1-based, relative to the left or right margin, according to the prevailing directionality, with single-width characters as the counting unit. A line break is required at the end of the final line if it is to be considered a line. (This is to allow append operations to work in the expected fashion.)

 4. Paragraph breaks are indicated by two successive Line Separators or by Paragraph Separator, U+2029.

 5. Hard page breaks are indicated by FF, U+000C.

C0 and C1 control characters other than HT and FF have no function whatsoever in Unicode Plain Text.
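
To make the tab and line-break provisions above concrete, here is a small sketch in Python of how a minimal viewer might read such text. It is an illustration under stated assumptions, not part of the proposal: the function names are hypothetical, every character is counted as one column (double-width and combining characters are ignored), and FF and the column-79 limit are not handled.

```python
# Hypothetical sketch of a reader for the proposed format; the names
# below are illustrative and do not come from any standard or library.

LS = "\u2028"  # Line Separator
PS = "\u2029"  # Paragraph Separator
HT = "\t"      # Horizontal Tab, U+0009
TAB_WIDTH = 8  # tab stops every 8 columns: 1, 9, 17, ... (1-based)


def expand_tabs(line: str, tab_width: int = TAB_WIDTH) -> str:
    """Replace each HT with spaces up to the next tab stop.

    Assumes every character occupies exactly one column.
    """
    out = []
    col = 0  # number of columns already filled (0-based position)
    for ch in line:
        if ch == HT:
            pad = tab_width - (col % tab_width)
            out.append(" " * pad)
            col += pad
        else:
            out.append(ch)
            col += 1
    return "".join(out)


def parse(text: str) -> list:
    """Split text into paragraphs (at PS), each a list of lines (at LS)."""
    paragraphs = []
    for para in text.split(PS):
        lines = para.split(LS)
        # A trailing LS terminates the final line rather than
        # starting a new, empty one.
        if lines and lines[-1] == "":
            lines.pop()
        paragraphs.append([expand_tabs(line) for line in lines])
    return paragraphs


if __name__ == "__main__":
    sample = "one" + LS + "a\tb" + LS + PS + "three" + LS
    for number, paragraph in enumerate(parse(sample), start=1):
        print(number, paragraph)
```

Note that adjacent spaces are never collapsed here, and nothing in the text is reflowed; the viewer only splits and pads.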
(If there were Unicode Horizontal Tab and Page Break characters, we wouldn't need C0 at all; however, the UTC -- or at least members of it, in previous discussions -- indicated that there is no good reason to duplicate the C0 characters that are already in Unicode.) A Unicode plain-text "rendering engine" shall not mess with the format of a plain-text file except, optionally, at the user's discretion, to wrap lines that are longer than the display or printing device. Higher-level rendering engines, of course, can do whatever they want. - Frank 2-Jul-99 16:32:42-GMT,2273;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA25758 for ; Fri, 2 Jul 1999 12:32:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA248914 ; Fri, 2 Jul 1999 09:27:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA21218; Fri, 2 Jul 99 09:18:02 -0700 Message-Id: <9907021618.AA21218@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8293 (1999-07-02 16:17:51 GMT) From: Frank da Cruz To: Unicode List Date: Fri, 2 Jul 1999 09:17:48 -0700 (PDT) Subject: Plain text: Amendment 1 90 seconds later... 1. Spaces, such as U+0020 and U+00A0, which are are "kept" (e.g. adjacent spaces are not collapsed). 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab stops shall be assumed every 8 columns, starting at the first. (This provision is primarily to facilitate conversion of ASCII and 8-bit text to Unicode. Alternatively, it would be OK to force all horizontal alignment to be accomplished by spaces.) 3. Line breaks are indicated by Line Separator, U+2028. Preformatted text must break lines at column 79 or less to avoid unwanted reformatting. 
Column numbers are 1-based, relative to the left or right margin, according to the previaling directionality, with single-width characters as the counting unit. A line break is required at the end of the final line if it is to be considered a line. (This is to allow append operations to work in the expected fashion.) 4. Paragraph breaks are indicated by two successive Line Separators or by Paragraph Separator, U+2029. 5. Hard page breaks are indicated by FF, U+000C. Change (4) to: 4. Paragraph breaks are indicated by Paragraph Separator, U+2029. Add to (3): A blank line is indicated by two successive Line Separators. Two blank lines are indicated by three of them, etc. This is to allow paragraphs like this one, which contain embedded "displays" set off by blank lines that are NOT paragraph separators. - Frank 2-Jul-99 17:17:52-GMT,4232;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA07783 for ; Fri, 2 Jul 1999 13:17:51 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA281172 ; Fri, 2 Jul 1999 10:08:26 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA21632; Fri, 2 Jul 99 09:58:39 -0700 Message-Id: <9907021658.AA21632@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8294 (1999-07-02 16:58:29 GMT) From: Geoffrey Waigh To: Unicode List Date: Fri, 2 Jul 1999 09:58:27 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > > Then let's try again. Let me get the ball rolling with the following simple > suggestion for Unicode Plain-Text File and Interchange Format: > > A monospaced character-cell display device is assumed for the purposes of > line breaking. 
Characters that are too wide for a character cell (such as > Kanjis) occupy a double-width cell. Of course, Unicode Plain Text can also > be displayed on any other kind of device, in any font, monospaced or not, in > which case "all bets are off", just as they are now with traditional plain > text when displayed in a proportional font. Why are you specifying font characteristics for plain text? > Conversely, it is recognized that a monospaced (or duospaced) character-cell > device might be inadequate for display of certain writing systems, such as > Arabic or Indic scripts, and in this case intelligent rendering engines > might very well be required. This should, nevertheless, be possible with > plain text, without the aid of any particular markup scheme. And then saying that you don't really need a monospace font and it is still plain text even when you have to do a proper job of rendering it? > > Plain text is composed only of Unicode characters, with no meta-level > of formatting information, presentation hints, etc, except: > > 1. Spaces, such as U+0020 and U+00A0, which are are "kept" (e.g. > adjacent spaces are not collapsed). I don't see how barring all the other spacing and presentation codes (e.g. ZWNJ) improves plain text. > > 2. Horizontal Tabs are indicated by the HT character, U+0009. Tab > stops shall be assumed every 8 columns, starting at the first. (This > provision is primarily to facilitate conversion of ASCII and 8-bit > text to Unicode. Alternatively, it would be OK to force all > horizontal alignment to be accomplished by spaces.) > > 3. Line breaks are indicated by Line Separator, U+2028. Preformatted > text must break lines at column 79 or less to avoid unwanted > reformatting. Column numbers are 1-based, relative to the left or > right margin, according to the previaling directionality, with > single-width characters as the counting unit. A line break is > required at the end of the final line if it is to be considered a > line. 
(This is to allow append operations to work in the expected > fashion.) I don't see how specifying the maximum text width is in the purview of "plain text." That is suggesting that running my terminal in 132 column mode (or printing on wide paper/with narrow fonts,) involves something special. I suspect that all the attention to cell widths, column counting and what not is to make tab processing map nicely to the character cell terminal model. That model is responsible for some horrible hacks when it migrated to other countries and I believe the difficulties in adapting software that depends on it to writing systems it does not work for has been a serious drag on more advanced Unicode implementations. > > 4. Paragraph breaks are indicated by two successive Line Separators > or by Paragraph Separator, U+2029. If we are supporting Unicode and have a notion of Paragraph it seems reasonable to specify it is denoted with U+2029. > > 5. Hard page breaks are indicated by FF, U+000C. Geoffrey 2-Jul-99 18:15:24-GMT,5607;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA24570 for ; Fri, 2 Jul 1999 14:15:23 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA270448 ; Fri, 2 Jul 1999 11:10:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22476; Fri, 2 Jul 99 10:54:55 -0700 Message-Id: <9907021754.AA22476@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8297 (1999-07-02 17:54:45 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 10:54:44 -0700 (PDT) Subject: Re: Plain Text > Why are you specifying font characteristics for plain text? > Only for purposes of getting across the idea that "long line = paragraph, break where you please" should not be considered well-formed plain text. 
Or, to look at it the other way, that plain text must allow for hard line breaks, and there should be a convention as to how long we might reasonably expect lines to be. "Columns" are the only measurement that makes sense (surely not picas, inches, millimeters, pixels, ...) and this presupposes fixed spacing.

This might be a farfetched notion except that it is completely consonant with current practice. The fact that monospaced fonts have fallen out of fashion should not cloud our judgement. Naturally they present some difficulties for multilingual text, but they also provide numerous benefits. They let me compose a text document that anybody can read in -- barring "rendering engine" interference -- the same form in which I composed it. Tables line up, columns of numbers add up, comments in my C program are aligned, etc. All this without our having to agree in advance on which rendering engine or markup language to use.

Parenthetically, look at the mess the craze for the typeset appearance has gotten us into. If I want to make a table on a Web page or in a typeset document, I have to use some kind of markup language or "table" package, rather than just spacing or tabbing the items appropriately. Which is fine until you consider that any markup language or tables package you are using today will be long forgotten a few years from now, and so your laboriously constructed document will either require conversion or be lost forever (or humans will need to read the markup language directly).

As noted, I grant that the monospace-font model does not apply equally well to all writing systems, but for the many to which it does apply -- Roman, Hebrew, Cyrillic, Armenian, Greek, Georgian, etc, and to some extent CJK since, at least in Japan, they have been using mono- and duospaced fonts on terminals and PCs for decades, and care as much about things lining up as anybody else -- should guidelines not be stated up front?

> > 1.
Spaces, such as U+0020 and U+00A0, which are "kept" (e.g.
> > adjacent spaces are not collapsed).
>
> I don't see how barring all the other spacing and presentation codes
> (e.g. ZWNJ) improves plain text.
>
They aren't barred -- they are Unicode characters that are not C0 or C1 control characters. And they aren't a higher-level markup language.

> I don't see how specifying the maximum text width is in the purview of
> "plain text." That is suggesting that running my terminal in 132 column
> mode (or printing on wide paper/with narrow fonts,) involves something
> special. I suspect that all the attention to cell widths, column
> counting and what not is to make tab processing map nicely to the
> character cell terminal model. That model is responsible for some
> horrible hacks when it migrated to other countries and I believe the
> difficulties in adapting software that depends on it to writing systems
> it does not work for has been a serious drag on more advanced Unicode
> implementations.
>
I suppose you're right about the intention. That's what the discussion is for -- to find suitable language for expressing a model for "text that is already formatted and stands on its own without additional formatting from any higher intelligence and that can be displayed by the most minimalistic plain-text viewer", like this email message.

You might be right about specifying a maximum line length. And yet, if there is to be such a thing as preformatted plain text, and none of us can deny that there already is such a thing since this is how we communicate, should there not be some form of guideline as to what is a safe default line-length, in the absence of any prior agreement to set a different one? That's what we do now, implicitly. Why not make it explicit?

So how should the guideline be expressed? Let's assume you are composing some plain text, and you don't care how it's rendered. Then don't include Line Separators and let the viewer "flow" the text.
That's fine for ordinary prose, but it assumes a viewer that knows how to flow text, and I'm not sure that a text-flowing viewer should be assumed or required. As somebody mentioned earlier, most printers will truncate long lines, as will many terminals and other display devices. If you do care how the text is rendered, include Line Separators.

> > 4. Paragraph breaks are indicated by two successive Line Separators
> > or by Paragraph Separator, U+2029.
>
> If we are supporting Unicode and have a notion of Paragraph it seems
> reasonable to specify it is denoted with U+2029.
>
Agreed and amended already.

- Frank

2-Jul-99 18:32:33-GMT,4626;000000000001
Return-Path:
Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA29877 for ; Fri, 2 Jul 1999 14:32:32 -0400 (EDT)
Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id LAA07740; Fri, 2 Jul 1999 11:33:15 -0700 (PDT)
Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id LAA03792; Fri, 2 Jul 1999 11:32:11 -0700 (PDT)
Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA03974; Fri, 2 Jul 1999 11:32:11 -0700
Date: Fri, 2 Jul 1999 11:32:11 -0700
From: kenw@sybase.com (Kenneth Whistler)
Message-Id: <9907021832.AA03974@birdie.sybase.com>
To: fdc@watsun.cc.columbia.edu
Subject: Re: Plain text: Amendment 1
Cc: unicode@unicode.org, kenw@sybase.com
X-Sun-Charset: US-ASCII

The problems I am having with Frank's suggestions boil down essentially to this:

The Unicode concept of plain text is of a text stream consisting only of Unicode characters, interpreted according to the rules of the standard, and not including (or not interpreting the inclusion of) higher-level markup, however expressed. It does not involve specification of particular font behavior (including monospacing), details of terminal interaction, or line length.
It is that concept of Unicode plain text that we intend and hope will be stable for the next century. Given the text stream itself, basic textual content should be derivable, although not necessarily any detailed layout information. The intended invariant is textual content, rather than document form including textual content. To specify invariant document form, it is clear that a higher-level protocol must be specified. And I see Frank's Unicode plain text proposal as just the bare-bottom, minimal common denominator for a document description standard. In that respect it is no different from PDF, except in complexity and faithfulness to original appearance of a document in all details. Some of the difficulty of this discussion, of course, derives from the fact that the Unicode Standard unavoidably had to contain some bare minimum of format control characters. We have had to specify format semantics for CR, LF, TAB, VT, FF because there was no way we were going to get from the past to the future without people converting existing documents using these (or carrying analogous practice into new documents); and LS and PS were added to provide a minimum, unambiguous set of format controls to organize plain text. Bidi format controls were added because they had to be: otherwise, you run into situations where intended content is inexpressible, or existing content is uninterpretable in plain text. And on the other hand, the situation is muddied by plain text markup conventions where the markup is carried around in the plain text: 9/23/98 38 widgets sold 65,416 --- 65,416 Where the "plain text" is: "NLF9/23/98NLF38 widgets soldNLF65,416NLF---NLF65,416NLF" But the plain text of the content is 5 strings: "9/23/98" "38 widgets sold" "65,416" "---" "65,416" And the full document description is, of course, not just these 5 strings, but includes the fact that they constitute a row embedded in a table, and are aligned in specified ways within the cells in that row. 
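[Editorial illustration, not part of the original message.] The distinction above between the plain-text stream and the plain text of the content can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the name content_strings is hypothetical, and the set of characters treated as an NLF (CR, LF, CRLF, NEL, LS, PS) is an assumption made for the example, not something the message specifies.

```python
import re

# Assumed set of "new line functions" (NLF). CRLF is listed first so that it
# is consumed as one unit; then CR, LF, NEL (U+0085), LS (U+2028), PS (U+2029).
NLF = re.compile("\r\n|[\r\n\u0085\u2028\u2029]")

def content_strings(text):
    """Split text on any NLF and drop the empty pieces.

    Dropping empty pieces also discards intentional blank lines, which is
    acceptable for this illustration but not for a real viewer.
    """
    return [piece for piece in NLF.split(text) if piece]

# The same five content strings, separated by a mix of NLF conventions.
sample = "\r\n9/23/98\r\n38 widgets sold\n65,416\u2028---\r65,416\n"
strings = content_strings(sample)
print(strings)
```

The five strings come back regardless of which convention separated them; the table structure and cell alignment are not recoverable from the stream, which is the point being made about document form.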
The Unicode vision is that the character encoding standard itself should be as robust and useful in its larger domain as the 7-bit ASCII standard was in its own constrained textual domain. But given the enormous complexities that are inherent in trying to deal with *all* of the writing systems of the world, it is inevitable that plain text *layout* conventions involving Unicode are going to be considerably more complex than plain text *layout* conventions involving ASCII only. At the bare minimum, for example, plain text in Unicode *must* take bidirectional layout into account--otherwise, you would be saying that you could express Unicode content in plain text, as long as you avoided Hebrew, Arabic, and Syriac characters. In some respects, the entire content of the Unicode Standard beyond just the code charts and names lists is an elaborate attempt to describe what it means to deal with plain text layout and interpretation for all of the Unicode characters. It cannot be encapsulated in the kind of constraints that Frank has suggested, in my opinion. --Ken 2-Jul-99 18:51:30-GMT,5169;000000000011 Return-Path: Received: from mail.rdc1.bc.home.com (ha1.rdc1.bc.wave.home.com [24.2.10.66]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA05270 for ; Fri, 2 Jul 1999 14:51:28 -0400 (EDT) Received: from home.com ([24.113.28.108]) by mail.rdc1.bc.home.com (InterMail v4.01.01.00 201-229-111) with ESMTP id <19990702185120.ZXVS29070.mail.rdc1.bc.home.com@home.com>; Fri, 2 Jul 1999 11:51:20 -0700 Message-ID: <377D0A96.86F53390@home.com> Date: Fri, 02 Jul 1999 11:53:10 -0700 From: Geoffrey Waigh X-Mailer: Mozilla 4.5 [en] (Win98; I) X-Accept-Language: en MIME-Version: 1.0 To: unicode@unicode.org Subject: Re: Plain Text References: Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > > > Why are you specifying font characteristics for plain text? 
> > > Only for purposes of getting across the idea that "long line = paragraph, > break where you please" should not be considered well-formed plain text. > Or, to look at it the other way, that plain text must allow for hard line > breaks, and there should be a convention as to how long we might reasonably > expect lines to be. "Columns" are the only measurement that makes sense > (surely not picas, inches, millimeters, pixels, ...) and this presupposes > fixed spacing. See below for comments on maximum line length. When considering why other measurements were inappropriate I realized it is because "preformatted" plain text has no control over font size and thus cannot do position based formatting as someone would do on a sheet of paper. The cell model allows people to position text without recourse to a markup system but at the sacrafice of which scripts can be properly rendered. It happens that many of the commercially significant languages can cope with the cell model which is part of the reason it has survived so long. Unfortunately it just helps keep the hard writing systems in the ghetto because it isn't nearly as profitable and requires dealing with many cans of worms when trying to fit them to a system that depends on implicit positioning. > The fact that monospaced fonts have fallen out of fashion should not cloud > our judgement. Naturally they present some difficulties for multilingual > text, but they also provide numerous benefits. They let me compose a text > document that anybody can read in -- barring "rendering engine" interference > -- the same form in which I composed it. Tables line up, columns of numbers > add up, comments in my C program are aligned, etc. All this without our > having to agree in advance on which rendering engine or markup language to > use. Presumably the markup language specifies the semantics well enough to be rendering engine independent - if the rendering engine is capable of displaying the text as described. 
For text that is being sent without any markup, then monospace for the bulk of the text is probably what the reader should use (at least if they believe the text to have horizontal structure.) I just don't think that it should be enforced. As for the concerns about the ephemeral nature of markup languages, hopefully we will someday reach some stability for systems that don't require a proprietary encoder, do not require extensive computer training to grok and do not have flavour of the week problems. These difficulties are not inherent in the design of markup languages but an artifact of the political and economic forces driving them. > You might be right about specifying a maximum line length. And yet, > if there is to be such a thing as preformatted plain text, and none of us > can deny that there already is such a thing since this is how we commicate, > should there not be some form of guideline as to what is a safe default > line-length, in the absence of any prior agreement to set a different one? > That's what we do now, implicitly. Why not make it explicit? So how should > the guideline be expressed? Because if it is made explicit, software writers will feel free to take such a limit as a hard one and do silly things for text that exceeds it. Right now most software will handle long lines albeit sometimes awkwardly. If someone preformats their text for 200 columns, then that is what they should get if the output device can cope. If it cannot, they need to consider why they think it has to be 200 columns. In the case of Usenet and public mailing lists people have to curtail their lines if they don't want them mangled. > Let's assume you are composing some plain text, and you don't care how it's > rendered. Then don't include Line Separators and let the viewer "flow" the > text. That's fine for ordinary prose, but it assumes a viewer that knows > how to flow text, and I'm not sure that a text-flowing viewer should be > assumed or required. 
As somebody mentioned earlier, most printers will > truncate long lines, as will many terminals and other display devices. > > If you do care how the text is rendered, include Line Separators. I agree with this. Geoffrey 2-Jul-99 20:08:21-GMT,3467;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA25355 for ; Fri, 2 Jul 1999 16:08:20 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA91712 ; Fri, 2 Jul 1999 12:56:37 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA24684; Fri, 2 Jul 99 12:47:49 -0700 Message-Id: <9907021947.AA24684@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8306 (1999-07-02 19:47:40 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 12:47:39 -0700 (PDT) Subject: Re: Plain Text OK, then perhaps the idea of "recommended maximum line length" is an unnecessary complication. Perhaps it is enough to say that Line Separator means what it says. If I put one in my text, then it means to start a new line. If I make sure that there are no more than 79 characters between line separators (or whatever else is appropriate to my writing system), I'll get the desired effect. > As for the concerns about the ephemeral nature of markup languages, > hopefully we will someday reach some stability for systems that > don't require a proprietary encoder, do not require extensive > computer training to grok and do not have flavour of the week > problems. These difficulties are not inherent in the design of > markup languages but an artifact of the political and economic > forces driving them. > Right, of course. But can we trust the market to settle on a simple standard for plain text? Of course not; there's no money in it. 
Does the market want an immutable standard for plain-text documents that can last for a century or an eon? Of course not. The market wants everything to change all the time, so everybody will have to "upgrade" constantly. That's great for business but bad for preservation of history and culture. And it shortens the productive lives of "content providers". There are ways to make money that don't require artificially induced instability. Furthermore, I would not like to think that in the Unicode world of the future, that it will not be possible to send preformatted email or netnews without the assistance of some specific markup language or embedded proprietary word-processor codes. Email has already deteriorated significantly from its original openness thanks to MIME's blessing of any kind of proprietary gewgaw any vendor wants to add to their GUI email clients. Thus a perfect application for Unicode plain text would be as a MIME type, specifically intended to proclaim and promote the adherence to a simple, universal, vendor-independent, self-contained standard. Hopefully the IETF would have the sense to see the value of a Unicode successor to RFC822. So I'd like to see a definition for plain text in the Unicode standard, that is totally independent of any external product, that allows a file or stream of Unicode text to stand on its own, for all time, and retain a minimum level of formatting, in those cases where the author of the text feels formatting is important. (In fact, all of us do, otherwise we wouldn't care so much about fonts and rendering engines and markup languages). I think email and netnews are two areas where the need for such a standard is evident. 
- Frank 2-Jul-99 20:31:40-GMT,1251;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA01309 for ; Fri, 2 Jul 1999 16:31:39 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA185114 ; Fri, 2 Jul 1999 13:26:21 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA25362; Fri, 2 Jul 99 13:17:37 -0700 Message-Id: <9907022017.AA25362@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8308 (1999-07-02 20:17:28 GMT) From: "Paul Dempsey (Exchange)" To: Unicode List Date: Fri, 2 Jul 1999 13:17:27 -0700 (PDT) Subject: RE: Plain Text This would be a fine standard. However, it doesn't have to be part of the _Unicode_ standard, and I don't think it belongs as a normative part of Unicode. As minimal as it may be, it still falls into the domain of file formats and "higher-level protocol". It's a tribute to the success of Unicode that people want to piggyback on its success to solve closely related problems. 
--- Paul 2-Jul-99 23:30:54-GMT,1326;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA29893 for ; Fri, 2 Jul 1999 19:30:53 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA194924 ; Fri, 2 Jul 1999 16:24:01 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA28898; Fri, 2 Jul 99 16:10:32 -0700 Message-Id: <9907022310.AA28898@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8321 (1999-07-02 23:10:02 GMT) From: John Cowan To: Unicode List Date: Fri, 2 Jul 1999 16:10:01 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7bit Frank da Cruz scripsit: > This is to allow paragraphs like this one, which contain embedded > "displays" set off by blank lines that are NOT paragraph separators. A great thing. It is only in plain text that I can compare 1) example A with 2) example B in a single paragraph without confusion. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 3-Jul-99 0:50:17-GMT,1114;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA08384 for ; Fri, 2 Jul 1999 20:50:16 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA339856 ; Fri, 2 Jul 1999 17:46:29 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA00424; Fri, 2 Jul 99 17:34:34 -0700 Message-Id: <9907030034.AA00424@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8325 (1999-07-03 00:34:07 GMT) From: "Christopher J. 
Fynn" To: Unicode List Date: Fri, 2 Jul 1999 17:34:05 -0700 (PDT) Subject: RE: Plain Text [**NOT**] Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id UAA08384 Edward Cherlin wrote: > I know of no device which required the user to enter a CR followed > by an LF The manual typewriter? - Chris 3-Jul-99 1:13:59-GMT,1390;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA09857 for ; Fri, 2 Jul 1999 21:13:59 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id SAA251700 ; Fri, 2 Jul 1999 18:07:10 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA00795; Fri, 2 Jul 99 17:51:32 -0700 Message-Id: <9907030051.AA00795@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8326 (1999-07-03 00:51:19 GMT) From: kenw@sybase.com (Kenneth Whistler) To: Unicode List Cc: unicode@unicode.org, kenw@sybase.com Date: Fri, 2 Jul 1999 17:51:17 -0700 (PDT) Subject: RE: Plain Text [**NOT**] Chris Fynn suggested: > > Edward Cherlin wrote: > > > I know of no device which required the user to enter a CR followed > > by an LF > > The manual typewriter? Hehe, not even that, since when you pull the "carriage return lever" to return the carriage to the left margin, the ratchet setting (for single space or double space) automatically feeds the line (or lines) on the platen to the ratchet stop before the lever locks and allows you to drag the carriage back. So nice try, but CRLF was already mechanically automated decades ago. 
--Ken > > - Chris > 3-Jul-99 3:06:18-GMT,1372;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA20716 for ; Fri, 2 Jul 1999 23:06:17 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA266332 ; Fri, 2 Jul 1999 20:01:53 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01966; Fri, 2 Jul 99 19:48:33 -0700 Message-Id: <9907030248.AA01966@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain X-Uml-Sequence: 8330 (1999-07-03 02:48:22 GMT) From: "Hohberger, Clive P." To: Unicode List Date: Fri, 2 Jul 1999 19:48:20 -0700 (PDT) Subject: RE: Plain Text [**NOT**] The Teletypes did, up through at least the KSR 33 and ASR 35. That's why CR and LF were made part of the control character set... along with a lot of other Teletype commands (SI, SO, HT, etc) Clive > -----Original Message----- > From: Christopher J. Fynn [SMTP:cfynn@dircon.co.uk] > Sent: Friday, July 02, 1999 7:34 PM > To: Unicode List > Subject: RE: Plain Text [**NOT**] > > Edward Cherlin wrote: > > > I know of no device which required the user to enter a CR followed > > by an LF > > The manual typewriter? 
> > - Chris 3-Jul-99 3:41:00-GMT,1740;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA24627 for ; Fri, 2 Jul 1999 23:40:59 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA268562 ; Fri, 2 Jul 1999 20:33:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02522; Fri, 2 Jul 99 20:21:14 -0700 Message-Id: <9907030321.AA02522@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8332 (1999-07-03 03:20:58 GMT) From: Edward Cherlin To: Unicode List Date: Fri, 2 Jul 1999 20:20:57 -0700 (PDT) Subject: Re: Plain Text At 11:45 -0700 7/2/1999, Geoffrey Waigh wrote: >Frank da Cruz wrote: >> >> > Why are you specifying font characteristics for plain text? >> > >> Only for purposes of getting across the idea that "long line = paragraph, >> break where you please" should not be considered well-formed plain text. >> Or, to look at it the other way, that plain text must allow for hard line >> breaks, and there should be a convention as to how long we might reasonably >> expect lines to be. [much snippage] There cannot be an enforceable line length limit on plain text. One of the uses of plain text is for database interchange, where any number of fields of any length, plus separators, may constitute a line. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 3-Jul-99 3:46:36-GMT,22091;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA25014 for ; Fri, 2 Jul 1999 23:46:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA186618 ; Fri, 2 Jul 1999 20:35:16 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02526; Fri, 2 Jul 99 20:21:17 -0700 Message-Id: <9907030321.AA02526@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8333 (1999-07-03 03:21:01 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Fri, 2 Jul 1999 20:20:59 -0700 (PDT) Subject: Re: Plain Text At 08:58 -0700 7/2/1999, Frank da Cruz wrote: [failing to mention that Ed Cherlin wrote:] >> The problems we have with ASCII plain text come mainly from a small set of >> common variant practices. >> >> Using CR, LF, or CR/LF as a line or paragraph end >> Different tab spacings >> Optional line wrap >> Formfeed codes vs. computed page breaks >> BS = DEL or BS-overstrike >> >We all have dealt with these annoyances throughout our careers. They are >indeed annoying, but not impassible impediments. Also, let's not mix up: > > . File storage format > . Interchange format > . Data entry format . Rendering options On looking through the remainder of this message, I conclude that I disagree with Frank's attempts to make his own limited experience normative, but I heartily agree that his proposal for a bottom-level plain text Unicode format is on the right track, and that it allows us to deal with some of the issues listed above as file format issues, specifically line and paragraph ends and other control codes. Tab stops, wrapping, and page breaking must be left to the user's choice when rendering, since they are not file format issues. 
>> Using CR, LF, or CR/LF as a line or paragraph end >> >As a line end: > This is a file storage issue. > >As a paragraph end: > There is no such thing as a paragraph end or paragraph separator in > traditional plain text. > >Here I am sitting at my VT100 terminal, which is plugged in to my UNIX >computer. Here *I* am, sitting at my Mac, and recalling what I have been doing on an NT system and Silicon Graphics Indy and O2 computers running Irix for the last year and a half, when I was shuttling files back and forth between them. (The Indy is used as an embedded controller in a 750 kg laser microscope system for semiconductor wafer inspection, and the O2 to run the microscope software without the hardware for demos and simulations, none of which matters to this discussion.) >I type: > > This is a line > >Then I push the Return key (sometimes marked Enter), which sends a Carriage >Return. Whereas my VT100 simulator used to get its CR from the keyboard buffer, where it was deposited after the keyboard driver translated from the keyboard scan codes. Anyway, input technology is not at issue here. >I would enter a line in exactly the same way no matter what >computer was on the far end of the wire. Now: > > . The UNIX terminal driver turns the CR into a LF before giving it > to the application. If the application is storing the line into a > file, the file gets "This is a line". Ditto for some other > operating systems, like AOS/VS. > > . If I had OS-9 on the far end, it would store "This is a line". ^or Mac OS > . If I had TOPS-10, TOPS-20, RT-11, etc, on the far end, it would > store "This is a line". > > . If I had VMS, VOS, VM/CMS, MVS/TSO or other complex file system on > the far end, who knows how the line would be stored -- it depends on > chosen the file organization and record format. > >The point is, it doesn't matter. Each platform has its own format for >internal use, but a standardized interface to the outside world. 
To further >demonstrate this fact, if I then tell the computer on the far end to "type" >or "cat" the file, it will, invariably, send: > > This is a line Your cultural ignorance/sheltered life-experience is showing. *You* may live in an environment where these changes are made automatically, but a lot of us don't. >So who cares what the file format is -- except of course when we want to >transfer the file to another platform. And since I don't use a VT100 simulator anymore, I only encounter this issue when transfering files to another platform, and as a result I care all the time. >In that case, it is the >responsibility of each file-transfer agent When reading floppy disks? >to convert between its peculiar >local format and the common one. And that is exactly what they do, just >as is done at the terminal/terminal-driver/data-entry level. FTP and Kermit >are two examples that show it is not that hard to convert plain-text file >record formats from one platform to another. (And in Kermit's case, the >character set too.) > >Of course life would have been simpler if there had been only ONE standard >text-file format used on all platforms. But the early days of computing >was a time of "Let the Hundred Flowers Bloom", and they did. Now, however, >we are in a position to start over, and it is an opportunity we are not >likely to have again. Yes, yes, everything *could* have been made to work, except for the parts that couldn't, you see, because management wouldn't allow the extra time and space required to make things portable, or worse still, was trying to lock customers into proprietary data formats. >> Different tab spacings >> >I used to say this too, but the last platform I know about that did not >assume tabstops at 1,9,17,25,... was MULTICS. Of course tabs are variable >in word processors, etc, but that is not plain text. Your limited experience again. I have rarely used an editor with fixed tab stops since about 1982 (EDLIN, IIRC). 
I once knew the escape sequences for IBM, Diablo, and Qume *printing* terminal tab settings by heart. >> Optional line wrap >> >This is a feature of the terminal or the application, not of "plain text". This is a feature found in ASCII *files* which were written either with or without explicit line breaks, requiring a choice for appropriate rendering--a choice which the editor should be able to make, but which the user should actually make. >Files that do not contain line breaks and must rely on some form of >postprocessing to insert line breaks at appropriate points is not really >plain text, it is "input for a text formatter". But the text editor is frequently the chosen text reformatter. You are still claiming that text files as they occur in your computer subculture are for some reason normative for the rest of us. >Prior to the advent of >word processors, the idea of "long line as paragraph" never came up. Word processing began in the 1960s. I gather you had a later date in mind. Did you mean specifically WYSIWYG word processors, invented at Xerox in the late 1970s? >> Formfeed codes vs. computed page breaks >> >Page breaks are an issue worth discussing, and we discussed them at some >length two years ago. Basically, you can let your "rendering engine" or >printer driver insert them for you, or you can insert them yourself. One >should be allowed the choice. (Why would anybody want "hard" page breaks? >Because they are printing paychecks, invoices, envelopes, etc.) If we can establish that general principle and apply it to the previous cases, the problem will be solved in short order. The application determines the requirements for tab stops, page breaks, and paragraph or line formatting. >> BS = DEL or BS-overstrike >> >This is a data entry issue, unless you mean including BS in a file for >overstriking. But in that case, there is never any confusion between BS and >DEL, since DEL is never used for that purpose. 
In other words, the only >confusion is at data entry, and this is entirely irrelevant to the >definition of plain text. > >> >Lines are terminated at somewhere between 72 and 80 characters by >> >convention, because that's how wide terminal screens are, and before them >> >the Teletype carriage, and before that the most common kind of punchcard. >> >Or for that matter, typewriters and sheets of paper (A4 or US, take your >> >pick :-) >> > >> >To this day, we follow these conventions in newsgroups and email, although >> >now it might be more a matter of "netiquette" than necessity (as in the >> >BITNET days, when e-mail was, quite literally, 80-column card images). >> >> As long as e-mail readers cannot correctly reformat messages with bad >> line breaks ^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> (like this), it will be a matter of real necessity. >> >What does "correctly reformat messages" mean? How can your mail client read >my mind? How does it know that the message I sent you was not already >formatted exactly the way I wanted it? I mean that it should have the ability to reformat such badly broken text, to use when I decide. Right now I have to reformat such text by hand, or leave it severely broken. Well, maybe I should learn Perl, but I prefer that someone else learn Perl and write the routines I and many others need. If any reader is interested, the spec is as follows. 1) Reflow paragraphs, removing extra white space, while preserving quoting marks '>' in the left margin. Don't get confused by angle brackets in the text. 2) Realign tables with "tab damage". Tables that are too wide should be broken into pages, rather than having lines folded. If you can manage those two, you're good, and I have some more little jobs for you. E-mail users will be eternally grateful (for a week or two, anyway, on Net time). 
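[Editorial illustration, not part of the original message.] Item 1 of the spec above (reflow paragraphs while preserving the '>' quoting marks in the left margin) can be sketched in Python rather than Perl. This is a rough sketch under stated assumptions, not a finished mail filter: the names split_quote and reflow are hypothetical, quote depth is inferred only from '>' marks at the very start of a line, a change in depth or a blank line is taken to end a paragraph, and item 2 (table realignment) is not attempted.

```python
import re
import textwrap

# Leading run of '>' quote marks, optionally spaced, e.g. ">> " or "> > ".
# Only marks at the start of the line count, so angle brackets inside the
# text are left alone.
QUOTE = re.compile(r"^((?:>\s?)+)")

def split_quote(line):
    """Return (quote_depth, text) for one line with its newline removed."""
    m = QUOTE.match(line)
    if not m:
        return 0, line.strip()
    prefix = m.group(1)
    return prefix.count(">"), line[len(prefix):].strip()

def reflow(lines, width=72):
    """Reflow runs of lines that share a quote depth; blank lines end a run."""
    out, words, depth = [], [], None

    def flush():
        # Emit the collected words as one wrapped paragraph at the current depth.
        if words:
            prefix = ">" * depth + (" " if depth else "")
            out.extend(textwrap.wrap(" ".join(words), width=width,
                                     initial_indent=prefix,
                                     subsequent_indent=prefix,
                                     break_long_words=False,
                                     break_on_hyphens=False))
            words.clear()

    for line in lines:
        d, text = split_quote(line)
        if not text or d != depth:
            flush()
            depth = d
        if text:
            words.extend(text.split())   # also collapses extra white space
        else:
            out.append(">" * d)          # keep the blank (possibly quoted) line
    flush()
    return out

broken = [
    ">> As long as e-mail readers cannot correctly",
    ">> reformat messages with bad",
    ">> line breaks",
    ">> (like this),",
    ">> it will be a matter of real necessity.",
]
fixed = reflow(broken, width=72)
print("\n".join(fixed))
```

A line of body text that itself begins with '>' would be misread as quoted; telling those cases apart needs more context than this sketch uses. Whether to run such a reflow at all is left to the reader of the message, which is the position argued above.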
>Notice that to illustrate my point, I need your original formatting (above) >preserved, with the "> " quote indicators added at the left margin, and with >my emphasis added under the appropriate words. What is a "correct" mail >client supposed to do with this? Something like this?: > > > As long as e-mail readers cannot correctly > reformat messages with bad > line breaks > ^^^^^^^^^^^^^^^^^^^^^^^^^^^ > (like this), > it will be a matter of real necessity. > >No, a correct email client will leave it alone. Whether I want my email >reformatted by your client should be my choice, since only I know what my >intentions are in sending it. ^^^^^^^^^ However, it actually is the recipient's choice, and you can't stop us. The "correct" reformatting I had in mind would look like this. >> As long as e-mail readers cannot correctly reformat messages with bad >> line breaks (like this), it will be a matter of real necessity. or possibly >>As long as e-mail readers cannot correctly reformat messages with bad >>line breaks (like this), it will be a matter of real necessity. (**my choice**) >Granted, plain text requires some minimal level of agreement, for example >that your screen is 72 (or 76, or 79) columns wide. I maintain that this >convention is universal, except for Kanji, etc, which are displayed in two >character cells each. People who use email, netnews, and other forms of >open, interplatform communication have learned these conventions. We use >them ourselves on this mailing list. Those of us who do not are often >excoriated for our antisocial behavior. Universal, of course, except where it isn't, you know. No matter where we set the right margin, text quoted from e-mails will break against it if it can't be reflowed. >Especially when we send email or netnews in some application-specific >format, assuming that everybody else uses the same platform and applications >we do. > >> >These simple conventions let us format our text exactly the way we want >> >to. 
We can indent or not, we can put line breaks where we want them, we >> >can have columns of numbers or other tabular presentations, mathematical >> >expressions, >> >> which actually require several hundred non-ASCII characters, unless you >> mean, as so many do, arithmetic expressions. >> >Yes, that's what I meant, thanks. (All of us here recognize the >shortcomings of ASCII -- that's why we're here! But let's not forget that >ASCII can be used to write, say, Fortran programs that can handle far more >in the way of mathematics than the repertoire of ASCII might suggest, and >that people send Fortran-like expressions back and forth in email, etc, >which could easily lose their meaning when reformatted.) How do you express a vector inner product in FORTRAN? In TeX it's something like $\Sigma_(i=0)^n a_i \times b_i$, and in APL it's nearly "A+.xB", but with a real times symbol. >> When I want my text to stay as I wrote it, I put it into a PDF, not a text >> file. Others prefer TeX for this purpose, or PostScript. >> >My point exactly. No, your point was that ASCII text files stay formatted the way you write them. That would be true, I suppose, if we agreed with you that we could outlaw differences in tab stops, line breaking, and other options on different platforms, because your subworld is normative and there aren't any variant practices worthy of consideration. >And how do I read your PDF if I don't have a PDF reader? >(Don't say "get one" -- I'm reading your mail on a DOS PC or a PDP-11, or a >Cray supercomputer.) Yes, we had the same problem with SGI Irix 5.2, which doesn't support a PDF reader. But the field engineers have Windows on their laptops, so it's only a problem for the user manual, not the service manual, and only becomes vitally important in paperless fabs. >How do I read TeX if I don't have the software? How >do I read PostScript if I don't have a PostScript printer or rendering >engine. 
But the crucial point is: > > How will I read your PDF file 200 years from now, when > PDF itself has been consigned to the "legacy" trashheap > for the past 195 years? along with ASCII, 8859, and 2022, and all of our removable storage media. Do you know someone with a functioning Teletype paper tape reader who can read legacy ASCII files from 1970? What would you suggest I archive my life's work on for the ages to come (if anyone cares)? >> We raised the question of defining a Unicode plain text format about two >> years ago, but nothing seemed to come of it. >Then let's try again. Let me get the ball rolling with the following simple >suggestion for Unicode Plain-Text File and Interchange Format: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The following discusses a file format and a number of rendering options, but fails to address interchange. UTF-8 is usually recommended for interchange, since it avoids the Endianness question, but transfer of files in other encodings will occur, and must be provided for. The file format must define permitted character codes and code sequences. I suggest that we permit any character code that can represent a character, even if no character is defined for that code, but that we not permit unmatched surrogate characters or codes which are defined not to have the possibility of representing a character. Error behavior for the rendering process when there are illegal codes or code sequences can be undefined, or we could specify error messages and continuation policies. The display rendering process does not change the file, so any display options such as word wrap, tab stops, character width, ligatures, combining characters, and so on are orthogonal to the file format. The user can change the text and save in the new form, but the software isn't allowed to on its own. Rendering behavior of control codes and other non-printing characters must be defined. 
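As a rough illustration of the permitted-code rule above, here is a minimal Python sketch that scans a sequence of 16-bit code units and reports the two kinds of illegal codes mentioned: unmatched surrogates and codes that can never represent a character. The function name is hypothetical, and treating U+FFFE and U+FFFF as the "never a character" codes is an assumption made for the example, not something the proposal spells out.

```python
def find_illegal_code_units(units):
    """Return (index, reason) pairs for illegal 16-bit code units.

    `units` is a sequence of integers in the range 0x0000..0xFFFF.
    """
    problems = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: legal only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2  # well-formed pair, skip both units
                continue
            problems.append((i, "unmatched high surrogate"))
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            problems.append((i, "unmatched low surrogate"))
        elif u in (0xFFFE, 0xFFFF):
            # Assumed set of codes that can never represent a character.
            problems.append((i, "noncharacter"))
        i += 1
    return problems
```

An unassigned but otherwise ordinary code passes this check, which matches the suggestion to permit any code that could represent a character even if none is defined for it yet. What a renderer does with the reported positions (error message, substitute glyph, continue) is left open here, as it is in the text.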
>A monospaced character-cell display device is assumed for the purposes of >line breaking. Characters that are too wide for a character cell (such as >Kanjis) occupy a double-width cell. Users may choose to display all characters in cells of the same width, or to mix single- and double-cell display. Note that this is not the same as half-width and full-width CJK characters, which have been defined as separate characters. >Of course, Unicode Plain Text can also >be displayed on any other kind of device, in any font, monospaced or not, in >which case "all bets are off", just as they are now with traditional plain >text when displayed in a proportional font. Specifically, we will permit rendering in ATSUI on the Mac, in Java, on NT2K, in Plan 9, and on other platforms, all with whatever level of Unicode rendering and fonts happen to be available, and we will specify what should happen for missing characters, lack of BIDI capability, lack of ligatures, etc. >Conversely, it is recognized that a monospaced (or duospaced) character-cell >device might be inadequate for display of certain writing systems, such as >Arabic or Indic scripts, and in this case intelligent rendering engines >might very well be required. For some purposes a monospaced LTR rendering of these characters may be useful, and is permitted as a user option and as a fallback. >This should, nevertheless, be possible with >plain text, without the aid of any particular markup scheme. But with the use of Unicode markup characters, such as explicit ordering and joining characters. >Plain text is composed only of Unicode characters, ^printing ^including surrogate character pairs, >with no meta-level >of formatting information, presentation hints, etc, except: > > 1. Spaces, such as U+0020 and U+00A0, which are "kept" (e.g. > adjacent spaces are not collapsed). including spaces defined at code points U+2000-U+200B. > 2. Horizontal Tabs are indicated by the HT character, U+0009. 
Tab > stops shall be assumed every 8 columns, starting at the first. (This > provision is primarily to facilitate conversion of ASCII and 8-bit > text to Unicode. Alternatively, it would be OK to force all > horizontal alignment to be accomplished by spaces.) As on a typewriter, we have no control of the user's tab stop settings. I recommend that we legislate alignment of monospaced text using spaces only, and forget HT. That's what I have taught people to do for tabular e-mail such as resumes. > 3. Line breaks are indicated by Line Separator, U+2028. Preformatted > text must break lines at column 79 or less to avoid unwanted > reformatting. At present software is free to truncate long lines, wrap at the last column, or word wrap. I would recommend that we forbid truncation and allow the user to choose wrapping style. >Column numbers are 1-based, relative to the left or > right margin, according to the prevailing directionality, with > single-width characters as the counting unit. A line break is > required at the end of the final line if it is to be considered a > line. (This is to allow append operations to work in the expected > fashion.) > > 4. Paragraph breaks are indicated by two successive Line Separators legacy, deprecated in new software > or by Paragraph Separator, U+2029. > > 5. Hard page breaks are indicated by FF, U+000C. 6. BIDI modifiers: U+200E, LEFT-TO-RIGHT MARK; U+200F, RIGHT-TO-LEFT MARK 7. Joining modifiers: U+200C, ZERO-WIDTH NON-JOINER; U+200D, ZERO-WIDTH JOINER 8. Combining characters: numerous accents; vowels in Hebrew, Arabic, Indic scripts, etc. 9. U+FEFF ZERO-WIDTH NO-BREAK SPACE=BYTE ORDER MARK should be the first character in a Unicode text file in 16-bit encoding (is that UTF-16? I can't keep them all straight.) BOM is not required in UTF-8 encoding. Non-normative comment: >C0 and C1 control characters other than HT and FF have no function >whatsoever in Unicode Plain Text. 
(If there were Unicode Horizontal Tab and >Page Break characters, we wouldn't need C0 at all; however, the UTC -- or at >least members of it, in previous discussions -- indicated that there is no >good reason to duplicate the C0 characters that are already in Unicode.) End comment. >A Unicode plain-text "rendering engine" shall not mess with the format of a \\\\\\\\\change >plain-text file except, optionally, at the user's discretion, to wrap lines \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\. It may on the display >that are longer than the display or printing device. Higher-level rendering ^line length >engines, of course, can do whatever they want. And plain text can contain any markup for such engines using Unicode characters that is defined for a specific use, such as HTML, TeX source code, RTF, etc. >- Frank Ed The following non-printing characters may occur in the file, but will be treated as unavailable characters. U+206A INHIBIT SYMMETRIC SWAPPING U+206B ACTIVATE SYMMETRIC SWAPPING U+206C INHIBIT ARABIC SHAPING U+206D ACTIVATE ARABIC SHAPING U+206E NATIONAL DIGIT SHAPES U+206F NOMINAL DIGIT SHAPES Unicode Standard 2.0 describes them as "Alternate format characters (usage strongly discouraged)" Behavior for unavailable characters should be defined. Options include a single glyph for any unavailable character, glyphs indicating the code block of unavailable characters, and numeric rendering. Behavior for non-printing characters with no semantic significance in plain text should be defined. Should they be treated as unavailable characters, or as though they aren't there? A growing number of standards specify the use of Unicode text files, without explicitly defining them. If we get anywhere with this, we will have to run our proposal past these other groups, including the IETF, the POSIX committee, programming language standards committees, etc. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 3-Jul-99 11:14:03-GMT,1857;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id HAA14406 for ; Sat, 3 Jul 1999 07:14:03 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id EAA276564 ; Sat, 3 Jul 1999 04:11:00 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06070; Sat, 3 Jul 99 03:53:40 -0700 Message-Id: <9907031053.AA06070@unicode.org> Errors-To: uni-bounce@unicode.org Content-Type: text/plain X-Uml-Sequence: 8345 (1999-07-03 10:53:24 GMT) From: dickey@clark.net To: Unicode List Cc: unicode@unicode.org Date: Sat, 3 Jul 1999 03:53:23 -0700 (PDT) Subject: Re: Plain Text > > At 08:58 -0700 7/2/1999, Frank da Cruz wrote: > [failing to mention that Ed Cherlin wrote:] > >> The problems we have with ASCII plain text come mainly from a small set of > >> common variant practices. > >> > >> Using CR, LF, or CR/LF as a line or paragraph end > >> Different tab spacings > >> Optional line wrap > >> Formfeed codes vs. computed page breaks > >> BS = DEL or BS-overstrike > >> > >We all have dealt with these annoyances throughout our careers. They are > >indeed annoying, but not impassible impediments. Also, let's not mix up: > > > > . File storage format > > . Interchange format > > . Data entry format > . Rendering options > > On looking through the remainder of this message, I conclude that I > disagree with Frank's attempts to make his own limited experience Perhaps you should introduce yourself - I know who Frank is, and the other contributors to this list at least give the impression of being polite and knowledgable. -- Thomas E. 
Dickey dickey@clark.net http://www.clark.net/pub/dickey 3-Jul-99 22:53:44-GMT,2716;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA19468 for ; Sat, 3 Jul 1999 18:53:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA321514 ; Sat, 3 Jul 1999 15:49:25 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08130; Sat, 3 Jul 99 15:36:24 -0700 Message-Id: <9907032236.AA08130@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8350 (1999-07-03 22:36:12 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Sat, 3 Jul 1999 15:36:11 -0700 (PDT) Subject: Re: Plain Text At 03:56 -0700 7/3/1999, dickey@clark.net wrote: >> >> At 08:58 -0700 7/2/1999, Frank da Cruz wrote: >> [failing to mention that Ed Cherlin wrote:] >> >> The problems we have with ASCII plain text come mainly from a small >>set of >> >> common variant practices. >> >> >> >> Using CR, LF, or CR/LF as a line or paragraph end >> >> Different tab spacings >> >> Optional line wrap >> >> Formfeed codes vs. computed page breaks >> >> BS = DEL or BS-overstrike >> >> >> >We all have dealt with these annoyances throughout our careers. They are >> >indeed annoying, but not impassible impediments. Also, let's not mix up: >> > >> > . File storage format >> > . Interchange format >> > . Data entry format >> . Rendering options >> >> On looking through the remainder of this message, I conclude that I >> disagree with Frank's attempts to make his own limited experience > >Perhaps you should introduce yourself - I know who Frank is, and the other >contributors to this list at least give the impression of being polite >and knowledgable. > >-- >Thomas E. 
Dickey >dickey@clark.net >http://www.clark.net/pub/dickey Well, in no particular order, I am Edward Cherlin Spam fighter Participant in standards processes for APL, I18N, Unicode Experience in production of documents including APL, math, music, Chinese, Korean, Japanese, Greek, Russian, Hebrew, Yiddish Author and publisher of The Worldwide Impact of the Unicode Character Set Standard, 1994. BA Honors Math & Philosophy Yale 1967 Buddhist priest Author of The New Newbie Pages at http://www.newbie.net Member of this list for several years. I was part of the discussion with Frank about a Unicode text standard two years ago. Ed Cherlin, President, CAUCE "Everything should be made as simple as possible, __but no simpler__." Attributed to Albert Einstein 3-Jul-99 23:14:02-GMT,1568;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA20747 for ; Sat, 3 Jul 1999 19:14:02 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA323778 ; Sat, 3 Jul 1999 16:09:39 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08393; Sat, 3 Jul 99 16:00:26 -0700 Message-Id: <9907032300.AA08393@unicode.org> Errors-To: uni-bounce@unicode.org Content-Type: text/plain X-Uml-Sequence: 8351 (1999-07-03 23:00:16 GMT) From: dickey@clark.net To: Unicode List Cc: unicode@unicode.org Date: Sat, 3 Jul 1999 16:00:15 -0700 (PDT) Subject: Re: Plain Text > Well, in no particular order, I am > > Edward Cherlin > Spam fighter > Participant in standards processes for APL, I18N, Unicode > Experience in production of documents including APL, math, music, Chinese, > Korean, Japanese, Greek, Russian, Hebrew, Yiddish > Author and publisher of The Worldwide Impact of the Unicode Character Set > Standard, 1994. 
> BA Honors Math & Philosophy Yale 1967 > Buddhist priest > Author of The New Newbie Pages at http://www.newbie.net > Member of this list for several years. so? (I don't see any clue for berating Frank about "limited experience", except possibly your implied age ~55 -- for the rest, I don't see anything that matters much) -- Thomas E. Dickey dickey@clark.net http://www.clark.net/pub/dickey 4-Jul-99 9:31:30-GMT,3005;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA18134 for ; Sun, 4 Jul 1999 05:31:29 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA258396 ; Sun, 4 Jul 1999 02:24:16 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09766; Sun, 4 Jul 99 02:16:33 -0700 Message-Id: <9907040916.AA09766@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8355 (1999-07-04 09:16:18 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 02:16:17 -0700 (PDT) Subject: Re: Plain Text At 16:00 -0700 7/3/1999, dickey@clark.net wrote: [failing to note that Ed Cherlin wrote in reply to his request for identification] >> Well, in no particular order, I am >> >> Edward Cherlin >> Spam fighter >> Participant in standards processes for APL, I18N, Unicode >> Experience in production of documents including APL, math, music, Chinese, >> Korean, Japanese, Greek, Russian, Hebrew, Yiddish >> Author and publisher of The Worldwide Impact of the Unicode Character Set >> Standard, 1994. >> BA Honors Math & Philosophy Yale 1967 >> Buddhist priest >> Author of The New Newbie Pages at http://www.newbie.net >> Member of this list for several years. 
[and also omitting Ed's statement about having been in a similar discussion with Frank on this list two years ago, about creating a Unicode text format standard.] > >so? (I don't see any clue for berating Frank about "limited experience", Are you berating me? You didn't ask me for "clues for berating Frank", just who I am. Do you mean that my experience is irrelevant in discussing his experience? Frank? Am I being mean to you? Is my criticism too harsh? If so, I apologize. What did you think about my suggestions for the Unicode text standard? >except possibly your implied age ~55 -- for the rest, I don't see anything >that matters much) Frank's "limited experience" is not youth but insularity. He cites practices current on UNIX systems as though they applied universally. I have used UNIX, DOS, Windows, CP/M, Apple ][, IBM mainframes via timesharing, and several other kinds of computers, dealing with character set problems well outside Frank's range of experience. I forgot to mention that I instigated and managed a software development project for a highly portable APL that came out in English, French, German, Finnish, Russian, and Japanese, on a variety of computer architectures. >-- >Thomas E. Dickey >dickey@clark.net >http://www.clark.net/pub/dickey -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 4-Jul-99 11:25:19-GMT,3107;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id HAA04675 for ; Sun, 4 Jul 1999 07:25:19 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id EAA204984 ; Sun, 4 Jul 1999 04:19:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10393; Sun, 4 Jul 99 04:04:37 -0700 Message-Id: <9907041104.AA10393@unicode.org> Errors-To: uni-bounce@unicode.org Content-Type: text/plain X-Uml-Sequence: 8357 (1999-07-04 11:04:25 GMT) From: dickey@clark.net To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 04:04:23 -0700 (PDT) Subject: Re: Plain Text [omitted previous discussion] > [and also omitting Ed's statement about having been in a similar discussion > with Frank on this list two years ago, about creating a Unicode text format > standard.] [this appeared redundant, except as a note that you had been introduced to Frank] > > > >so? (I don't see any clue for berating Frank about "limited experience", > > Are you berating me? You didn't ask me for "clues for berating Frank", just hmm (though the nearest dictionary does not convey this, my sense of 'berating' is related to the repetition of the "limited experience". There are indeed degrees here - but then we can argue about shades of meaning. > who I am. Do you mean that my experience is irrelevant in discussing his > experience? It doesn't make a good argument - and most of your listeners stop at that point. (If you wish to be convincing, leave that out and point out the places where his posting leaves out information - and _why_ that is more important than what he's presenting). > Frank? Am I being mean to you? Is my criticism too harsh? If so, I > apologize. What did you think about my suggestions for the Unicode text > standard? I have a hunch that Frank is home for the weekend. 
> >except possibly your implied age ~55 -- for the rest, I don't see anything > >that matters much) > > Frank's "limited experience" is not youth but insularity. He cites > practices current on UNIX systems as though they applied universally. > > I have used UNIX, DOS, Windows, CP/M, Apple ][, IBM mainframes via > timesharing, and several other kinds of computers, dealing with character > set problems well outside Frank's range of experience. I forgot to mention I wouldn't be surprised if many people on this list have also used a variety of systems (otherwise they'd not be reading this list ;-). > that I instigated and managed a software development project for a highly > portable APL that came out in English, French, German, Finnish, Russian, > and Japanese, on a variety of computer architectures. I suppose so - but APL itself has little to do with the natural language aspect (perhaps you managed the message library - that would be relevant to your statement). -- Thomas E. Dickey dickey@clark.net http://www.clark.net/pub/dickey 4-Jul-99 16:49:34-GMT,1507;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA12217 for ; Sun, 4 Jul 1999 12:49:33 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA272994 ; Sun, 4 Jul 1999 09:42:06 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA12324; Sun, 4 Jul 99 09:27:16 -0700 Message-Id: <9907041627.AA12324@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8364 (1999-07-04 16:27:00 GMT) From: Curtis Clark To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 09:26:58 -0700 (PDT) Subject: Dickey vs. 
Cherlin, was Re: Plain Text I haven't been on this list long (I've found it interesting and useful), and I don't claim any qualifications at all; but I wonder, are these sorts of exchanges common? I can understand that Unicode could generate some strident differences of opinion, but I sense that I'm missing something here. ---------------------------------------------------------------- Curtis Clark http://www.csupomona.edu/~jcclark/ Biological Sciences Department Voice: (909) 869-4062 California State Polytechnic University FAX: (909) 869-4078 Pomona CA 91768-4032 USA jcclark@csupomona.edu 4-Jul-99 16:51:03-GMT,1926;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA12504 for ; Sun, 4 Jul 1999 12:51:03 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA263944 ; Sun, 4 Jul 1999 09:45:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA12328; Sun, 4 Jul 99 09:27:17 -0700 Message-Id: <9907041627.AA12328@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8365 (1999-07-04 16:27:01 GMT) From: Curtis Clark To: Unicode List Date: Sun, 4 Jul 1999 09:26:59 -0700 (PDT) Subject: Re: dotless j At 07:40 AM 7/4/99 -0700, Jeroen Hellingman wrote: > The semantics of both i and j >should be that >they loose their dots if you put an accent on top of them, so there never >should be a problem. I'm puzzled by this: 1. Precomposed accented characters, I have read, are included in support of legacy character sets; the ideal is to use a combining accent with a non-accented character. 2. There are issues with combining accents needing to account for the height of the base letter, dots, as well, no doubt, as ascenders and descenders. These are semantic issues, which should be handled by the software. 3. 
Unicode, it is said, is a plain text standard. (2) and (3) seem to be at odds, unless programs that display plain text become a lot more sophisticated. ---------------------------------------------------------------- Curtis Clark http://www.csupomona.edu/~jcclark/ Biological Sciences Department Voice: (909) 869-4062 California State Polytechnic University FAX: (909) 869-4078 Pomona CA 91768-4032 USA jcclark@csupomona.edu 4-Jul-99 17:41:35-GMT,2050;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA24093 for ; Sun, 4 Jul 1999 13:41:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA196170 ; Sun, 4 Jul 1999 10:37:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA13836; Sun, 4 Jul 99 10:24:07 -0700 Message-Id: <9907041724.AA13836@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8370 (1999-07-04 17:23:59 GMT) From: John Cowan To: Unicode List Date: Sun, 4 Jul 1999 10:23:57 -0700 (PDT) Subject: Re: dotless j Content-Transfer-Encoding: 7bit Curtis Clark scripsit: > 1. Precomposed accented characters, I have read, are included in support of > legacy character sets; the ideal is to use a combining accent with a > non-accented character. Just so. > 2. There are issues with combining accents needing to account for the > height of the base letter, dots, as well, no doubt, as ascenders and > descenders. These are semantic issues, which should be handled by the > software. I don't know what you mean by "semantic". They are *rendering* issues, which must be handled by displaying-and-printing software. Much other software doesn't care a bit. For example, you can write Java code with comments and identifier names in Yoruba, using combining characters as needed. > 3. 
Unicode, it is said, is a plain text standard. So it is. > (2) and (3) seem to be at odds, unless programs that display plain text > become a lot more sophisticated. So they must, if they are to handle all of Unicode: BIDI, conjoining Hangul jamo, etc. etc. This is the escape from your dilemma. -- John Cowan cowan@ccil.org I am a member of a civilization. --David Brin 4-Jul-99 18:00:36-GMT,1450;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA27757 for ; Sun, 4 Jul 1999 14:00:36 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA321522 ; Sun, 4 Jul 1999 10:54:52 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14163; Sun, 4 Jul 99 10:40:44 -0700 Message-Id: <9907041740.AA14163@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 8371 (1999-07-04 17:40:36 GMT) From: Roozbeh Pournader To: Unicode List Cc: Unicode List Date: Sun, 4 Jul 1999 10:40:34 -0700 (PDT) Subject: Re: dotless j On Sun, 4 Jul 1999, Curtis Clark wrote: > 3. Unicode, it is said, is a plain text standard. > > (2) and (3) seem to be at odds, unless programs that display plain text > become a lot more sophisticated. Yes! Don't consider simple scripts like Latin only. If one likes to have plain text Arabic, what should he do? He needs sophisticated software to do that. Unicode is there for all scripts. When it sees that some processing is needed for scripts like Arabic or Devanagari, it allows some processing for scripts like Latin, to solve ambiguities etc. 
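To make the precomposed-versus-combining point in this thread concrete, here is a small Python sketch using the standard `unicodedata` module. It only shows that the two spellings of an accented letter are different code sequences that software can still treat as the same text without rendering anything; the helper name `same_text` is hypothetical, and comparing under NFC normalization is one possible policy, chosen here for illustration.

```python
import unicodedata

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)
combining = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT (two code points)


def same_text(a, b):
    """Compare two strings after composing them to a common form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# The raw sequences differ, but they compare equal once normalized.
raw_equal = precomposed == combining
normalized_equal = same_text(precomposed, combining)

# Decomposing the precomposed letter yields the combining sequence.
decomposed = unicodedata.normalize("NFD", precomposed)
```

Where the accent is actually drawn relative to the base letter, including whether an i or j loses its dot, is not visible at this level at all; that part stays with the displaying-and-printing software.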
--Roozbeh 4-Jul-99 18:19:02-GMT,8795;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA01657 for ; Sun, 4 Jul 1999 14:19:01 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA187394 ; Sun, 4 Jul 1999 11:11:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14272; Sun, 4 Jul 99 10:51:50 -0700 Message-Id: <9907041751.AA14272@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8372 (1999-07-04 17:51:37 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 10:51:36 -0700 (PDT) Subject: Re: Plain Text > I conclude that I disagree with Frank's attempts to make his own limited > experience normative... > I'm not sure why my experience has become an issue in this discussion but I can assure you I have a fair amount. My first programming experience was with plugboards and little wires on IBM EAM equipment. My current project, now in its 18th year, is precisely the interchange of text among divergent platforms, with full conversion of both record format and character set. I have written software to do this that has run, at one time or another, on more than 700 different hardware-and-OS platforms, many long dead, and at present on more than 150. This project, which I manage, also produces (and/or collects and distributes, and supports) similar software written by other people both here and abroad, and the entire collection spans practically every computer and operating system that has existed over the past 25-30 years with just a few exceptions. Part of the project is the definition of a protocol for meaningful text transfer. The protocol requires conversion of local formats and character sets to standard ones when sending, and the reverse procedure when receiving. 
Only international standard character sets are used on the wire, and are tagged using standard ISO-registered identifiers. This protocol has been in production for more than 10 years and is used in many parts of the world, especially Eastern and Western Europe, Israel, Greece, the former USSR, Japan, and the Americas. One of the key questions in designing and implementing such a protocol is "what is a text file?" What distinguishes it from a non-text, or "binary" file? Constant day-to-day experience with a worldwide user base helps me to form what I hope is an adequate grasp of the issues. > >The point is, it doesn't matter. Each platform has its own format for > >internal use, but a standardized interface to the outside world. To > >further demonstrate this fact, if I then tell the computer on the far > >end to "type" or "cat" the file, it will, invariably, send: > > > > This is a line > > Your cultural ignorance/sheltered life-experience is showing. *You* may > live in an environment where these changes are made automatically, but a > lot of us don't. > Then please give counterexamples. > >So who cares what the file format is -- except of course when we want to > >transfer the file to another platform. > > And since I don't use a VT100 simulator anymore, I only encounter this > issue when transferring files to another platform, and as a result I care > all the time. > > >In that case, it is the > >responsibility of each file-transfer agent > > When reading floppy disks? > Of course. One of the biggest problems facing any of us who wishes to live in a world of computing diversity is the failure of file system designers to develop a rational method for tagging files, and indeed, for developing standard interchange formats. That's what we're trying to do here. Consider a minimal platform like DOS. You can set up your DOS system to load different code pages, such as CP850 for West European languages, CP866 for Cyrillic, and so on. 
Then you can use standard DOS utilities to create and edit text files in many languages (but only one per file). However, no record is kept of the encoding (character set) of each file. This presents rather significant problems even when we stay on the PC, before we ever think about interchanging files. So at minimum, a text file should be tagged according to character set. To my knowledge, this has never been done at the file-system level. What about file type and record format? Data interchange can be done in various ways. One way involves cooperating agents at each end -- e.g. FTP client and server. They can use their own application-specific protocol to control the process. For example, one can say "I'm DOS" and the other "I'm UNIX" and then apply the appropriate conversions. Of course as platforms multiply, we have an n x n problem. Therefore we settle upon standard formats to be used on the wire. Each transfer partner converts to and from these standard formats. Moving files by magnetic media present numerous problems, but only because we have forgotten how to do it. Back in the 1970s, ANSI developed standards for data interchange by magnetic media (e.g. ANSI X3.26-1978) that worked perfectly well until the personal computer revolution came along and standards went out of style. A DOS (or Macintosh or IRIX or any other) diskette is simply not intended for export to other platforms. This is the kind of situation we would like to avoid in the future. Hence this discussion. > You are still claiming that text files as they occur in your computer > subculture are for some reason normative for the rest of us. 
> Actually I am attempting to achieve agreement on a precise definition of Unicode plain text that allows the text to be already formatted, one that gives us the same capability that we have always had with ASCII (and Latin-x etc) of encoding and presenting information without *requiring* the use of any higher intelligence beyond what is needed to interpret Space, LS, PS, HT, and FF characters, plus whatever else is needed to accommodate bidi, etc. > >Prior to the advent of > >word processors, the idea of "long line as paragraph" never came up. > > Word processing began in the 1960s. I gather you had a later date in mind. > Did you mean specifically WYSIWYG word processors, invented at Xerox in the > late 1970s? > And, before it, NLS, used at government research institutes in the 1960s. But again, that's not plain text. It's "input for a text formatter". It does not stand on its own. > >No, a correct email client will leave it alone. Whether I want my email > >reformatted by your client should be my choice, since only I know what my > >intentions are in sending it. ^^^^^^^^^ > > However, it actually is the recipient's choice, and you can't stop us. > This sounds like quibbling but it's an important point. If I have the capability to compose and format a plain-text message exactly as I want you to see it, the mail system should allow me to mark it as "preformatted plain text" and then you would have to go out of your way to reformat it. Whereas if my mail client sends long lines with no formatting, it should mark it as "plain text to be flowed". Email issues, especially MIME, are a whole new topic, and a controversial one, best avoided here. But a clear statement from the Unicode Consortium on plain text that addresses the issue of formatting might motivate the "email community" to deal with these issues in a productive way. > A growing number of standards specify the use of Unicode text files, > without explicitly defining them. 
If we get anywhere with this, we will > have to run our proposal past these other groups, including the IETF, the > POSIX committee, programming language standards committees, etc. > Good. Let's try to keep making progress. We all have an intuitive grasp of the meaning of preformatted plain text. You'll find it in many places: . READ.ME files on your software disks. . Program source code. . Traditional (not "legacy") email and netnews. . Voluminous full-text information already online. and so on. We should find a way to carry this notion forward for Unicode in a way that: . Avoids the pitfalls of platform-dependent formatting conventions. . Allows straightforward and unambiguous conversion of 8-bit data to Unicode (and, to the extent possible, vice-versa). . Is independent of any higher-level protocol, markup language, product, or even standard. In other words, the Unicode definition should stand entirely on its own so that files encoded (or transmitted) in this format will be universally understood for years, decades, centuries to come, no matter what else might change, as long as Unicode itself lives on. 
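The goal of unambiguous conversion from 8-bit data to Unicode runs straight into the tagging problem raised earlier in the thread: the same bytes mean different things under different code pages. The following is only a sketch in Python, using the CP850 and CP866 code pages mentioned in the thread; the function name convert_8bit is hypothetical, and the caller is assumed to know the source character set from somewhere outside the file.

```python
def convert_8bit(raw: bytes, charset: str) -> str:
    """Decode untagged 8-bit data using a charset the caller must supply."""
    return raw.decode(charset)


# Three bytes from a hypothetical untagged DOS text file.
raw = bytes([0x80, 0x81, 0x82])

as_cp850 = convert_8bit(raw, "cp850")  # West European reading
as_cp866 = convert_8bit(raw, "cp866")  # Cyrillic reading

# Same bytes, two different Unicode strings: without a tag, the
# conversion cannot be unambiguous.
print(as_cp850, as_cp866)
```

Once the charset is known, the conversion itself is mechanical and reversible; the hard part is only the missing label.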
- Frank 4-Jul-99 18:24:28-GMT,2544;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA02663 for ; Sun, 4 Jul 1999 14:24:28 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA266440 ; Sun, 4 Jul 1999 11:17:45 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14818; Sun, 4 Jul 99 11:04:45 -0700 Message-Id: <9907041804.AA14818@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8376 (1999-07-04 18:04:20 GMT) From: Markus Kuhn To: Unicode List Date: Sun, 4 Jul 1999 11:04:16 -0700 (PDT) Subject: Re: Frank and Plain Text Edward Cherlin wrote on 1999-07-04 09:16 UTC: > Frank's "limited experience" is not youth but insularity. He cites > practices current on UNIX systems as though they applied universally. > > I have used UNIX, DOS, Windows, CP/M, Apple ][, IBM mainframes via > timesharing, and several other kinds of computers, dealing with character > set problems well outside Frank's range of experience. Just for the record, let me quickly introduce your discussion partners: Frank da Cruz, to whom you attributed "limited experience" in the field of inter-platform plaintext exchange, is the author of KERMIT. KERMIT is a widely ported classic terminal emulator with built-in file transmission software. It is most likely available on *all* the platforms that you have ever used, and as the implementor of KERMIT's text-file transmission mechanism, Frank certainly had to worry about the plain text file conventions used on all these systems. He is probably one of the most qualified experts on matters related to the emulation of historic data-entry terminals and inter-platform plain-text format conventions. 
(In case you have never used or heard about KERMIT, please draw the appropriate conclusions regarding the scope of your own experience.) Thomas Dickey is the maintainer of xterm, probably the currently most widely used VT100 terminal emulator on this planet, and the application that primarily has to process all plaintext on Unix workstations in the end. (If you have never heard of xterm, same conclusion.) Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 4-Jul-99 19:05:41-GMT,1529;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA12778 for ; Sun, 4 Jul 1999 15:05:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA190184 ; Sun, 4 Jul 1999 11:57:31 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA16486; Sun, 4 Jul 99 11:44:25 -0700 Message-Id: <9907041844.AA16486@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8380 (1999-07-04 18:44:09 GMT) From: John Cowan To: Unicode List Date: Sun, 4 Jul 1999 11:44:08 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz scripsit: > One of the key questions in designing and implementing such a protocol is > "what is a text file?" Indeed. The GNU utilities go to great lengths to process all 256 bytes even in purely text utilities, but none of them (except specific conversion programs) handle multibyte text. > So at minimum, a text file should be tagged according to character set. To > my knowledge, this has never been done at the file-system level. Either that, or there needs to be only one character set! :-) -- John Cowan cowan@ccil.org I am a member of a civilization. 
--David Brin 4-Jul-99 20:08:35-GMT,2740;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA26524 for ; Sun, 4 Jul 1999 16:08:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA255332 ; Sun, 4 Jul 1999 13:01:26 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA17563; Sun, 4 Jul 99 12:45:51 -0700 Message-Id: <9907041945.AA17563@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8386 (1999-07-04 19:45:25 GMT) From: "Paul Dempsey (Exchange)" To: Unicode List Date: Sun, 4 Jul 1999 12:45:20 -0700 (PDT) Subject: RE: Plain Text > > Frank da Cruz: > > So at minimum, a text file should be tagged according to character set. To > > my knowledge, this has never been done at the file-system level. > John Cowan: > Either that, or there needs to be only one character set! :-) We'll have to deal with multiple untagged codepages/encodings/charsets for a long time yet. It's unlikely we'll get file systems to carry any meta-information beyond the filename in any portable way and certainly not retroactively. What we CAN do is use encoding signatures for all Unicode files. The various forms of Unicode are still relatively new and we still have a chance to establish the conventions. The Unicode standard lists signatures for _some_ Unicode encodings, in section 13.6 Specials, Encoding Form Signature: UCS-2(UTF-16) FE FF UCS-4 00 00 FE FF However, this is incomplete. The most important thing we're missing from the standard is: UTF-8 EF BB BF These are all the ZERO WIDTH NO BREAK SPACE (a.k.a BYTE ORDER MARK) in the corresponding representation. Without a signature for UTF-8, you can't reliably assume you're working with UTF-8 and not some other MBCS. 
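As a rough illustration of how a reader of such files might use these signatures, here is a minimal sketch in Python. It covers only the three byte sequences listed above (the big-endian forms plus the proposed UTF-8 one); the names SIGNATURES and sniff_signature are made up for this example, and byte-swapped variants are deliberately left out.

```python
# Signatures as listed in the message above. Longer signatures are
# checked first so that a 4-byte form is never mistaken for a shorter one.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UCS-4"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UCS-2/UTF-16"),
]


def sniff_signature(data: bytes):
    """Return (label, signature_length), or (None, 0) if no signature is found."""
    for signature, label in SIGNATURES:
        if data.startswith(signature):
            return label, len(signature)
    return None, 0


# Hypothetical file contents: a UTF-8 signature followed by ASCII text.
label, skip = sniff_signature(b"\xef\xbb\xbfhello")
print(label, skip)
```

A file with no signature comes back as (None, 0), which is exactly the untagged case the thread is worried about: the caller still has to guess or be told the encoding.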
A number of Microsoft programs (Notepad, Visual Studio, richedit) are using this signature for UTF-8. For the rest of what constitutes "plain text", the Unicode standard covers most of the issues, but not explicitly in one place. The grayer part of this discussion is about what constitutes "preformatted plain text". I don't think this can be standardized to practical effect. That is, you could write a standard, but would anyone use it? This quickly gets into the domain of presentation and document structure, which is beyond the scope of the Unicode standard proper. It is still worthwhile to capture the common conventions and make recommendations. --- Paul Chase Dempsey Microsoft Visual Studio Text Editor Development 4-Jul-99 20:46:36-GMT,2397;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA05825 for ; Sun, 4 Jul 1999 16:46:36 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA12582 ; Sun, 4 Jul 1999 13:41:38 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA19319; Sun, 4 Jul 99 13:33:13 -0700 Message-Id: <9907042033.AA19319@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8391 (1999-07-04 20:33:04 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 13:33:03 -0700 (PDT) Subject: RE: Plain Text > We'll have to deal with multiple untagged codepages/encodings/charsets > for a long time yet. It's unlikely we'll get file systems to carry any > meta-information beyond the filename in any portable way and certainly > not retroactively. > And I most emphatically recommend against using filenames for this purpose for at least the following reasons: . Different platforms have different filename formats and restrictions as to what can be in a filename, how long it can be, etc. . There is no central registry for filename associations. 
Horrible confusion arises when different software vendors choose the same association for two different products or, worse, when files are transferred across platforms that have different associations. > For the rest of what constitutes "plain text", the Unicode standard > covers most of the issues, but not explicitly in one place. The grayer > part of this discussion is about what constitutes "preformatted plain > text". I don't think this can be standardized to practical effect. That > is, you could write a standard, but would anyone use it? > Those who needed a guaranteed way to record preformatted plain text in documents that can persist over long periods of time and across all applications and platforms would use it. Even now, there exists such a standard, albeit unwritten, for 8-bit text. For example, almost every word processor and web browser has a "Save as" option for "plain text with line breaks" which, in the general case, is the only reliable interchange format. What will be the Unicode equivalent? - Frank 4-Jul-99 20:58:13-GMT,2175;000000000001 Return-Path: Received: from dfssl.exchange.microsoft.com (dfssl.exchange.microsoft.com [131.107.88.59]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA07800 for ; Sun, 4 Jul 1999 16:58:13 -0400 (EDT) Received: by dfssl with Internet Mail Service (5.5.2648.0) id <3DSG1TJV>; Sun, 4 Jul 1999 13:57:07 -0700 Message-ID: <01D6C7224936D211BA450000F805D5380809563E@TOTO> From: "Paul Dempsey (Exchange)" To: "'Frank da Cruz'" , "Paul Dempsey (Exchange)" Cc: unicode@unicode.org Subject: RE: Plain Text Date: Sun, 4 Jul 1999 13:56:57 -0700 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2648.0) Content-Type: text/plain; charset="iso-8859-1" > > Paul Chase Dempsey: > > We'll have to deal with multiple untagged codepages/encodings/charsets > > for a long time yet. It's unlikely we'll get file systems to carry any > > meta-information beyond the filename in any portable way and certainly > > not retroactively. 
> > Frank da Cruz > I most emphatically recommend against using filenames for this purpose .. I emphatically agree. I meant to say that the name is the only information you can expect a file system to maintain apart from the data in the file. I did not mean to imply that the name should be used to encode any other information. If I did, I would have proposed a notation. Without a reliable means to capture the encoding external to the bits in the file itself, I suggest the standardization of Unicode file signatures. These are already in common use except for UTF-8, and it's useful to extend the practice to UTF-8. ... > Frank da Cruz > Even now, there exists such a standard, albeit unwritten, for 8-bit text. > For example, almost every word processor and web browser has a "Save as" > option for "plain text with line breaks" which, in the general case, is the > only reliable interchange format. What will be the Unicode equivalent? Exactly the same, except Unicode data instead of 8-bit MBCS data. So let's write down the unwritten! Regards, --- Paul Chase Dempsey 4-Jul-99 22:58:59-GMT,2175;000000000011 Return-Path: Received: from light.dkuug.dk (55.ppp1-10.image.dk [212.54.73.247]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA04769 for ; Sun, 4 Jul 1999 18:58:57 -0400 (EDT) Received: (from keld@localhost) by light.dkuug.dk (8.9.3/8.9.3) id AAA03372; Mon, 5 Jul 1999 00:58:56 +0200 Date: Mon, 5 Jul 1999 00:58:56 +0200 From: keld@dkuug.dk To: Frank da Cruz Cc: Unicode List Subject: Re: Plain text: Amendment 1 Message-ID: <19990705005856.B3289@light.dkuug.dk> References: <9907021618.AA21230@unicode.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Mailer: Mutt 0.95.4us In-Reply-To: <9907021618.AA21230@unicode.org>; from Frank da Cruz on Fri, Jul 02, 1999 at 09:17:48AM -0700 On Fri, Jul 02, 1999 at 09:17:48AM -0700, Frank da Cruz wrote: > 90 seconds later... > > 3. Line breaks are indicated by Line Separator, U+2028. 
Preformatted > text must break lines at column 79 or less to avoid unwanted > reformatting. Column numbers are 1-based, relative to the left or > right margin, according to the prevailing directionality, with > single-width characters as the counting unit. A line break is > required at the end of the final line if it is to be considered a > line. (This is to allow append operations to work in the expected > fashion.) > > 4. Paragraph breaks are indicated by two successive Line Separators > or by Paragraph Separator, U+2029. > > Change (4) to: > > 4. Paragraph breaks are indicated by Paragraph Separator, U+2029. > > Add to (3): > > A blank line is indicated by two successive Line Separators. > Two blank lines are indicated by three of them, etc. > > This is to allow paragraphs like this one, which contain embedded > "displays" set off by blank lines that are NOT paragraph separators. could one not use C0 or C1 characters for these, so that the conventions could equally apply to say 8859 character sets? 3) could be something like one out of 3: 1. CR 2. LF 3. CR LF 4) could we use something like one of the C0 characters for that? Keld 4-Jul-99 23:42:12-GMT,3430;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA15758 for ; Sun, 4 Jul 1999 19:42:11 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA270558 ; Sun, 4 Jul 1999 16:32:52 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA20741; Sun, 4 Jul 99 16:13:35 -0700 Message-Id: <9907042313.AA20741@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8397 (1999-07-04 23:13:11 GMT) From: Kermit Software Support To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 16:13:04 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Keld wrote: > > Frank Wrote: > > > > 4. 
Paragraph breaks are indicated by two successive Line Separators > > or by Paragraph Separator, U+2029. > > > > Change (4) to: > > > > 4. Paragraph breaks are indicated by Paragraph Separator, U+2029. > > > > Add to (3): > > > > A blank line is indicated by two successive Line Separators. > > Two blank lines are indicated by three of them, etc. > > > > This is to allow paragraphs like this one, which contain embedded > > "displays" set off by blank lines that are NOT paragraph separators. > > could one not use C0 or C1 characters for these, so that the conventions > could equally apply to say 8859 character sets? > They could be, but I think we want to standardize on true Unicode characters whenever we can, since we have the power to define their semantics. The C0 and C1 sets are included for compatibility with existing sets over which the Unicode Consortium has no control, and over which we have been haggling the past few days ("the Mac does this, the PC does that, UNIX does something else"...) Anyway, we can't go back and change existing Latin-Alphabet or PC Code Page files to use consistent record formats -- that's an operating system and programming language issue, not to mention a conversion task that not even Hercules (or Xena) could handle. > 3) could be something like one out of 3: > > 1. CR > 2. LF > 3. CR LF > This is exactly why we should use LS rather than any of the above in Unicode text. Then converting existing 8-bit text to Unicode will have the happy by-product of erasing these differences. As noted previously, I would not object to adding two more "control characters" to Unicode to remove our dependence on C0 and C1 completely: 1. UHT "Unicode Horizontal Tab", which is just like C0 HT except that the tabstops are well-defined (should the tabbing concept be carried forward into Unicode Plain Text, rather than using only spaces). How to define them is, of course, another question. 2. UFF "Unicode Form Feed", like C0 Formfeed, except not in C0. 
I can't think of any applications for C0 Form Feed other than page feed or page eject, or the analogous action on video terminals, namely clear screen. But I'm sure that C0 FF has been misused in ways I never heard of and therefore a more clearly defined Unicode version might be warranted. However, I'm perfectly happy to stick with C0 HT and FF as long as they are given precise definitions for Unicode Plain Text, and nobody says "legacy" when referring to them :-) Whatever is chosen, let's keep it simple. - Frank 5-Jul-99 3:32:35-GMT,2165;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA18456 for ; Sun, 4 Jul 1999 23:32:34 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA268622 ; Sun, 4 Jul 1999 20:27:50 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22629; Sun, 4 Jul 99 20:11:42 -0700 Message-Id: <9907050311.AA22629@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8402 (1999-07-05 03:11:31 GMT) From: Jonathan Rosenne To: Unicode List Date: Sun, 4 Jul 1999 20:11:30 -0700 (PDT) Subject: Re: Plain Text I agree with John. The interchange standard should be UTF-8 or UTF-16. The sending and receiving systems should handle conversions. If the receiving system does not tag files, and uses just one encoding, it should convert the file as best it can. This way, the receiving system does not need to recognize a large number of character sets, only those it wishes to support. Since the meaning of CR, LF, CRLF, FF cannot be agreed, I agree additional Unicode characters look like a good solution. And again, the sending and receiving systems should handle conversions. I don't think tabs are needed. Spaces are sufficient. 
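The conversion step discussed here, where the sending or receiving system maps its own CR, LF, or CRLF convention onto Line Separator, can be sketched in a few lines of Python. This is an illustration only, not part of any proposal in the thread: the function name to_line_separator is hypothetical, and it assumes every platform newline is meant as a line break, so it makes no attempt to guess where Paragraph Separator (U+2029) would belong.

```python
import re

LS = "\u2028"  # LINE SEPARATOR

# CRLF is listed first so that it is replaced as one unit rather than
# as a CR followed by an LF.
_PLATFORM_NEWLINE = re.compile(r"\r\n|\r|\n")


def to_line_separator(text: str) -> str:
    """Replace CRLF, lone CR, and lone LF with U+2028."""
    return _PLATFORM_NEWLINE.sub(LS, text)


# The same three-line text as it might arrive from DOS, Macintosh, and UNIX.
dos = to_line_separator("one\r\ntwo\r\nthree\r\n")
mac = to_line_separator("one\rtwo\rthree\r")
unix = to_line_separator("one\ntwo\nthree\n")
print(dos == mac == unix)
```

After conversion the three platform variants are identical, which is the "happy by-product" described earlier in the thread; a blank line comes out as two successive Line Separators, in line with Amendment 1.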
Jony At 11:44 04/07/99 -0700, John Cowan wrote: >Frank da Cruz scripsit: > >> One of the key questions in designing and implementing such a protocol is >> "what is a text file?" > >Indeed. The GNU utilities go to great lengths to process all 256 bytes >even in purely text utilities, but none of them (except specific conversion >programs) handle multibyte text. > >> So at minimum, a text file should be tagged according to character set. To >> my knowledge, this has never been done at the file-system level. > >Either that, or there needs to be only one character set! >:-) > >-- >John Cowan cowan@ccil.org > I am a member of a civilization. --David Brin > 5-Jul-99 3:44:29-GMT,2096;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA19615 for ; Sun, 4 Jul 1999 23:44:29 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA188358 ; Sun, 4 Jul 1999 20:38:11 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22757; Sun, 4 Jul 99 20:26:13 -0700 Message-Id: <9907050326.AA22757@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8403 (1999-07-05 03:26:05 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Sun, 4 Jul 1999 20:26:03 -0700 (PDT) Subject: Re: Dickey vs. Cherlin, was Re: Plain Text At 09:26 -0700 7/4/1999, Curtis Clark wrote: >I haven't been on this list long (I've found it interesting and useful), >and I don't claim any qualifications at all; but I wonder, are these sorts >of exchanges common? I can understand that Unicode could generate some >strident differences of opinion, but I sense that I'm missing something >here. 
> > >---------------------------------------------------------------- >Curtis Clark http://www.csupomona.edu/~jcclark/ >Biological Sciences Department Voice: (909) 869-4062 >California State Polytechnic University FAX: (909) 869-4078 >Pomona CA 91768-4032 USA jcclark@csupomona.edu I have to say it surprises me. I wasn't trying to flame Frank, and we haven't had anyone take exception to the tone of the discussion in the several years I've been here. We do tell each other quite plainly when an opinion seems ill-founded, as in Michael's comments on my notion of encoding IPA extensions using XML. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. Talk to us at 5-Jul-99 5:59:35-GMT,2826;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id BAA01699 for ; Mon, 5 Jul 1999 01:59:34 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id WAA261196 ; Sun, 4 Jul 1999 22:53:27 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA23597; Sun, 4 Jul 99 22:43:55 -0700 Message-Id: <9907050543.AA23597@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8405 (1999-07-05 05:43:43 GMT) From: Edward Cherlin To: Unicode List Date: Sun, 4 Jul 1999 22:43:42 -0700 (PDT) Subject: Frank & Ed Evidently my diagnosis, that Frank da Cruz had insufficient experience in a cross-platform environment, was completely wrong, so I apologize for writing it. 
It puzzles me even more, then, that Frank writes in his Unicode text file proposal as if Unix practice, or more particularly his own practice (including practice in file format conversions in cross-platform data transfers), is normative, not just for other software, but for file formats on other platforms, without saying how this norm is to be implemented so that file format conversion ceases to be a problem for all applications. Also: How do we get agreement on such a standard from, e.g., Microsoft? How do we get users to stop using current methods? How do we deal with delimited database transfer files with a fixed limit on line length? How do we deal with legacy data? I find myself dealing with Unicode text created by Windows and Windows applications quite frequently now, with line ends marked in little-endian fashion as 0D 00 0A 00 What do we do about that? I entirely agree that cross-platform protocols should be defined so that we stop having conversion problems (such as translating text file formats upon transfer, as ftp does), but it can't be done within a character set standard, nor by defining a text file format without file format handling for applications on different platforms. I have had to collect or in some cases write conversion routines for text file transfer, including text files in ASCII, 8-bit character sets, and Unicode. I would much rather have the operating systems do it. If someone can explain to me how Frank's proposal will lead to that desired goal better than Frank's proposal with my suggested amendments, I'll be happy to go along. So can we discuss the issues now? -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. 
Talk to us at 5-Jul-99 9:13:48-GMT,8259;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA19730 for ; Mon, 5 Jul 1999 05:13:47 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA258406 ; Mon, 5 Jul 1999 02:01:43 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA25036; Mon, 5 Jul 99 01:46:42 -0700 Message-Id: <9907050846.AA25036@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8410 (1999-07-05 08:46:30 GMT) From: Edward Cherlin To: Unicode List Cc: unicode@unicode.org Date: Mon, 5 Jul 1999 01:46:28 -0700 (PDT) Subject: Re: Plain Text At 10:51 -0700 7/4/1999, Frank da Cruz wrote: [Ed Cherlin wrote:] >> I conclude that I disagree with Frank's attempts to make his own limited >> experience normative... [snip] I withdraw the remark, in view of other information received, and the answers to my objections which Frank has provided, like the next. [Frank] >> >So who cares what the file format is -- except of course when we want to >> >transfer the file to another platform. >> > >> >In that case, it is the >> >responsibility of each file-transfer agent >> [Ed] >> When reading floppy disks? >> [Frank] >Of course. One of the biggest problems facing any of us who wishes to live >in a world of computing diversity is the failure of file system designers to >develop a rational method for tagging files, and indeed, for developing >standard interchange formats. That's what we're trying to do here. > >Consider a minimal platform like DOS. You can set up your DOS system to >load different code pages, such as CP850 for West European languages, CP866 >for Cyrillic, and so on. Then you can use standard DOS utilities to create >and edit text files in many languages (but only one per file). 
However, no >record is kept of the encoding (character set) of each file. This presents >rather significant problems even when we stay on the PC, before we ever >think about interchanging files. > >So at minimum, a text file should be tagged according to character set. To >my knowledge, this has never been done at the file-system level. > >What about file type and record format? Data interchange can be done in >various ways. One way involves cooperating agents at each end -- e.g. FTP >client and server. They can use their own application-specific protocol >to control the process. For example, one can say "I'm DOS" and the other >"I'm UNIX" and then apply the appropriate conversions. Of course as >platforms multiply, we have an n x n problem. Therefore we settle upon >standard formats to be used on the wire. Each transfer partner converts to >and from these standard formats. > >Moving files by magnetic media present numerous problems, but only because >we have forgotten how to do it. Back in the 1970s, ANSI developed standards >for data interchange by magnetic media (e.g. ANSI X3.26-1978) that worked >perfectly well until the personal computer revolution came along and >standards went out of style. A DOS (or Macintosh or IRIX or any other) >diskette is simply not intended for export to other platforms. > >This is the kind of situation we would like to avoid in the future. Hence >this discussion. > >> You are still claiming that text files as they occur in your computer >> subculture are for some reason normative for the rest of us. 
>> >Actually I am attempting to achieve an agreement a precise definition of >Unicode plain text that allows the text to be already formatted, one that >gives us the same capability that we have always had with ASCII (and Latin-x >etc) of encoding and presenting information without *requiring* the use of >any higher intelligence beyond what is needed to interpret Space, LS, PS, >HT, and FF characters, plus whatever else is needed to accommodate bidi, >etc. [snip] [Frank] >> >Whether I want my email >> >reformatted by your client should be my choice, since only I know what my >> >intentions are in sending it. ^^^^^^^^^ >> >> However, it actually is the recipient's choice, and you can't stop us. >> >This sounds like quibbling but it's an important point. If I have the >capability to compose and format a plain-text message exactly as I want you >to see it, the mail system should allow me to mark it as "preformatted plain >text" and then you would have to go out of your way to reformat it. Whereas >if my mail client sends long lines with no formatting, it should mark it as >"plain text to be flowed". This is the key point for me. You acknowledge the need for flavors of text other than your preformatted plain text. I thought you were holding out for one flavor only. Now we can discuss the flavors, such as delimited database interchange files with lines of arbitrary length. Presumably we can define them using some of the apparatus that is becoming available in XML or as MIME data types. Would it make sense, then, to create a formal XML definition of plain text files, with a leading BOM, no interpretations for any tags, the minimum set of control characters, and the appropriate set of transformation formats? That would get around my earlier objection, about how to make an implementation available on all platforms. What about corresponding MIME types? >Email issues, especially MIME, are a whole new topic, and a controversial >one, best avoided here. 
But a clear statement from the Unicode Consortium >on plain text that addresses the issue of formatting might motivate the >"email community" to deal with these issues in a productive way. > >> A growing number of standards specify the use of Unicode text files, >> without explicitly defining them. If we get anywhere with this, we will >> have to run our proposal past these other groups, including the IETF, the >> POSIX committee, programming language standards committees, etc. >> >Good. Let's try to keep making progress. > >We all have an intuitive grasp of the meaning of preformatted plain text. >You'll find it in many places: > > . READ.ME files on your software disks. Preformatted or reflowable. > . Program source code. Preformatted. > . Traditional (not "legacy") email and netnews. There is presently no way to specify preformatted or reflowable. > . Voluminous full-text information already online. Including Unicode tables and other database interchange formats. >and so on. We should find a way to carry this notion forward for Unicode >in a way that: > > . Avoids the pitfalls of platform-dependent formatting conventions. > > . Allows straightforward and unambiguous conversion of 8-bit data to > Unicode (and, to the extent possible, vice-versa). > > . Is independent of any higher-level protocol, markup language, > product, or even standard. In other words, the Unicode definition > should stand entirely on its own so that files encoded (or transmitted) > in this format will be universally understood for years, decades, > centuries to come, no matter what else might change, as long as Unicode > itself lives on. Hear, hear. 
>- Frank To summarize your answer to my objections, we are defining a new format independent of previous conventions, in which we can specify usage of the minimal set of formatting characters regardless of usage in text files of 7-bit ASCII and 8-bit character sets of any kind, while allowing for a few variant flavors of text, such as preformatted, reflowable, and database. To which I add, that we can specify a portable implementation, too, and not have to wait for computer and OS vendors to get on board. Well, apparently there are no hard feelings from Frank over my earlier harsh words, so perhaps nobody else need be offended on his behalf. In case anybody missed it elsewhere, I apologize for misunderstanding Frank, and for giving the impression that I was attacking him personally. -- Edward Cherlin President Coalition Against Unsolicited Commercial E-mail Help outlaw Spam. Talk to us at 5-Jul-99 9:30:53-GMT,1554;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA21152 for ; Mon, 5 Jul 1999 05:30:53 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA93992 ; Mon, 5 Jul 1999 02:23:20 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA25135; Mon, 5 Jul 99 02:04:54 -0700 Message-Id: <9907050904.AA25135@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8411 (1999-07-05 09:04:46 GMT) From: Michael Everson To: Unicode List Date: Mon, 5 Jul 1999 02:04:44 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id FAA21152 Ar 10:51 -0700 1999-07-04, scrνobh Frank da Cruz: >Moving files by magnetic media present numerous problems, but only because >we have forgotten how to do it. Oh, is that the reason? 
I thought it was a Y2K thing, that on January 1 all the magnetic tapes would go "fzzzzzzzzzzst!" like in Mission Impossible. Frivolously, -- Michael Everson * Everson Gunn Teoranta * http://www.indigo.ie/egt 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement) 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire 5-Jul-99 14:53:48-GMT,1211;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA24315 for ; Mon, 5 Jul 1999 10:53:48 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA278358 ; Mon, 5 Jul 1999 07:44:56 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA29395; Mon, 5 Jul 99 07:31:40 -0700 Message-Id: <9907051431.AA29395@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8426 (1999-07-05 14:31:31 GMT) From: Peter_Constable@sil.org To: Unicode List Date: Mon, 5 Jul 1999 07:31:30 -0700 (PDT) Subject: NLF (was Frank and Ed, was Plain Text) Content-Transfer-Encoding: 7bit >I find myself dealing with Unicode text created by Windows and Windows applications quite frequently now, with line ends marked in little-endian fashion as 0D 00 0A 00 Indeed, this practice has surprised me. Chris Pratley: can you comment on why Word 97 does this rather than using PS?
Peter 5-Jul-99 15:00:09-GMT,3951;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA25531 for ; Mon, 5 Jul 1999 11:00:09 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA187530 ; Mon, 5 Jul 1999 07:46:42 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA29478; Mon, 5 Jul 99 07:33:11 -0700 Message-Id: <9907051433.AA29478@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8427 (1999-07-05 14:32:44 GMT) From: Peter_Constable@sil.org To: Unicode List Date: Mon, 5 Jul 1999 07:32:43 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit >Of course. One of the biggest problems facing any of us who wishes to live in a world of computing diversity is the failure of file system designers to develop a rational method for tagging files, and indeed, for developing standard interchange formats. That's what we're trying to do here. .. >What about file type and record format?... >Actually I am attempting to achieve an agreement a precise definition of Unicode plain text that allows the text to be already formatted, one that gives us the same capability that we have always had with ASCII (and Latin-x etc) of encoding and presenting information without *requiring* the use of any higher intelligence beyond what is needed to interpret Space, LS, PS, HT, and FF characters... I find myself in agreement with Ken W's comments a few messages back. I'm also inclined to say that you are wanting to define (in effect) a MIME type, and that part of the confusion / disagreement that has arisen in this thread comes about by calling this type "plain text". You want a file that is tagged with null markup to be interpreted in a specific way (as a text document as opposed, e.g. 
to a database) and with specific layout formatting. As was pointed out in an earlier message, and as we are all familiar with, sometime files that contain only text characters and no tagging are used for purposes other than this, such as the CSV database. Also, there are times when I've had such text files in which I intend all of the text that exists between instances of { BOF, EOF, NLF } to appear on a single line, regardless of length (e.g. in source code), and other times when I expect it to wrap to whatever width is appropriate for the window in which it is viewed. All of these are legitimate things to want to be able to do with a file in this format that we have always known as "plain text". Neither the intended meaning of the content, nor the intended appearance have ever been part of the definition of plain text. Thus, I think you should expect some objection to any suggestion that "plain text" should refer to a file that is intended to be interpreted in a specific way, i.e. as a text document with specific layout formatting. Plain text can be neither more nor less than what is has always been. As we apply plain text to the Unicode context, Ken's comments were on the mark. That is not to say that it isn't reasonable, or desireable, to specify a file format to be used for text documents with specific layout formatting such that it will always appear as the author intended, and such that no markup is used beyond a standard interpretation of the characters (separating this file format from others such as PDF). We'd all benefit from it, if an agreement can be made. I just think that we may need to call it something else. And this is what Frank has acknowledged, though he may not have done so consciously: >the mail system should allow me to mark it as "preformatted plain text" We're not just talking about plain text here, we're talking about a specific kind of plain text. 
Peter 5-Jul-99 17:04:57-GMT,7653;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA24261 for ; Mon, 5 Jul 1999 13:04:57 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA246240 ; Mon, 5 Jul 1999 10:00:25 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01854; Mon, 5 Jul 99 09:45:38 -0700 Message-Id: <9907051645.AA01854@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8434 (1999-07-05 16:45:27 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Mon, 5 Jul 1999 09:45:26 -0700 (PDT) Subject: Re: Plain Text [Ed wrote...] > It puzzles me even more, then, that Frank writes in his Unicode text > file proposal as if Unix practice, or more particularly his own practice > (including practice in file format conversions in cross-platform data > transfers), is normative, not just for other software, but for file > formats on other platforms, without saying how this norm is to be > implemented so that file format conversion ceases to be a problem for > all applications. > I'll try to be more explicit. Whether we know it or not, text interchange methods are well-established in the pre-Unicode world, at least at the record-format level (character sets are another matter, but we know that). When I sit at my { terminal, terminal emulator, xterm window } and tell the host to "type" or "cat" a file, the internal text format is translated to the de facto canonical one, primarily that the local convention for line separation/termination is translated to CRLF. When I transfer a text file with FTP or any other file transfer protocol I know about, the same thing happens (see, e.g. RFC959). 
However, many of us are confused by the fact that local conventions differ, and perceive this as an obstacle to interchange because, for example, it is difficult to read a PC diskette on a UNIX workstation or a Macintosh, or because of the increasing amounts of email we get that uses some encoding or format we don't understand. These are problems that we have an opportunity to solve in the conversion of 8-bit text to Unicode. > How do we get agreement on such a standard from, e.g., Microsoft? > Hopefully Microsoft's representatives to the Unicode Consortium will be supportive, as some of the commentary already seems to indicate. > How do we get users to stop using current methods? > We don't have to. If the Unicode Standard defines what plain text is, then conversion of 8-bit text to Unicode will put all the divergent platform-specific formats into the same Unicode format. > How do we deal with delimited database transfer files with a fixed > limit on line length? > I don't see how these files would be affected. You can put line separators in them if you want, or leave them out. > How do we deal with legacy data? > How do we convert existing 7-bit and 8-bit plain-text files to Unicode plain text? The straightforward conversion is: . Source line -> Destination line terminated by LS. This is according to whatever the local definition of "line" is (UNIX, Macintosh, DOS, VMS, MVS, ...). And of course: . Source character set converted to Unicode. This seems obvious. C0 control characters are kept, including Horizontal Tab and Form Feed. C1 control characters are kept if the source character set has them (e.g. a Latin Alphabet) and translated otherwise (e.g. CP850). Additional wrinkles (options) might include: . Tabs expanded to spaces based on the desired tab stops, which should be 1,9,17,25,... BY DEFAULT (meaning you can supply your own tab stops). . Heuristics might be used to identify paragraphs and to separate them by Paragraph Separator.
For example, a blank line is replaced by PS. Obviously there are pitfalls. . Any conversion program would probably need an option to deal with files with "word processor" record format, in which a line is really a paragraph. > I find myself dealing with Unicode text created by Windows and Windows > applications quite frequently now, with line ends marked in > little-endian fashion as > > 0D 00 0A 00 > > What do we do about that? > I would say that this practice should be discouraged ("be conservative in what you 'send'") in any application that creates or saves Unicode text files. But it should be allowed for ("be liberal in what you 'receive'") in any conversion/import program. > I entirely agree that cross-platform protocols should be defined so that > we stop having conversion problems (such as translating text file formats > upon transfer, as ftp does), but it can't be done within a character set > standard, nor by defining a text file format without file format handling > for applications on different platforms. > I don't think anybody can presume to offer a panacea for differing application formats, other than to define a text-file format that can be used for export/import/interchange, as we have now with most popular applications. We simply need to extend this idea to Unicode. > I have had to collect or in some cases write conversion routines for text > file transfer, including text files in ASCII, 8-bit character sets, and > Unicode. I would much rather have the operating systems do it. > The operating system doesn't know what format or encoding is used in a file. It would be nice if this information was saved along with the file, but it usually isn't. 
If, in the transition to an all-Unicode computing environment, we specify not only the encoding but also a standard record format for interchange of plain text -- including (but not requiring) preformatted plain text -- we won't have to worry about operating systems, file systems, or presentation-layer issues in text-file transfer ever again. Obviously we will always have to worry about format conversions between applications that do NOT use plain text data files. But by defining a low-level baseline format for plain text, there will always be a method for recording and transmitting textual information that rises above ("sinks below") those differences, and that can always be used across platforms, distance, and time. > ... You acknowledge the need for flavors of text > other than your preformatted plain text. I thought you were holding out > for one flavor only. Now we can discuss the flavors, such as delimited > database interchange files with lines of arbitrary length. Presumably we > can define them using some of the apparatus that is becoming available in > XML or as MIME data types. > No, these are higher-level protocols that will go out of fashion some day, probably sooner than you think. Of course you can define or use all the higher level protocols you want, but you should bear in mind they are ephemeral. If you want something that lasts forever, do it in Unicode without reference to MIME, *ML, or anything else, and keep it extremely simple. > To summarize your answer to my objections, we are defining a new format > independent of previous conventions, in which we can specify usage of the > minimal set of formatting characters regardless of usage in text files of > 7-bit ASCII and 8-bit character sets of any kind, while allowing for a few > variant flavors of text, such as preformatted, reflowable, and > database. > Yes. > To which I add, that we can specify a portable implementation, > too, and not have to wait for computer and OS vendors to get on board.
> Double yes. - Frank 5-Jul-99 18:06:41-GMT,4435;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA08662 for ; Mon, 5 Jul 1999 14:06:40 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id LAA185694 ; Mon, 5 Jul 1999 11:02:46 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA02161; Mon, 5 Jul 99 10:50:02 -0700 Message-Id: <9907051750.AA02161@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8435 (1999-07-05 17:49:51 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Mon, 5 Jul 1999 10:49:50 -0700 (PDT) Subject: Re: Plain Text [Peter wrote] > I find myself in agreement with Ken W's comments a few messages back. I'm > also inclined to say that you are wanting to define (in effect) a MIME > type, and that part of the confusion / disagreement that has arisen in > this thread comes about by calling this type "plain text". > I most emphatically do not want to define a MIME type, because MIME will disappear some day but Unicode will last forever (if we do it right). > You want a file that is tagged with null markup to be interpreted in a > specific way (as a text document as opposed, e.g. to a database) and with > specific layout formatting. As was pointed out in an earlier message, and > as we are all familiar with, sometime files that contain only text > characters and no tagging are used for purposes other than this, such as > the CSV database. Also, there are times when I've had such text files in > which I intend all of the text that exists between instances of { BOF, > EOF, NLF } to appear on a single line, regardless of length (e.g. in > source code), and other times when I expect it to wrap to whatever width > is appropriate for the window in which it is viewed. > All of that is fine. I'm only proposing that we codify existing practice. 
If Unicode has a Line Separator (and it does), then if I put it in a file, it should serve its purpose. Ditto for Paragraph Separator. Ditto for C0 HT and FF (even though those purposes might be ill-defined), in the absence of "native" Unicode replacements for them. I agree that marking a "plain-text" stream as "preformatted" or "to be flowed" is a higher-level issue. However, we must also agree that plain text CAN be preformatted and not ALWAYS flowed, and that Unicode already contains the mechanisms to do it. > All of these are legitimate things to want to be able to do with a file in > this format that we have always known as "plain text". Neither the > intended meaning of the content, nor the intended appearance have ever > been part of the definition of plain text. Thus, I think you should expect > some objection to any suggestion that "plain text" should refer to a file > that is intended to be interpreted in a specific way, i.e. as a text > document with specific layout formatting. Plain text can be neither more > nor less than what is has always been. As we apply plain text to the > Unicode context, Ken's comments were on the mark. > > That is not to say that it isn't reasonable, or desireable, to specify a > file format to be used for text documents with specific layout formatting > such that it will always appear as the author intended, and such that no > markup is used beyond a standard interpretation of the characters > (separating this file format from others such as PDF). We'd all benefit > from it, if an agreement can be made. I just think that we may need to > call it something else. > "Preformatted plain text"? It's not catchy but I think it says what it means. > I certainly empathise with a desire to have a standard for preformatted > plain text. Here's the first paragraph of something in a message sent to > me recently. > Yes, "fractured plain text" comes from a flawed conversion algorithm, e.g. 
when pasting from a web page into an email window (a "double-ended break" in this case: misinterpretation of the left margin as leading spaces by the copier and gratuitous word wrapping by the paster). Obviously that's an application issue. However, I do believe that if we can establish a baseline for preformatted plain text, makers of such applications will have a better idea of how to interchange text. - Frank 5-Jul-99 20:45:18-GMT,3272;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA14621 for ; Mon, 5 Jul 1999 16:45:17 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA243178 ; Mon, 5 Jul 1999 13:39:10 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA04382; Mon, 5 Jul 99 13:21:31 -0700 Message-Id: <9907052021.AA04382@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Uml-Sequence: 8444 (1999-07-05 20:20:07 GMT) From: "Jonathan Coxhead" To: Unicode List Date: Mon, 5 Jul 1999 13:20:06 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7BIT | As noted previously, I would not object to adding two more "control | characters" to Unicode to remove our dependence on C0 and C1 | completely: | | 1. UHT "Unicode Horizontal Tab", which is just like C0 HT except | that | the tabstops are well-defined (should the tabbing concept be | carried forward into Unicode Plain Text, rather than using only | spaces). How to define them is, of course, another question. My thoughts on this indicate that explicit tab widths are not appropriate: the only real requirement for plain text is that the columns line up. So we could have a character COLUMN SEPARATOR (CSEP) to go with LINE SEPARATOR (LSEP) and PARAGRAPH SEPARATOR (PSEP). It should interact with these as follows. 
"Within a paragraph that contains a CSEP, each LSEP-delimited line represents a row of a table. The table has as many columns as the maximum number of CSEP characters in any line. Each column should be wide enough to accommodate the longest column-contents in any line in that column. No inter-column spacing is provided: if there is to be space between columns, one column or the other must contain explicit space chatacters." So the general form of a table would be PSEP ... CSEP ... CSEP ... LSEP ... CSEP ... LSEP ... CSEP ... CSEP ... PSEP An unsophisticated renderer may choose to render CSEP as a tab to an 8-column tab stop, and this may often give acceptable results. | Whatever is chosen, let's keep it simple. This is simple to define, but not to render. Also, it doesn't give control over left/right/centre justifying each column. If this is important, I suppose the solution would be a SPACE FILL character, like \hfil in TeX, which (when occuring in a table, i e, a paragraph with at least one CSEP character) provides enough space to pad the entry it appears in to the full width available. This would allow a column to be right-justified (start all entries with SPACE FILL), centre-justified (put a SPACE FILL character before and after the entries), or even justified on a particular character, e g, the decimal point FULL STOP (break it into 2 columns, by writing CSEP, FULL STOP instead of FULL STOP, and right-justify the first, left-justify the second). 
/| o o o (_|/ /| (_/ 5-Jul-99 22:33:53-GMT,1510;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA10623 for ; Mon, 5 Jul 1999 18:33:51 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA200950 ; Mon, 5 Jul 1999 15:23:09 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06878; Mon, 5 Jul 99 15:07:37 -0700 Message-Id: <9907052207.AA06878@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8453 (1999-07-05 22:07:14 GMT) From: keld@dkuug.dk To: Unicode List Date: Mon, 5 Jul 1999 15:07:08 -0700 (PDT) Subject: Re: Plain text: Amendment 1 On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote: > 3) could be something like one out of 3: > > 1. CR > 2. LF > 3. CR LF To clarify: I think "line break" could follow the conventions currently in use on the Internet: Accept all of the three above forms, but only generate one form, preferably the CR LF sequence. It seems like the Internet is going to standardize on UTF-8, and as UTF-8 encodes C0 as a single octet, I think there would be much sense in chosing a C0 sequence for the "line break" function. I think the paragraph break could then be chosen as one of the C0 Information separators, possibly the Record Separator aka control-^ . 
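[Editorial sketch, not part of the original message.] The "accept all three forms, but only generate one" convention described above could look something like this in Python. The function name and the default output form are illustrative assumptions, not anything specified in the thread.

```python
import re

# CR LF is listed first so the pair counts as a single line break;
# a lone CR or a lone LF is matched only when it is not part of a pair.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(text, generate="\r\n"):
    """Accept CR, LF, or CR LF on input and emit one form on output."""
    return _LINE_BREAK.sub(generate, text)


mixed = "one\rtwo\nthree\r\nfour"
print(repr(normalize_line_breaks(mixed)))
```

Because an existing CR LF is matched as a unit, running the function over already-normalized text leaves it unchanged.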
Just my 2 eurocent Keld 5-Jul-99 22:58:31-GMT,2280;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA15543 for ; Mon, 5 Jul 1999 18:58:31 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id PAA199038 ; Mon, 5 Jul 1999 15:54:19 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08168; Mon, 5 Jul 99 15:43:35 -0700 Message-Id: <9907052243.AA08168@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8457 (1999-07-05 22:43:27 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Mon, 5 Jul 1999 15:43:26 -0700 (PDT) Subject: Re: Plain text: Amendment 1 > On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote: > > 3) could be something like one out of 3: > > > > 1. CR > > 2. LF > > 3. CR LF > > To clarify: I think "line break" could follow the conventions > currently in use on the Internet: Accept all of the three above forms, > but only generate one form, preferably the CR LF sequence. > > It seems like the Internet is going to standardize on UTF-8, > and as UTF-8 encodes C0 as a single octet, I think there would be > much sense in chosing a C0 sequence for the "line break" function. > > I think the paragraph break could then be chosen as one of > the C0 Information separators, possibly the Record Separator > aka control-^ . > I think the problem with this idea is that if we look at a Unicode text file and see CR and/or LF in it, we don't know if those characters came from the private text format of a 7- or 8-bit file that was converted to Unicode without any record-format conversion, or if they are the "Unicode" CR and LF. Therefore this would only move the problem of incompatible record formats from the old world (of DOS, Windows, UNIX, Macintosh) to the new one. 
It's better to have Unicode characters LS and PS (and I think also Tab/Column-Separator and Page Separator) than to recycle the C0 controls. This ensures round-trip integrity without having to know the history of the data ("it came originally from DOS so to convert it from Unicode to UNIX we need to...") - Frank 5-Jul-99 23:12:00-GMT,1857;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA18562 for ; Mon, 5 Jul 1999 19:11:58 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA198948 ; Mon, 5 Jul 1999 16:05:19 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA08383; Mon, 5 Jul 99 15:51:57 -0700 Message-Id: <9907052251.AA08383@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8458 (1999-07-05 22:51:42 GMT) From: Otto Stolz To: Unicode List Date: Mon, 5 Jul 1999 15:51:36 -0700 (PDT) Subject: Re: Plain Text Am 1999-07-01 um 13:00 h hat Otto Stolz geschrieben: > In MS-DOS (or PC-DOS and other DOS variants) on the PC, it is not > well defined, at all: [...] > - '09'x (HT) means either a tabulator [...] or a line-break, I am no more sure about the HT used as a line-break in plain text. It is indeed used in an internal Word-format (Word 2.0 for DOS, and perhaps in later versions) for this purpose, but I haven't kept an old Word implementation, so I cannot check Word's input conversion from plain text to this format. Current Word for Windows input conversions from plain text interpret some C0 characters thus (checked with Word 97): '09'x (TAB) tabulator '0A'x (LF) paragraph break '0B'x (VT) line break '0C'x (FF) page break '0D'x (CR) ignored '0E'x (SO) [sic!] 
column break Still, my main point holds: In MS-DOS, plain text is not well defined, as there are wide variations in the usage and meaning of several control characters. Best wishes, Otto Stolz 6-Jul-99 3:34:39-GMT,2640;000000000001 Return-Path: Received: from proxy4.ba.best.com (proxy4.ba.best.com [206.184.139.15]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA04624 for ; Mon, 5 Jul 1999 23:34:25 -0400 (EDT) Received: from macchiato.com (dynamic45.pm03.mv.best.com [209.24.240.173]) by proxy4.ba.best.com (8.9.3/8.9.2/best.out) with ESMTP id UAA22744; Mon, 5 Jul 1999 20:32:14 -0700 (PDT) Message-ID: <37817931.41860B@macchiato.com> Date: Mon, 05 Jul 1999 20:34:09 -0700 From: Mark Davis X-Mailer: Mozilla 4.6 [en] (Win98; U) X-Accept-Language: en,de-CH,fr-CH,it MIME-Version: 1.0 To: Frank da Cruz CC: Unicode List Subject: Re: Plain text: Amendment 1 References: <9907052243.AA08164@unicode.org> Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit A lot of the discussion of line termination relates to technical report #13. Any suggestions for additional information for that report would be welcome. (http://www.unicode.org/unicode/reports/tr13/) Mark Frank da Cruz wrote: > > On Mon, Jul 05, 1999 at 03:16:01AM -0700, keld@dkuug.dk wrote: > > > 3) could be something like one out of 3: > > > > > > 1. CR > > > 2. LF > > > 3. CR LF > > > > To clarify: I think "line break" could follow the conventions > > currently in use on the Internet: Accept all of the three above forms, > > but only generate one form, preferably the CR LF sequence. > > > > It seems like the Internet is going to standardize on UTF-8, > > and as UTF-8 encodes C0 as a single octet, I think there would be > > much sense in chosing a C0 sequence for the "line break" function. > > > > I think the paragraph break could then be chosen as one of > > the C0 Information separators, possibly the Record Separator > > aka control-^ .
> > > I think the problem with this idea is that if we look at a Unicode > text file and see CR and/or LF in it, we don't know if those > characters came from the private text format of a 7- or 8-bit file > that was converted to Unicode without any record-format conversion, > or if they are the "Unicode" CR and LF. Therefore this would only > move the problem of incompatible record formats from the old world > (of DOS, Windows, UNIX, Macintosh) to the new one. > > It's better to have Unicode characters LS and PS (and I think also > Tab/Column-Separator and Page Separator) than to recycle the C0 > controls. This ensures round-trip integrity without having to know > the history of the data ("it came originally from DOS so to convert > it from Unicode to UNIX we need to...") > > - Frank 6-Jul-99 15:00:44-GMT,2552;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA06958 for ; Tue, 6 Jul 1999 11:00:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA200904 ; Tue, 6 Jul 1999 07:53:44 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14263; Tue, 6 Jul 99 07:20:41 -0700 Message-Id: <9907061420.AA14263@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8478 (1999-07-06 14:18:21 GMT) From: Kevin Bracey To: Unicode List Date: Tue, 6 Jul 1999 07:18:20 -0700 (PDT) Subject: Re: NLF (was Frank and Ed, was Plain Text) In message <9907051432.AA29431@unicode.org> Peter_Constable@sil.org wrote: > > > >I find myself dealing with Unicode text created by Windows and Windows > applications quite frequently now, with line ends marked in little-endian > fashion as > > 0D 00 0A 00 > > Indeed, this practice has surprised me. > > Chris Pratley: can you comment on why Word 97 does this rather than using > PS? 
> I think I can partially answer this from experience on our (non-MS) environment. Our system continues to use our native line-ending type (LF only) when dealing with Unicode data, for compatibility. In particular, when converted to UTF-8, which is how Unicode is normally passed around our OS, the data will have standard looking line endings - if PS or LS were used, many non-UTF-8 aware parts of the system would get confused. Also, a lot of Unicode data is converted from non-Unicode sources - conversion will almost always leave C0 and C1 characters untouched. Changing to PS and LS would need knowledge of the source data's line ending conventions, which is hard to determine automatically. If you also need round-trip conversion (eg Shift-JIS data in an HTML form -> Unicode browser workings -> Shift-JIS submission to server), messing with line endings is almost out of the question. All other encodings use C0 controls for line endings - it's hard to make a change for one particular encoding that does it differently. 
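[Editorial sketch, not part of the original message.] The point that switching to LS needs knowledge of the source data's line-ending convention can be illustrated with a small Python sketch. The convention table and the function names to_ls and from_ls are assumptions made for illustration; "mac" here means the classic CR-only Macintosh convention.

```python
LS = "\u2028"  # LINE SEPARATOR

# Line terminators for a few local conventions. The caller has to say
# which one applies; nothing in the data itself records it.
CONVENTIONS = {"unix": "\n", "mac": "\r", "dos": "\r\n"}


def to_ls(text, source):
    """Replace the source platform's line terminator with LS."""
    return text.replace(CONVENTIONS[source], LS)


def from_ls(text, target):
    """Replace LS with the target platform's line terminator."""
    return text.replace(LS, CONVENTIONS[target])


data = "a\r\nb"
for name in ("dos", "unix", "mac"):
    # The same input yields three different results depending on the
    # convention assumed, which is why automatic conversion is fragile.
    print(name, repr(to_ls(data, name)))

print(repr(from_ls(to_ls(data, "dos"), "unix")))
```

Read as DOS text the sample is two clean lines; read as UNIX or classic Macintosh text, a stray CR or LF survives next to the LS.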
-- Kevin Bracey, Senior Software Engineer Pace Micro Technology plc Tel: +44 (0) 1223 725228 645 Newmarket Road Fax: +44 (0) 1223 725328 Cambridge, CB5 8PB, United Kingdom WWW: http://www.acorn.co.uk/ 6-Jul-99 15:48:02-GMT,3748;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA21789 for ; Tue, 6 Jul 1999 11:48:01 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id IAA246046 ; Tue, 6 Jul 1999 08:30:04 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA14360; Tue, 6 Jul 99 07:26:42 -0700 Message-Id: <9907061426.AA14360@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8480 (1999-07-06 14:23:46 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 07:23:45 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Edward Cherlin wrote: > This is the key point for me. You acknowledge the need for flavors of text > other than your preformatted plain text. I thought you were holding out for > one flavor only. Indeed, but "preformatted plain text" has traditionally been called "plain text", or in MIME "text/plain", and this terminology ought not to be revised unwarrantedly. Other species of plain text should have a distinguishing adjective. > Now we can discuss the flavors, such as delimited database > interchange files with lines of arbitrary length. We can, but I think we would do well to nail down preformatted plain text (aka "plain text") first, as it is the most stable. > Presumably we can define > them using some of the apparatus that is becoming available in XML or as > MIME data types. 
Would it make sense, then, to create a formal XML > definition of plain text files, with a leading BOM, no interpretations for > any tags, the minimum set of control characters, and the appropriate set of > transformation formats? No, at least for the XML part. (You could create a full-SGML definition, but I question the purpose of it, except perhaps to help in defining a Unicode-preformatted-plain-text grove model.) XML compels special interpretations for "<" and "&" and requires matching enclosing tags; preformatted plain text has no such requirements. > That would get around my earlier objection, about > how to make an implementation available on all platforms. What about > corresponding MIME types? The corresponding MIME type is "text/plain; charset=utf-8" or "... utf-16". Anything else should have a different MIME type or at least different parameters. > Preformatted or reflowable. I have not seen ones that are not preformatted. > > . Traditional (not "legacy") email and netnews. > > There is presently no way to specify preformatted or reflowable. There is a widespread presumption for preformatted, although sometimes the formatting is done by the creating software, not the user, alas. Rendering software usually has at least an option to display as-is. > To summarize your answer to my objections, we are defining a new format > independent of previous conventions, in which we can specify usage of the > minimal set of formatting characters regardless of usage in text files of > 7-bit ASCII and 8-bit character sets of any kind, Yes. > while allowing for a few > variant flavors of text, such as preformatted, reflowable, and database. And of these, preformatted is the most important and stable, and should be specified first. The others can be specified ad libitum later. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! 
/ Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 16:52:25-GMT,2051;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA10222 for ; Tue, 6 Jul 1999 12:52:24 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA242372 ; Tue, 6 Jul 1999 09:38:54 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA16405; Tue, 6 Jul 99 08:49:29 -0700 Message-Id: <9907061549.AA16405@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8490 (1999-07-06 15:45:40 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 08:45:35 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > I most emphatically do not want to define a MIME type, because MIME will > disappear some day but Unicode will last forever (if we do it right). Technically, "MIME types" are called "media types", and what they really are is named interchange formats. You *are* trying to develop an interchange format; making it a media type requires only finding a name and filling out a short registration form. As I said in an earlier message, MIME rules provide a strong case for distinguishing between "text/plain" and "application/character-stream", (where "application" here really means "other" i.e. "catchall".) The former must be composed of lines with a maximum length of (IIRC) 998 characters; the latter has no such restrictions. Text/plain could still include both reflowable and preformatted text, but I believe the weight of history is in favor of using that term for preformatted text only. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! 
/ Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 16:52:29-GMT,4215;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA10261 for ; Tue, 6 Jul 1999 12:52:28 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA90060 ; Tue, 6 Jul 1999 09:38:57 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA16005; Tue, 6 Jul 99 08:36:26 -0700 Message-Id: <9907061536.AA16005@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8488 (1999-07-06 15:30:50 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 08:30:34 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > We don't have to. If the Unicode Standard defines what plain text is, > then conversion of 8-bit text to Unicode will put all the divergent > platform-specific formats into the same Unicode format. Or some other widely accepted source of standardization, such as Oasis or ECMA or ISO or even W3C (though the first three, IMHO, have a better "fit" to the subject matter). > C1 control characters are kept if the source character > set has them (e.g. a Latin Alphabet) and translated otherwise > (e.g. CP850). I take this to mean "Characters 0x80 to 0x9F are zero-bit-extended if the source character set has C1 characters; if it does not (like CP850, CP1252, or VISCII), they are translated to their proper Unicode graphic equivalents." > . Heuristics might be used to identify paragraphs and to separate them > by Paragraph Separator. For example, a blank line is replaced by PS. > Obviously there are pitfalls. Indeed. For example, blank lines in source code, e.g., are not necessarily paragraph marks. 
This might be a reasonable QOI issue. > . Any conversion program would probably need an option to deal with > files with "word processor" record format, in which a line is really > a paragraph. Note that arbitrary-length lines do not meet the MIME definition of "text" (and nor does UTF-16 text); such things should really have a media type of "application/character-stream" or the like, analogous to "application/octet-stream" but with a charset parameter. > > 0D 00 0A 00 > > > > What do we do about that? > > > I would say that this practice should be discouraged ("be conservative in > what you 'send'") in any application that creates or saves Unicode text > files. But it should be allowed for ("be liberal in what you 'receive'") in > any conversion/import program. Does this Windows-Unicode text always have a proper little-endian BOM, as I believe it does? If so, then the only problem is the precise value of line terminator. In practice, much of the Unicode text (perhaps all of it) in the world today uses old line terminators, and I think they must be explicitly allowed in a flexible definition of preformatted Unicode plain text, even if tagged with SHOULD NOT. > No, these are higher-level protocols that will go out of fashion some day, > probably sooner than you think. Of course you can define or use all the > higher level protocols you want, but you should bear in mind they are > ephemeral. SGML is almost as old, as computer things go, as plain text. Though it was not standardized until 1986, it was devised in 1974; ASCII itself only dates to 1963 or so. Moreover, unlike most file formats, SGML is character-based, not octet-based, and does not depend on any specific processing application, so whatever process refreshes Unicode data will refresh SGML data too. (XML is merely a special case of SGML.) I agree that preformatted plain text should not depend on SGML, though; that is putting Cart before Horse. [snip] > Yes. [snip] > Double yes. 
Sounds like a case of violent agreement. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 18:06:01-GMT,1998;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA01861 for ; Tue, 6 Jul 1999 14:05:58 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA258748 ; Tue, 6 Jul 1999 10:57:06 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA19726; Tue, 6 Jul 99 10:37:16 -0700 Message-Id: <9907061737.AA19726@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8500 (1999-07-06 17:34:29 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 10:34:27 -0700 (PDT) Subject: UTR #13 comments (was: Plain text: Amendment 1) Content-Transfer-Encoding: 7bit Mark Davis wrote: > A lot of the discussion of line termination relates to technical report #13. > Any suggestions for additional information for that report would be welcome. My suggestions: 1) The NEL character in the C1 set (0x85) is the ISO equivalent of EBCDIC NL (0x15) and this mapping is duly given in the EBCDIC code page mappings on the Unicode FTP site. The text should therefore advise applications to treat U+0085 (NL/NEL) as a newline, not U+0015 (NAK). 2) There should be a warning that some old documents use bare CR (0x0D) to do underlining or other overstriking; an application that converts such text should do a more complex conversion, though treating bare CR as a NLF is marginally acceptable even for these documents (which may then wind up containing occasional lines with only spaces and underscores). 
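The two suggestions above can be sketched in Python. This is a minimal illustration, not text from UTR #13: the function names `normalize_newlines` and `has_bare_cr` are hypothetical, and the set of newline functions handled here (CRLF, CR, LF, and NEL U+0085) is an assumption drawn from the discussion in this thread.

```python
import re

NEL = "\u0085"  # C1 NEXT LINE, described above as the ISO counterpart of EBCDIC NL (0x15)

# CRLF is listed first so that it is replaced as one newline, not two.
_NLF = re.compile("\r\n|\r|\n|" + NEL)


def normalize_newlines(text: str, target: str = "\n") -> str:
    """Replace every recognised newline function with a single target string."""
    return _NLF.sub(target, text)


def has_bare_cr(text: str) -> bool:
    """Report a CR not followed by LF, which old documents may use for overstriking."""
    return re.search("\r(?!\n)", text) is not None


sample = "a\r\nb\rc\nd" + NEL + "e"
print(repr(normalize_newlines(sample)))
print(has_bare_cr("under\r_____\n"))
```

A converter built along these lines could call `has_bare_cr` first and fall back to a more careful conversion when it returns true, since treating an overstriking CR as a newline leaves stray lines of spaces and underscores.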
-- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 6-Jul-99 18:07:23-GMT,2852;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA02179 for ; Tue, 6 Jul 1999 14:07:22 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id KAA265860 ; Tue, 6 Jul 1999 10:56:23 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA19432; Tue, 6 Jul 99 10:28:04 -0700 Message-Id: <9907061728.AA19432@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8499 (1999-07-06 17:25:42 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 10:25:40 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7bit Frank da Cruz wrote: > [I]f we look at a Unicode > text file and see CR and/or LF in it, we don't know if those > characters came from the private text format of a 7- or 8-bit file > that was converted to Unicode without any record-format conversion, > or if they are the "Unicode" CR and LF. The semantics of CR and LF in Unicode 2.x *are* the ambiguous ones inherited from the 7-bit controls; there are no other semantics. But this has been changed in Unicode 3.0: see UTR #13 (http://www.unicode.org/unicode/reports/tr13/), which will be a normative part of Unicode 3.0. Note well that UTR #13 does not solely prescribe the semantics of CR and LF during conversion to and from Unicode, but also the semantics of CR and LF *in* Unicode. XML, a major Unicode application, takes almost the same point of view. (IMHO, XML should be modified to accept LS as a line-end character.) 
> Therefore this would only > move the problem of incompatible record formats from the old world > (of DOS, Windows, UNIX, Macintosh) to the new one. Indeed. But the only real problem there is that some people and applications (notably nroff output) use bare CR in plain text to produce physical or notional overprinting. Otherwise, it is perfectly fine to take the UTR #13 viewpoint. > It's better to have Unicode characters LS and PS (and I think also > Tab/Column-Separator and Page Separator) than to recycle the C0 > controls. This ensures round-trip integrity without having to know > the history of the data ("it came originally from DOS so to convert > it from Unicode to UNIX we need to...") As for HT and FF, nobody uses them incompatibly, and introducing new characters for them is supererogation at best. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. 
-- Coleridge / Politzer 6-Jul-99 20:21:22-GMT,1739;000000000001 Return-Path: Received: from osiris.taz.de (osiris.taz.de [194.162.12.2]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA11732 for ; Tue, 6 Jul 1999 16:21:20 -0400 (EDT) Received: from track.hal.taz.de (track.hal.taz.de [10.1.0.1]) by osiris.taz.de (8.9.3/8.9.3) with ESMTP id WAA22660; Tue, 6 Jul 1999 22:21:18 +0200 Received: from diva.edv.taz.de (diva.edv.taz.de [10.1.1.44]) by track.hal.taz.de (8.9.3/8.9.3) with ESMTP id WAA13247; Tue, 6 Jul 1999 22:21:13 +0200 (MET DST) Date: Tue, 6 Jul 1999 22:21:13 +0200 (MEST) From: Roman Czyborra X-Sender: czyborra@diva.edv.taz.de To: Unicode List , John Cowan , Frank da Cruz Subject: Re: Plain Text In-Reply-To: <9907061616.AA17333@unicode.org> Message-ID: Organization: http://czyborra.com/ @ http://taz.de/ Gender: male MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII > Text/plain could still include both reflowable and preformatted > text, but I believe the weight of history is in favor of using > that term for preformatted text only. Please read http://imc.org/draft-gellens-format (also known as http://www.ietf.org/internet-drafts/draft-gellens-format-06.txt) about the Content-Type: text/plain;charset=UTF-8;format=flowed > MIME will disappear some day but Unicode will last forever The Internet and MIME will evolve but I don't see them vanish any earlier than Unicode. MIME has been integrated into the majority of platforms, browsers and mailreaders worldwide. Without MIME we wouldn't be able to properly send multilingual text anywhere. 
6-Jul-99 21:47:42-GMT,1792;000000000005 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA07183 for ; Tue, 6 Jul 1999 17:47:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id OAA186772 ; Tue, 6 Jul 1999 14:35:38 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA24360; Tue, 6 Jul 99 14:25:17 -0700 Message-Id: <9907062125.AA24360@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8513 (1999-07-06 21:24:03 GMT) From: "Tony Harminc" To: Unicode List Date: Tue, 6 Jul 1999 14:24:01 -0700 (PDT) Subject: Re: Plain text: Amendment 1 On 6 Jul 99, at 10:25, John Cowan wrote: > As for HT and FF, nobody uses them incompatibly, and > introducing new characters for them is supererogation at best. Actually the question of HT and FF is the most bothersome one, for me. There are (at least) two problems: HT and FF both depend in some sense on the user's environment, e.g. page length (paper size if the "rendering engine" is a printer or hardcopy terminal), and tab stop settings. HT has ambiguous semantics when the HT occurs when the cursor is already at a tab stop. If the cursor got to a tab stop because of an HT, then there is no argument - another HT moves to the next tab stop. But if the cursor got there because of ordinary, implicit movement, then some systems ignore an HT (i.e. stay in the same place), while others move on to the next stop. Granted, this is mainly a problem of input methods rather than data storage or interchange, but I don't think it's quite fair to say that no one uses HT incompatibly. Tony H. 
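The HT ambiguity described in the message above can be made concrete with a small Python sketch. It assumes fixed tab stops every `width` columns and a hypothetical flag, `skip_if_on_stop`, that selects between the two behaviours; neither the function name nor the flag comes from any real system, and treating column 0 as "not reached by movement" is also an assumption.

```python
def expand_tabs(line: str, width: int = 8, skip_if_on_stop: bool = False) -> str:
    """Expand HT to spaces under one of two policies (illustrative only).

    skip_if_on_stop=False: an HT always moves to the next tab stop.
    skip_if_on_stop=True:  an HT is ignored when the cursor already sits on a
                           tab stop that it reached by ordinary characters.
    """
    out = []
    col = 0
    arrived_by_tab = False
    for ch in line:
        if ch == "\t":
            on_stop = col > 0 and col % width == 0
            if skip_if_on_stop and on_stop and not arrived_by_tab:
                pad = 0  # stay in place: the cursor is already at a stop
            else:
                pad = width - (col % width)
            out.append(" " * pad)
            col += pad
            arrived_by_tab = True
        else:
            out.append(ch)
            col += 1
            arrived_by_tab = False
    return "".join(out)


text = "abcdefgh\tX"  # eight characters put the cursor exactly on a tab stop
print(repr(expand_tabs(text)))
print(repr(expand_tabs(text, skip_if_on_stop=True)))
```

The same stored characters render differently under the two policies, which is why the behaviour matters for interchange as well as for input.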
6-Jul-99 22:57:17-GMT,1684;000000000005 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id SAA24819 for ; Tue, 6 Jul 1999 18:57:16 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id PAA16767; Tue, 6 Jul 1999 15:58:18 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id PAA23039; Tue, 6 Jul 1999 15:57:15 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA04633; Tue, 6 Jul 1999 15:57:14 -0700 Date: Tue, 6 Jul 1999 15:57:14 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907062257.AA04633@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Plain Text Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > So at minimum, a text file should be tagged according to character set. Whoa! Wait a minute. How do we get from here to there? If it's tagged, it's not a *plain* text file, but something else. The way ahead out of the character set identity morass for "text files" is to use the Universal Character Set -- that way, once again, we will know how to interpret plain text files. The rest of this discussion is about something else other than what the Unicode Standard means by "plain text", and has, as far as I can tell, more to do with devising a kind of a lowest common denominator document format standard for interoperability. While people on this list may find that interesting to discuss, it is rather orthogonal to the intended scope of the Unicode Standard. 
--Ken Whistler 6-Jul-99 23:07:50-GMT,1372;000000000005 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA27626 for ; Tue, 6 Jul 1999 19:07:49 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id QAA18832; Tue, 6 Jul 1999 16:08:40 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id QAA25245; Tue, 6 Jul 1999 16:07:37 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA04637; Tue, 6 Jul 1999 16:07:37 -0700 Date: Tue, 6 Jul 1999 16:07:37 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907062307.AA04637@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: RE: Plain Text Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > > The grayer > > part of this discussion is about what constitutes "preformatted plain > > text". I don't think this can be standardized to practical effect. That > > is, you could write a standard, but would anyone use it? > > > Those who needed a guaranteed way to record preformatted plain text in > documents that can persist over long periods of time and across all > applications and platforms would use it. At the moment, this format is called a "book". 
:-) --Ken 6-Jul-99 23:35:26-GMT,1544;000000000001 Return-Path: Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA06353 for ; Tue, 6 Jul 1999 19:35:26 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by inergen.sybase.com with ESMTP id QAA22879; Tue, 6 Jul 1999 16:36:29 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id QAA27662; Tue, 6 Jul 1999 16:35:24 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA04643; Tue, 6 Jul 1999 16:35:24 -0700 Date: Tue, 6 Jul 1999 16:35:24 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9907062335.AA04643@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: Re: Plain text: Amendment 1 Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII > I think the problem with this idea is that if we look at a Unicode > text file and see CR and/or LF in it, we don't know if those > characters came from the private text format of a 7- or 8-bit file > that was converted to Unicode without any record-format conversion, > or if they are the "Unicode" CR and LF. Therefore this would only > move the problem of incompatible record formats from the old world > (of DOS, Windows, UNIX, Macintosh) to the new one. The unfortunate horse is already out of the burning barn on this one. So now we have to add a stable to the new Unicode garage. See Unicode Technical Report #13. 
--Ken 6-Jul-99 23:45:36-GMT,2026;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA08869 for ; Tue, 6 Jul 1999 19:45:35 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id QAA10440 ; Tue, 6 Jul 1999 16:42:01 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26067; Tue, 6 Jul 99 16:32:14 -0700 Message-Id: <9907062332.AA26067@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8519 (1999-07-06 23:30:50 GMT) From: kenw@sybase.com (Kenneth Whistler) To: Unicode List Cc: unicode@unicode.org, kenw@sybase.com Date: Tue, 6 Jul 1999 16:30:44 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Jonathan suggested: > > My thoughts on this indicate that explicit tab widths are not > appropriate: the only real requirement for plain text is that the > columns line up. So we could have a character > > COLUMN SEPARATOR > > (CSEP) to go with LINE SEPARATOR (LSEP) and PARAGRAPH SEPARATOR (PSEP). This isn't going to happen. Column alignment in tables is clearly a higher-level document formatting issue -- not a problem to be solved by attributing complex layout attributes to yet another format control character in the character encoding standard. > > So the general form of a table would be > > PSEP ... CSEP ... CSEP ... LSEP > ... CSEP ... LSEP > ... CSEP ... CSEP ... > PSEP > No, a table is an object defined at a higher level. > > | Whatever is chosen, let's keep it simple. Frank got that one right. We already got TAB's, ineluctably. So define some interoperable behavior on them, as is already done for the kind of preformatted plain text Frank is talking about. Otherwise, use spaces. Any other attempts to push more complex formatting down to the bare minimum preformatted plain text format is bound to fail, IMO. 
--Ken > 7-Jul-99 0:15:33-GMT,3405;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA16352 for ; Tue, 6 Jul 1999 20:15:32 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA260764 ; Tue, 6 Jul 1999 17:11:34 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26592; Tue, 6 Jul 99 16:58:21 -0700 Message-Id: <9907062358.AA26592@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8521 (1999-07-06 23:57:08 GMT) From: kenw@sybase.com (Kenneth Whistler) To: Unicode List Cc: unicode@unicode.org, kenw@sybase.com Date: Tue, 6 Jul 1999 16:57:07 -0700 (PDT) Subject: Re: Plain text: Amendment 1 John Cowan wrote: > > The semantics of CR and LF in Unicode 2.x *are* the ambiguous > ones inherited from the 7-bit controls; there are no other semantics. > But this has been changed in Unicode 3.0: see UTR #13 > (http://www.unicode.org/unicode/reports/tr13/), which will be a > normative part of Unicode 3.0. This is not the case. UTR #13 *is* to be considered part of the Unicode Standard, Version 3.0: http://www.unicode.org/unicode/standard/versions/Unicode3.0-beta.html However, UTR #13 constitutes "Unicode Newline *Guidelines*" [emphasis added]. There is no conformance specification and there are no normative implications. The scope constitutes: "a set of recommendations for handling these characters so as to minimize the effects on users." Think of UTR #13 as a late addition to Chapter 5, Implementation Guidelines, that did not make it into the actual printed text of The Unicode Standard, Version 3.0, forthcoming. > Note well that UTR #13 does not > solely prescribe the semantics of CR and LF during conversion to and > from Unicode, but also the semantics of CR and LF *in* Unicode. It makes suggestions. It does not normatively prescribe. 
> > As for HT and FF, nobody uses them incompatibly, and > introducing new characters for them is supererogation at best. I would agree with this. > Mark Davis wrote: > > > A lot of the discussion of line termination relates to technical report #13. > > Any suggestions for additional information for that report would be welcome. > > My suggestions: > > 1) The NEL character in the C1 set (0x85) is the ISO equivalent of > EBCDIC NL (0x15) and this mapping is duly given in the EBCDIC code page > mappings on the Unicode FTP site. The text should therefore advise > applications to treat U+0085 (NL/NEL) as a newline, not U+0015 (NAK). This was a typo/oversight in the text of UTR #13 and will be corrected. > > 2) There should be a warning that some old documents use bare > CR (0x0D) to do underlining or other overstriking; an application > that converts such text should do a more complex conversion, though > treating bare CR as a NLF is marginally acceptable even for these > documents (which may then wind up containing occasional lines > with only spaces and underscores). This is a good suggestion to add to the text of UTR #13. --Ken > > -- > John Cowan http://www.ccil.org/~cowan cowan@ccil.org > Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, > Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. 
> -- Coleridge / Politzer > 7-Jul-99 0:47:40-GMT,3104;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA25012 for ; Tue, 6 Jul 1999 20:47:39 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA252212 ; Tue, 6 Jul 1999 17:42:46 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26896; Tue, 6 Jul 99 17:29:02 -0700 Message-Id: <9907070029.AA26896@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8523 (1999-07-07 00:26:52 GMT) From: Frank da Cruz To: Unicode List Cc: unicode@unicode.org Date: Tue, 6 Jul 1999 17:26:51 -0700 (PDT) Subject: Re: Plain Text > > So at minimum, a text file should be tagged according to character set. > > Whoa! Wait a minute. How do we get from here to there? > > If it's tagged, it's not a *plain* text file, but something else. > Sorry, I meant externally tagged, e.g. in the directory entry, along with the size, date, etc. (The lack of this kind of external tagging is a pet peeve of long duration, but is not exactly relevant to this discussion.) > The way ahead out of the character set identity morass for "text files" > is to use the Universal Character Set -- that way, once again, we > will know how to interpret plain text files. > Agreed! Well... At least if we are successful, and some new consortium doesn't come along xx years from now and declare Unicode to be "legacy" and its own new-and-improved universal encoding to be the only one to use from now on. At which point, we might need to differentiate "legacy" Unicode data from the new code, just as we now need to distinguish Unicode from Macintosh Quickdraw, Latin-1, etc. 
(Saying there will be only one character set in the future is like saying a network address can be 8 bits because there will never be more than 256 computers on a network :-) > The rest of this discussion is about something else other than what > the Unicode Standard means by "plain text", and has, as far as I can > tell, more to do with devising a kind of a lowest common denominator > document format standard for interoperability. While people on this list > may find that interesting to discuss, it is rather orthogonal to the > intended scope of the Unicode Standard. > If it is, it shouldn't be. If we rely on some other organization to worry about this (which one has the authority?) and Unicode outlives the standards and products of that organization, then we're back to "all bets are off". On the other hand, if we can back up the statement that Unicode is a plain-text standard with a definition of plain text that incorporates "lowest common denominator document format standard for interoperability" I think we will have added significant value and endurance to Unicode. The discussion seems to be trailing off -- I suppose I'll wait a few days to see what else comes up and then attempt to write something up (with full consideration of TR13). 
- Frank 7-Jul-99 2:44:42-GMT,1655;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id WAA07895 for ; Tue, 6 Jul 1999 22:44:41 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id TAA191402 ; Tue, 6 Jul 1999 19:37:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA27639; Tue, 6 Jul 99 19:27:23 -0700 Message-Id: <9907070227.AA27639@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8525 (1999-07-07 02:26:09 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 19:26:07 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7bit Kenneth Whistler scripsit: > However, UTR #13 constitutes "Unicode Newline *Guidelines*" [emphasis > added]. There is no conformance specification and there are no > normative implications. The scope constitutes: "a set of > recommendations for handling these characters so as to minimize the > effects on users." Think of UTR #13 as a late addition to Chapter 5, > Implementation Guidelines, that did not make it into the actual printed > text of The Unicode Standard, Version 3.0, forthcoming. Ah, I missed that point. But my point was that whereas Unicode 2.0 had nothing to say about CR and LF and N(E)L, Unicode 3.0 does. -- John Cowan cowan@ccil.org I am a member of a civilization. 
--David Brin 7-Jul-99 2:45:21-GMT,1919;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id WAA08109 for ; Tue, 6 Jul 1999 22:45:20 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id TAA243160 ; Tue, 6 Jul 1999 19:37:31 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA27576; Tue, 6 Jul 99 19:23:24 -0700 Message-Id: <9907070223.AA27576@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8524 (1999-07-07 02:22:09 GMT) From: John Cowan To: Unicode List Date: Tue, 6 Jul 1999 19:22:07 -0700 (PDT) Subject: Re: Plain Text Content-Transfer-Encoding: 7bit Kenneth Whistler scripsit: > > > > So at minimum, a text file should be tagged according to character set. > > Whoa! Wait a minute. How do we get from here to there? > > If it's tagged, it's not a *plain* text file, but something else. I believe the reference was to file metadata like the application tag on the Mac, rather than to anything in-band. > The rest of this discussion is about something else other than what > the Unicode Standard means by "plain text", and has, as far as I can > tell, more to do with devising a kind of a lowest common denominator > document format standard for interoperability. While people on this list > may find that interesting to discuss, it is rather orthogonal to the > intended scope of the Unicode Standard. Just so. Historically, such documents have been called "plain text" documents. What Unicode means by "plain text" is simply a stream of characters. -- John Cowan cowan@ccil.org I am a member of a civilization. 
--David Brin 7-Jul-99 15:43:51-GMT,4432;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA20987 for ; Wed, 7 Jul 1999 11:43:51 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id IAA245160 ; Wed, 7 Jul 1999 08:35:39 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01732; Wed, 7 Jul 99 08:11:57 -0700 Message-Id: <9907071511.AA01732@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8535 (1999-07-07 15:10:18 GMT) From: Mark Davis To: Unicode List Cc: Unicode List Date: Wed, 7 Jul 1999 08:10:16 -0700 (PDT) Subject: Re: Plain text: Tab stops Content-Transfer-Encoding: 7bit HT has even more ambiguous semantics than you indicate. We did a survey a few years ago of word processors and desktop publishing programs, and found a wide range of different behaviors. Suppose you have a set of tab stops, e.g. at 12pt, 36pt, 72pt, etc. You also have a string of text containing tabs. The tabs in the text divide up the text into a list of tab fields (the text between tabs). There are four problematic situations. 1. A tab field would touch or overlap a previous tab field if placed at the tab stop.* Possible behaviors we observed here were: - go to the next tab stop - go to the next line, at that tab stop. - go to the next line, at the start - ignore the tab, treat it as a space, and merge with the next tab field. 2. There are more tab fields than tab stops. Possible behaviors we observed here were: - go to the next line, at that tab stop. - go to the next line, at the start - ignore the tab, treat it as a space, and merge with the next tab field. - manufacture implicit tab stops past the end, e.g. at every 36 points, or at every 8 em. 3. A tab field would exceed the paragraph margin. 
Possible behaviors we observed here were: - go to the next line, at the start - go to the next line, at the first tab stop. 4. Tabs are used in non-left flush lines (e.g. with centered or right-flush lines). Possible behaviors we observed here were: - ignore the flush setting on the line. - apply the flush to just the first tab field. - apply the flush to just the last tab field. - lay out the tab fields as if the text were left-flush, then shift the entire line to center or right-flush it. (This comes up with pretty random looking tabulation.) Some DTP programs, despite our best efforts to figure out the rules they were using, appeared to be pretty random in their behavior. This is especially the case with #4. * Overlap (#1) does not only mean that the tab field is too big for the tab stop; it also happens with mixtures of left, right and center tabs. Look at the following example, where '[' means left tab stop, and '|' means centered tab stop, and '~' means tab (and use monospaced font to see properly): [ | aaaaaaaaaaaa~bbbbbbb The bbbbbbb text can't be placed at the centered tab stop properly without overlapping the aaaaaaaaaaaa. Overlap can also happen when the second tab field is centered or right flush and is so large that it overlaps with the left margin. Mark Tony Harminc wrote: > On 6 Jul 99, at 10:25, John Cowan wrote: > > > As for HT and FF, nobody uses them incompatibly, and > > introducing new characters for them is supererogation at best. > > Actually the question of HT and FF is the most bothersome one, for > me. There are (at least) two problems: > > HT and FF both depend in some sense on the user's environment, e.g. > page length (paper size if the "rendering engine" is a printer or > hardcopy terminal), and tab stop settings. > > HT has ambiguous semantics when the HT occurs when the cursor is > already at a tab stop. If the cursor got to a tab stop because of an > HT, then there is no argument - another HT moves to the next tab > stop. 
But if the cursor got there because of ordinary, implicit > movement, then some systems ignore an HT (i.e. stay in the same > place), while others move on to the next stop. Granted, this is > mainly a problem of input methods rather than data storage or > interchange, but I don't think it's quite fair to say that no one > uses HT incompatibly. > > Tony H. 8-Jul-99 0:15:59-GMT,2754;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id UAA00495 for ; Wed, 7 Jul 1999 20:15:58 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id RAA270376 ; Wed, 7 Jul 1999 17:08:32 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA09145; Wed, 7 Jul 99 16:52:09 -0700 Message-Id: <9907072352.AA09145@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7BIT X-Uml-Sequence: 8554 (1999-07-07 23:51:58 GMT) From: "Jonathan Coxhead" To: Unicode List Date: Wed, 7 Jul 1999 16:51:57 -0700 (PDT) Subject: Re: Plain text: Amendment 1 Content-Transfer-Encoding: 7BIT | > So we could have a character | > | > COLUMN SEPARATOR | > | > (CSEP) to go with LINE SEPARATOR (LSEP) and PARAGRAPH SEPARATOR (PSEP). | | This isn't going to happen. Column alignment in tables is clearly a | higher-level document formatting issue -- not a problem to be solved | by attributing complex layout attributes to yet another format | control character in the character encoding standard. Couldn't the same once have been said of "advance to next line"? Originally derived from 2 hardware control commands, but now made abstract as LSEP? There are various ways to express the semantic "advance to next column" in plain text, chiefly: ---insert enough spaces to make the lines line up; ---insert an HT character; ---insert a number of HT characters. 
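The dependence of the HT options in the list above on the reader's tab stops can be sketched in Python. This is only an illustration, not anything proposed in the thread: the sample row and the helper name `column_of` are hypothetical, and the fixed tab sizes of 8 and 4 columns are assumed settings standing in for two different receiving environments.

```python
# Hypothetical sample row; "\t" stands for an HT character.
row = "a\tbb"

def column_of(text: str, field: str, tabsize: int) -> int:
    """Return the 0-based column where `field` starts after tab expansion."""
    # str.expandtabs replaces each HT with spaces up to the next multiple
    # of `tabsize`, which models a renderer with evenly spaced tab stops.
    return text.expandtabs(tabsize).index(field)

# The same stored characters land in different columns depending on the
# tab stops assumed by whoever renders them.
print(column_of(row, "bb", 8))  # column 8 with stops every 8 columns
print(column_of(row, "bb", 4))  # column 4 with stops every 4 columns
```

The stored text is identical in both calls; only the assumed tab stops differ, which is the environment dependence the HT discussion keeps returning to.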
The descriptions for LSEP and PSEP say "may be used to express this semantic unambiguously". Confronted by a requirement that the concept of vertically-aligned columns might be an important part of plain text, the consistent option seems to be a character whose only purpose is to separate columns. This has 2 almost-immediate corollaries: (1) LSEP should separate rows; (2) the scope of the columns should be limited in some way, with PSEP being the obvious choice. As I noted, it has exactly the same minimum implementation requirements as HT, but it also gives the renderer the *option* of doing nicer alignment, if it wants. So it needn't be complex. It is certainly possible that the foundation on which this rests ("the concept of vertically-aligned columns is an important part of plain text") is just not true---in which case trying to nail down the semantics of HT seems like a logically impossible task, as it shares that foundation. /| o o o (_|/ /| (_/ 8-Jul-99 1:18:01-GMT,1533;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA06626 for ; Wed, 7 Jul 1999 21:18:01 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id SAA323710 ; Wed, 7 Jul 1999 18:11:20 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA10198; Wed, 7 Jul 99 18:00:38 -0700 Message-Id: <9907080100.AA10198@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8557 (1999-07-08 00:58:57 GMT) From: "Christopher J. 
Fynn" To: Unicode List Date: Wed, 7 Jul 1999 17:58:56 -0700 (PDT) Subject: RE: Plain text: Amendment 1 Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id VAA06626 > From: Jonathan Coxhead wrote: > There are various ways to express the semantic "advance to next column" > in plain text, chiefly: > > ---insert enough spaces to make the lines line up; Doesn't this assume fixed-width, or at least known-width, glyphs? And do you take into account non-spacing glyphs? What about scripts that can be written vertically or horizontally? Scripts where the glyph form representing a character (and thus its width) is dependent on context? Making columns line up by inserting spaces is not a good idea. - Chris 8-Jul-99 6:11:55-GMT,1954;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id CAA06425 for ; Thu, 8 Jul 1999 02:11:54 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id XAA188368 ; Wed, 7 Jul 1999 23:02:44 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA12174; Wed, 7 Jul 99 22:52:19 -0700 Message-Id: <9907080552.AA12174@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8562 (1999-07-08 05:52:07 GMT) From: Edward Cherlin To: Unicode List Date: Wed, 7 Jul 1999 22:52:06 -0700 (PDT) Subject: Re: Plain Text At 09:50 -0700 7/5/1999, Frank da Cruz wrote: >[Ed wrote...] [snip] >> How do we deal with delimited database transfer files with a fixed >> limit on line length? >> >I don't see how these files would be affected. You can put line separators >in them if you want, or leave them out. So the line length limit is an option? 
[snip] >> To summarize your answer to my objections, we are defining a new format >> independent of previous conventions, in which we can specify usage of the >> minimal set of formatting characters regardless of usage in text files of >> 7-bit ASCII and 8-bit character sets of any kind, while allowing for a few >> variant flavors of text, such as preformatted, reflowable, and >> database. >> >Yes. > >> To which I add, that we can specify a portable implementation, >> too, and not have to wait for computer and OS vendors to get on board. >> >Double yes. > >- Frank -- Edward Cherlin edward.cherlin.sy.67@aya.yale.edu "It isn't what you don't know that hurts you, it's what you know that ain't so."--Mark Twain, or else some other prominent 19th century humorist and wit 11-Jul-99 8:24:16-GMT,1472;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA09130 for ; Sun, 11 Jul 1999 04:24:15 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id BAA258092 ; Sun, 11 Jul 1999 01:20:39 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA27285; Sun, 11 Jul 99 01:11:55 -0700 Message-Id: <9907110811.AA27285@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-Uml-Sequence: 8594 (1999-07-11 08:11:44 GMT) From: Edward Cherlin To: Unicode List Date: Sun, 11 Jul 1999 01:11:42 -0700 (PDT) Subject: MIME text/plain (was Re: Plain Text) At 07:23 -0700 7/6/1999, John Cowan wrote: [snip] >The corresponding MIME type is "text/plain; charset=utf-8" or >"... utf-16". > >Anything else should have a different MIME type or at least >different parameters. [snip] How is "text/plain" defined? What does it specify about line lengths, word wrap, fixed vs. proportional fonts, line end characters, and line and paragraph separators? 
-- Edward Cherlin edward.cherlin.sy.67@aya.yale.edu "It isn't what you don't know that hurts you, it's what you know that ain't so."--Mark Twain, or else some other prominent 19th century humorist and wit 11-Jul-99 16:23:46-GMT,1815;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA10650 for ; Sun, 11 Jul 1999 12:23:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA244866 ; Sun, 11 Jul 1999 09:20:02 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA27873; Sun, 11 Jul 99 09:07:23 -0700 Message-Id: <9907111607.AA27873@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 8595 (1999-07-11 16:07:07 GMT) From: Jungshik Shin To: Unicode List Date: Sun, 11 Jul 1999 09:07:06 -0700 (PDT) Subject: Re: MIME text/plain (was Re: Plain Text) > At 07:23 -0700 7/6/1999, John Cowan wrote: > [snip] > >The corresponding MIME type is "text/plain; charset=utf-8" or > >"... utf-16". > > > >Anything else should have a different MIME type or at least > >different parameters. Can I propose that everyone on this mailing list stop sending messages in "pre-Unicode" encodings (like ISO-8859-1) and begin sending her/his messages with non-US-ASCII characters in UTF-8 (well, a US-ASCII-only message also qualifies as UTF-8, as everybody knows)? Isn't it funny that people on the Unicode mailing list send messages in "legacy" encodings like ISO-8859-1 (by far the most frequently used encoding on the list other than UTF-8 and US-ASCII, which can be labelled as UTF-8)? I know this will for sure lead to some inconvenience for some people (perhaps quite a few of us), but aren't we supposed to set an example in promoting as rapid and wide an adoption of Unicode as possible? 
Jungshik Shin 12-Jul-99 14:22:13-GMT,2758;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA03664 for ; Mon, 12 Jul 1999 10:22:12 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id HAA89920 ; Mon, 12 Jul 1999 07:12:18 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA01632; Mon, 12 Jul 99 06:58:06 -0700 Message-Id: <9907121358.AA01632@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-Uml-Sequence: 8606 (1999-07-12 13:57:55 GMT) From: John Cowan To: Unicode List Date: Mon, 12 Jul 1999 06:57:53 -0700 (PDT) Subject: Re: MIME text/plain (was Re: Plain Text) Content-Transfer-Encoding: 7bit Edward Cherlin wrote: > How is "text/plain" defined? What does it specify about line lengths, word > wrap, fixed vs. proportional fonts, line end characters, and line and > paragraph separators? RFC 2046, section 4.1 ff., is authoritative: # Plain text does not provide for or allow # formatting commands, font attribute specifications, processing # instructions, interpretation directives, or content markup. Plain # text is seen simply as a linear sequence of characters, possibly # interrupted by line breaks or page breaks. Plain text may allow the # stacking of several characters in the same position in the text. # Plain text in scripts like Arabic and Hebrew may also include # facilities that allow the arbitrary mixing of text segments with # opposite writing directions. # # [...] # # The canonical form of any MIME "text" subtype MUST always represent a # line break as a CRLF sequence. Similarly, any occurrence of CRLF in # MIME "text" MUST represent a line break. Use of CR and LF outside of # line break sequences is also forbidden. 
# # This rule applies regardless of format or character set or sets # involved. # # NOTE: The proper interpretation of line breaks when a body is # displayed depends on the media type. In particular, [...] it is # appropriate to treat a line break as a transition to a new line when # displaying a "text/plain" body [...]. It should not be # necessary to add any line breaks to display "text/plain" correctly # [...]. There is no talk of fonts or paragraphs, and the "NOTE:" paragraph suggests that word (or non-word) wrapping is inappropriate. -- John Cowan http://www.ccil.org/~cowan cowan@ccil.org Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau, Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies. -- Coleridge / Politzer 15-Jul-99 13:52:50-GMT,7678;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id JAA13840 for ; Thu, 15 Jul 1999 09:52:50 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id GAA256534 ; Thu, 15 Jul 1999 06:44:14 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22770; Thu, 15 Jul 99 06:36:01 -0700 Message-Id: <9907151336.AA22770@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8664 (1999-07-15 13:35:32 GMT) From: "Reynolds, Gregg" To: Unicode List Cc: unicode@unicode.org Date: Thu, 15 Jul 1999 06:35:31 -0700 (PDT) Subject: RE: Arabic - Alef Maqsurah Dear Ken, Thanks very much for your thoughtful reply. A few points before I head back into the salt mines: > -----Original Message----- > From: kenw@sybase.com [mailto:kenw@sybase.com] > Sent: Wednesday, July 14, 1999 8:07 PM > To: greynolds@datalogics.com > Cc: unicode@unicode.org; kenw@sybase.com > Subject: RE: Arabic - Alef Maqsurah > > > > this discussion. 
My personal project is to model the > working of Arabic > > texts, so my loyalties are to the language, not to legacy software. > > Here, "legacy" software includes, of course, Office 2000, > which is only > just now becoming available, with Unicode-based Arabic as part of the > package. That's pretty new to already be scorned as "legacy software." It's not that I scorn legacy software; that would be like scorning gravity (or God, where a certain software maker is concerned). I just think the natural language, and not legacy software ("encoding designs" would be a better term here) should be the yardstick. > > misunderstand it. Much of the confusion (IMHO) is due > simply to loose > > terminology. > > We keep working on the terminology, and have tightened up a lot in > the new version 3.0 (forthcoming). --Although, unfortunately, > this area > of input methods is not scheduled for any new additions or > clarifications at the moment. > > But in my opinion, most of the confusion about such issues and the > Unicode Standard are not really the result of loose terminology, but Yes; I should have said "unfinished" or the like instead of loose; I don't mean to imply the editors are slackers. > > > > I think it probably does turn up for many languages - > remember my concern is > > with encoding texts in the language, not the script. It's > not a question of > > essentialism (whatever that is) but peculiarlism. (In two > words: clitics > > and non-concatenative morphology.) > > Ah, so it *is* an issue of Arabic essentialism. The > morphological (or whatever-- > fill in your list of attributes here) essence of Arabic is > different from > that of other languages; therefore it must be treated in an > essentially > different way in encoding (or whatever--fill in your list here) to be > handled correctly. > One request: please let's not resort to such labels. "Ism-ism" in my opinion almost always obscures more than it enlightens. 
As to the specifics of your comment, I am emphatically not making the case that Arabic has some sort of mystical essence that deserves some kind of special treatment. On the contrary, my point is precisely that it and many other languages that do not share the linguistic features that make e.g. English amenable to digital representations already receive a kind of special treatment, in that they must be encoded using a strategy designed for one class of languages. I think this situation could be remedied to a certain extent without breaking Unicode. > > Here you are talking about the lemmatizing problem for search > algorithms. > This is, indeed, very sensitive to the morphology and morphosyntactic > structures of particular languages. Implementers of > multilingual search > engines are well aware of this problem and must tailor their > algorithms > to deal with the particular morphologies they encounter. But this begs the question. They don't encounter particular morphologies; they encounter particular encodings. Encodings, natural and artificial, always reflect some theory of language. Change the encoding and you change the problem. > > Yes and yes. You just cannot build morphological structure > into a practical > character encoding -- especially one which has to be > universal, and applicable > to representation of text in any language, living or dead, in > any script. On the contrary, you cannot *not* build morphological structure into an encoding. Unicode already does: lexemes are built by concatenating text atoms. Works great for English, not so great for e.g. Arabic. How else can one explain the space "character" as a positive element? Even for Arabic, Unicode accommodates some level of morphological intelligence: "contextual shaping" encodes morphology (prosodic word boundary). Every "natural" encoding of language into visual form does the same to some extent. It's not a question of whether, but of how much. 
> > Ah, but here is where your basic approach, as it applies to > the Unicode > Standard, breaks down. The Arabic *script* is what is encoded in the > standard. The Arabic script is used to represent text in hundreds of > non-Semitic languages, from Urdu, to Malay, to Uighur, to > Persian, to Pashto, to > Swahili, as well as the Semitic core languages. Those > languages run the > complete gamut of morphological types. You can't just reconstruct the > encoding of the Arabic script in Unicode to tailor it to the Arabic > *language* morphology, when it can and is used to represent > text in all > the other languages, including many Indo-European languages, > for that matter. Understood, but my view is that this is where Unicode itself gets a little confused. Does it or does it not encode presentational (visual) form? Arabic presentational forms (by which I mean all letterforms used in writing) are indeed used in many languages from different families, but do these presentational forms share the same character semantics across languages? I sincerely doubt it. So an encoding that works across languages must sharply distinguish between character semantics and presentational form. Which gets us back to grammatical encoding. BTW, in one of your earlier notes you pointed out that handwriting sequence is the preferable guide to implementing input methods. This is the alternative: grammatical sequence. I'll put together some examples of what I mean this weekend. > > > The argument I will make (eventually; it's > > quitting time just now) is that such structural information > is rightfully > > part of the standard encoding; the intelligence should be moved from > > specialized logic in software and embedded in the text. > > Nope. 
> > You can always embed it in specialized text devoted to the > Arabic language > in particular (either through markup or your own morphologically-based > encoding in private use space), but that is not the design point of > the Unicode Standard for plain text representation. > Not to be provocative, but isn't it interesting how "plain text" just seems to work for some languages and not for others? I don't want you to misconstrue my remarks as a mere whine about the woeful state of the world; I've actually got some concrete suggestions that I'll post this weekend along with some more background info. I think they're technically feasable, which probably dooms them ;.) Thanks again, Gregg 19-Jul-99 9:40:54-GMT,3602;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA04765 for ; Mon, 19 Jul 1999 05:40:53 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAB193762 ; Mon, 19 Jul 1999 02:35:23 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06222; Mon, 19 Jul 99 02:24:15 -0700 Message-Id: <9907190924.AA06222@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 X-Uml-Sequence: 8698 (1999-07-19 09:23:56 GMT) From: Markus Kuhn To: Unicode List Date: Mon, 19 Jul 1999 02:23:55 -0700 (PDT) Subject: Re: Apostrophes, quotation marks, keyboards and typography Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id FAA04765 Jonathan Rosenne wrote on 1999-07-18 22:14 UTC: > 1. this is one of the reasons for text in HTML. The processor can > substitute the correct character. > > In general, any word processor should allow the user to style the text as a > quotation, rather than require him to type typographical characters. 
I personally am not convinced that higher layer protocols should be used to handle punctuation. This completely violates my concept of plain text, and the existing practice of using higher layer protocols here clearly just derives from the limitations of ASCII, an artifact of an era that we are hopefully about to leave behind us. Higher layer protocols such as SGML are fine for things like font selection and other formatting and logical structuring, but quotation marks and other punctuation are too much a part of the raw text for me to want to see them handled via hacks such as . Higher layer protocols should in my opinion not represent the actual textual content of the text, but give only auxiliary structuring and representation hints. Therefore I don't like to see markup for quotation marks, just as I don't like the idea of having to mark up conditional clauses, sentences, and perhaps even paragraphs (not sure about the last one though). > 2. The situation for Hyphen-Minus is quite similar. Agreed, it is equally confusing and keyboard entry conventions should be carefully standardized here as well. Mark Davis wrote on 1999-07-18 17:47 UTC: > There seems to be some misunderstanding. "The Unicode Standard, Version > 2.1" gives the following text (see > http://www.unicode.org/unicode/reports/tr8.html#3.6 Apostrophe Semantics > Errata): > > U+02BC MODIFIER LETTER APOSTROPHE is preferred where the character > is to represent a modifier letter (for example, in transliterations > to indicate a glottal stop.) In the latter case, it is also referred > to as a letter apostrophe. > > U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to > represent a punctuation mark, as in "We've been here before." In the > latter case, U+2019 is also referred to as a punctuation apostrophe. Excellent! I missed that 2.1 correction, and I am delighted to see that this was already fixed nicely. 
So U+02BC is one thing less to worry about and the Microsoft Word practice actually does conform to the standard. Thanks for the reply. So the rest is really up to the keyboard standards community to fix. Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 20-Jul-99 9:17:47-GMT,3045;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA17686 for ; Tue, 20 Jul 1999 05:17:47 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id CAA284110 ; Tue, 20 Jul 1999 02:13:09 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA11429; Tue, 20 Jul 99 01:59:08 -0700 Message-Id: <9907200859.AA11429@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii X-Uml-Sequence: 8713 (1999-07-20 08:58:55 GMT) From: Markus Kuhn To: Unicode List Date: Tue, 20 Jul 1999 01:58:53 -0700 (PDT) Subject: Re: Unicode in Source Code (Ada95 and Java) Murray Sargent wrote on 1999-07-20 01:16 UTC: > An example where nonASCII identifiers is really useful is in coding up > mathematical formulae that contain Greek letters. For example, a program is > much more readable if you use U+3B1 for alpha rather than spelling out the > name alpha. Similarly U+3C0 for pi. Hopefully C++ will follow Java's > excellent example and allow Unicode alphabetics in variable names. Ada95 is even younger than Java and it is the first ISO standardized programming language that was designed after the publication of ISO 10646-1. Of course, Ada95 - like Java - also uses UCS as its internal character set. However, the Ada95 revision team has explicitly decided not to follow the path of Java and they only allowed the Latin-1 letters in identifiers. 
The Ada community is very concerned about safety issues and about the readability of source code, because Ada is widely deployed today in safety-critical environments (most avionics software is written in Ada, for instance). Unicode contains quite a large number of characters that are difficult - if not impossible - to distinguish visually. A safety requirement for Ada identifiers is that it must be easy for human readers to decide whether two identifiers are different or equal. The presence of Unicode characters such as U+00D0, U+0110 and U+0189 introduces a lot of potential hazards that are best avoided by not allowing too rich a repertoire of characters in object identifiers. Note however that the Ada95 standard does allow implementations to offer "non-standard" optional modes that do allow additional UCS characters in identifiers. Have a look at: Ada95 Reference Manual, ISO/IEC 8652:1995(E), Section 2.1: Character Set, http://wuarchive.wustl.edu/languages/ada/userdocs/docadalt/rm95/02.htm http://www.cl.cam.ac.uk/~mgk25/ada.html Markus (who decided to use Ada95 for his PhD implementation project, because the language is at least as nice and modern as Java, but its compilers produce far more efficient native machine code.) -- Markus G. 
Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 23-Jul-99 3:13:02-GMT,4775;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA08993 for ; Thu, 22 Jul 1999 23:13:01 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA199974 ; Thu, 22 Jul 1999 20:07:56 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22187; Thu, 22 Jul 99 19:56:37 -0700 Message-Id: <9907230256.AA22187@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8854 (1999-07-23 02:56:28 GMT) From: Gianni Mariani To: Unicode List Cc: unicode@unicode.org Date: Thu, 22 Jul 1999 19:56:27 -0700 (PDT) Subject: RE: The future of UTF-8 The issue I have with BOM's is that if I have 2 "plain text" files and I do this kind of operation: type appendfile >> oldfile It's not guaranteed to work unless the consuming application processes multiple BOMs, in which case utf-16 and ucs4 become fully stateful from a consumer application's p.o.v. (albeit with only two states), since it needs to filter all incoming characters. The above operation works with all other "plain text" files including utf-8 without any "stateful" transitions. This kind of operation is really not uncommon. Take log files. If I have two co-operating applications on machines of different endianness writing to the same log, where one machine is big endian and one little endian, then the application needs to care about endianness when it's writing utf-16 but not so with utf-8. I can probably come up with some more examples. When utf-16 was born, there was no real reason to go with it, because at that point you have all the problems of multibyte encodings, and most of the programming community still likes using 8 bit chars; we still fight this inside MS with libs ported to CE. 
As you can tell, these are my opinions and not necessarily those of my employer. Anyhow, the other issue is that many applications that process wide chars are not utf-16 aware, while any internationalized 8 bit application that is multibyte aware is a whole lot easier to port to Unicode using utf-8. Where time is money, it's virtually impossible to justify spending the sort of time that's required to go to utf-16 when utf-8 can be just as effective. It's also relatively easy to write a string class that has both a utf-16 and utf-8 "view" of a string, making it virtually unnecessary to make an either-or decision, so you get to pick the best of both worlds. So, apologies for my earlier snappy comments, it wasn't intended that way, although the MS stock price may have had something to do with it :)) As always, highest Regards G -----Original Message----- From: kenw@sybase.com [mailto:kenw@sybase.com] Sent: Thursday, July 22, 1999 1:56 PM To: Unicode List Cc: unicode@unicode.org; kenw@sybase.com Subject: RE: The future of UTF-8 Gianni, > If you need to process BOM's (10646 signatures) it is then stateful. How so? The Unicode character encoding itself is not stateful. The UTF-16 encoding form is not stateful. The UTF-16BE and UTF-16LE UTF's (serializations) are not stateful. UTF-16 as a UTF (serialization) is ambiguous as to the byte order of the serialization. That ambiguity is resolved in one of several ways: 1. A higher order protocol. At which point, the data processing is not stateful. 2. By detection of a BOM. When the BOM is detected and interpreted, the data processing of the textual content is not stateful. 3. By heuristics. And while the heuristic processing itself might be stateful, once the outcome of the heuristic provides an answer for the byte order, subsequent processing is not stateful. And this is in effect no different than any heuristic applied to detect character set, whether that character set itself is a stateful encoding or not. 
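The first two resolutions in the list above (a higher order protocol, or detection of a BOM) can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not code from the thread: the function name `decode_utf16` and its `default_order` parameter are hypothetical, with `default_order` standing in for whatever a higher level protocol would say when no BOM is present. Heuristic detection (option 3) is deliberately left out.

```python
import codecs

def decode_utf16(data: bytes, default_order: str = "big") -> str:
    """Decode UTF-16 bytes, settling the byte order once, up front."""
    # Option 2: a leading BOM answers the byte-order question.
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # Option 1: no BOM, so rely on the (assumed) higher-level protocol.
    codec = "utf-16-be" if default_order == "big" else "utf-16-le"
    return data.decode(codec)
```

Once the byte order is chosen, the rest of the data is decoded with a fixed codec and no further mode switches, which is the sense in which the processing is described above as not stateful.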
The term "stateful", as applied to character encodings, usually is referring to architectures like ISO 2022, where the state induced by an escape sequence must be retained to interpret all subsequent bytes, until encountering another escape sequences changes the state, and thus the interpretation of the next run of bytes. That is quite different from determination of the byte polarity "state" on a data type before processing it. If that were the case, then you could equally well claim that processing of any integral datatype larger than a byte is "stateful" in a cross-platform environment. But that is diluting the term "stateful" in the character encoding context down to the point where it has nothing in common with its intended applicability. --Ken 23-Jul-99 3:55:37-GMT,1968;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA14427 for ; Thu, 22 Jul 1999 23:55:36 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id UAA38340 ; Thu, 22 Jul 1999 20:49:29 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA22953; Thu, 22 Jul 99 20:36:23 -0700 Message-Id: <9907230336.AA22953@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8856 (1999-07-23 03:36:13 GMT) From: "Paul Dempsey (Exchange)" To: Unicode List Cc: unicode@unicode.org Date: Thu, 22 Jul 1999 20:36:12 -0700 (PDT) Subject: RE: The future of UTF-8 > -----Original Message----- > From: Gianni Mariani [mailto:gianni@corp.webtv.net] > > The issue I have with BOM's is that if I have 2 "plain text" > files and I do this kind of operation: > > type appendfile >> oldfile > > It's not guarenteed to work unless the consuming application > processes multiple BOMS ... 
The reason this is not guaranteed to work is because the command processor that's doing "type" with redirection doesn't know about the file formats. It's the command processor that's defective, NOT the use of BOM/file signature. It is a trivial matter to write a process that correctly concatenates files with BOMs. I'm sure that someone on this list can promptly cough up a few lines of perl that does it. Your argument is not much different than expecting to be able to do a byte-wise concatenation of a Shift+JIS file with a codepage 1252 (Windows Western) file. These are both "plain text" files, but it fails miserably. I think that transparent byte-wise concatenation of files is a minor consideration when designing the file format. --- Paul 23-Jul-99 16:18:36-GMT,1998;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA08803 for ; Fri, 23 Jul 1999 12:18:34 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id IAA206820 ; Fri, 23 Jul 1999 08:59:56 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA28254; Fri, 23 Jul 99 08:47:36 -0700 Message-Id: <9907231547.AA28254@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Uml-Sequence: 8874 (1999-07-23 15:47:23 GMT) From: To: Unicode List Date: Fri, 23 Jul 1999 08:47:20 -0700 (PDT) Subject: RE: The future of UTF-8 On Thu, 22 Jul 1999, Paul Dempsey (Exchange) wrote: > > -----Original Message----- > > From: Gianni Mariani [mailto:gianni@corp.webtv.net] > > > > The issue I have with BOM's is that if I have 2 "plain text" > > files and I do this kind of operation: > > > > type appendfile >> oldfile > > > > It's not guarenteed to work unless the consuming application > > processes multiple BOMS ... 
> > The reason this is not guaranteed to work is because the command processor > that's doing "type" with redirection doesn't know about the file formats. > It's the command processor that's defective, NOT the use of BOM/file > signature. And if oldfile happens to be a sequential access file, a tape for example, the command processor rewinds to the beginning of the file, reads the BOM if it exists, seeks back to the end of the file, then somehow arranges to signal to the application the format that it should write its standard output should be? Even if you can avoid changing the individual applications by sticking a byte-flipper downstream of the "write" system call, determining the file format via a BOM is not always going to be a reasonable thing to do. -john 24-Jul-99 19:41:46-GMT,5255;000000000011 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA16914 for ; Sat, 24 Jul 1999 15:41:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA199992 ; Sat, 24 Jul 1999 12:34:55 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06999; Sat, 24 Jul 99 12:26:08 -0700 Message-Id: <9907241926.AA06999@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 X-Uml-Sequence: 8897 (1999-07-24 19:25:57 GMT) From: Markus Kuhn To: Unicode List Date: Sat, 24 Jul 1999 12:25:55 -0700 (PDT) Subject: Re: Support for symbol fonts Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id PAA16914 Erik van der Poel wrote on 1999-07-24 16:53 UTC: > > http://www.cl.cam.ac.uk/~mgk25/download/ucs-fonts* > > [...] The Times Roman font covers also the Adobe Symbol encoding. > > Just curious, but which ISO 10646 code points did you choose for the > Adobe Symbol characters that are not in 10646? 
For example: > > F8FC FC # RIGHT CURLY BRACKET TOP # bracerighttp (CUS) > F8FD FD # RIGHT CURLY BRACKET MID # bracerightmid (CUS) > F8FE FE # RIGHT CURLY BRACKET BOTTOM # bracerightbt (CUS) I did the conversion of the Adobe fonts to ISO 10646-1 based on the Adobe glyph names found in the fonts (because this catches also unencoded glyphs that are hidden in many of the X11 BDF files but unavailable under any ISO 8859-1 code), based on the following Adobe table, which maps Postscript glyph names to UCS: http://partners.adobe.com/supportservice/devrelations/typeforum/glyphlist.txt I dropped all characters from the font which are neither in the above list nor have already a uniXXXX name. This includes the above bracket fragment characters, for which there exists no Unicode equivalent. By the way, the bracket fragments were in Frank da Cruz's terminal symbol proposal, for which I completely forgot to scan and publish a number of exhibits that Frank has sent me as a basis for further discussion. Will be on the web next week. Sorry for the delay. I could probably put the Adobe Symbol bracket fragments into the private use section, as suggested by the Adobe mapping tables found on the Unicode ftp server. However, I do not like these particular characters anyway. I believe that variable size parentheses, braces and brackets should really be drawn using graphics primitives (spline and line terminates areas) from a simple algorithmic description in the style sheet language. Putting them together from font pieces is highly non-portable and also does not give you the same quality that an algorithmic specification could provide. If you really want to have these bracket parts for MathML, I do urge you to reconsider this entire approach of relying on the font here. 
The use of special math building blocks in TeX was *THE* primary reason why TeX is hardly ever used with any other fonts than Knuth's Computer Modern, because all others lack the bracket parts in exactly the alignment in which TeX requires them. If TeX had used graphics primitives to draw variable sized parentheses and square roots, we could much more easily use any arbitrary commercial off-the-shelf font with TeX in mathematical texts. Please do not repeat the same mistake again and lock the math layout functionality to a single specific font. Please take the variable shapes from the style sheet and not from the font!

> I believe Frank wanted to know about legacy fonts so that Mozilla could
> try to support those in case the user has not installed the new 10646
> fonts yet. His question arose from a discussion of MathML support in
> Mozilla.

As I said, the Adobe Symbol font is *very* small and will lead to more frustration than satisfaction among MathML users. It is so small that it is not necessarily better than nothing. Potential MathML users are today TeX users. They will have the expectation that at least all TeX symbols are available, so a real ISO 10646-1 font is clearly the way to go here.

> > The X server will be extended by a simple
> > conversion function that can generate on-the-fly legacy encodings such
> > as CP1252, KOI-8, CP1252, JIS X 208, etc. from the ISO 10646-1 encoded
> > source fonts.
>
> I'm pretty sure you are aware of the Han unification issues, but I think
> you would be more successful if you treat CJK with care. I.e. when
> making a JIS X 0208 font available, make sure the glyphs are
> "Japanese-style" and not Chinese.

Oh yes, we have at least one Japanese member in the XFree86 team who is quite vocal about these issues.
:) I have started to use the convention that ADD_STYLE_NAME is set to "ja" in the XLFD of Japanese UCS fonts, such that we could indeed restrict the set of fonts that we advertise as being available under a JIS encoding. Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: 24-Jul-99 20:12:18-GMT,1864;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA23315 for ; Sat, 24 Jul 1999 16:12:18 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id NAA42766 ; Sat, 24 Jul 1999 13:06:14 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA07292; Sat, 24 Jul 99 12:50:24 -0700 Message-Id: <9907241950.AA07292@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 8898 (1999-07-24 19:50:14 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Sat, 24 Jul 1999 12:50:12 -0700 (PDT) Subject: Re: Support for symbol fonts Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK, wrote: > By the way, the bracket fragments were in Frank da Cruz's terminal > symbol proposal, for which I completely forgot to scan and publish a > number of exhibits that Frank has sent me as a basis for further > discussion. Will be on the web next week. Sorry for the delay. > Better late than never :-) I also promised to send a Unicode plain text proposal, but then real life intruded. It's not forgotten. Meanwhile, there might be some hope for the bracket pieces in the math plain-text work. Again, the rationale is to be able to construct mathematical expressions on character-cell devices where we don't have GUI fonts and rendering engines (primarily when emulating terminals and printers that do this in applications that are Unicode-based). In this case we don't have to worry too much about alignment since these devices are monospaced. 
Obviously bracket pieces are not the preferred method for rendering math in the GUI environment. - Frank 28-Jul-99 3:05:35-GMT,5113;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA05193 for ; Tue, 27 Jul 1999 23:05:34 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id TAA195056 ; Tue, 27 Jul 1999 19:59:13 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA18292; Tue, 27 Jul 99 19:43:57 -0700 Message-Id: <9907280243.AA18292@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-Uml-Sequence: 8925 (1999-07-28 02:43:35 GMT) From: Jonathan Rosenne To: Unicode List Cc: Unicode List Date: Tue, 27 Jul 1999 19:43:33 -0700 (PDT) Subject: Re: Apostrophes, quotation marks, keyboards and typography Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by watsun.cc.columbia.edu id XAA05193 I don't think this violates the idea of "plain text". Plain text is an interchange concept, while we are talking about input methods. Once you wish to allow plain text to include more than the small number of characters that can be conveniently provided by the keyboard you have to provide more sophisticated input methods. Quotation marks are just one case, Unicode contains many more characters an author may wish to use that are not in his keyboard. Hexadecimal is not a solution for the general public. I suggest we take a look at how things used to be done before computers. In those ancient times, one would give a printer a manuscript (= hand written paper), which was marked up either by the author or by an editor, and the printer would set the text in print. This was the grandfather of mark-up languages, later standardized in SGML. 
In those manuscripts, the text could not indicate precisely various typographic distinctions, such as quotation marks, and in those cases markup was used.

It is much more user friendly to have to write text, or to select the text and click on a "quotation" menu item, indicating intent, rather than <&lsqm>text<&rsqm> or something similar, or some fancy keyboard combination, in which the author has to specify the precise implications of his intent.

How will mathematical symbols be entered in plain text?

Jony

At 02:23 19/07/99 -0700, Markus Kuhn wrote:
>Jonathan Rosenne wrote on 1999-07-18 22:14 UTC:
>> 1. this is one of the reasons for text in HTML. The processor can
>> substitute the correct character.
>>
>> In general, any word processor should allow the user to style the text as a
>> quotation, rather than require him to type typographical characters.
>
>I personally am not convinced that higher layer protocols should be used
>to handle punctuation. This completely violates my concept of plain
>text, and the existing practice of using higher layer protocols here
>clearly just derives from the limitations of ASCII, an artifact of an
>era that we are hopefully about to leave behind us. Higher layer
>protocols such as SGML are fine for things like font selection and other
>formatting and logical structuring, but quotation marks and other
>punctuation are too much part of the raw text than that I would like to
>see them handled via hacks such as . Higher layer protocols should in
>my opinion not represent the actual textual content of the text, but
>give only auxiliary structuring and representation hints. Therefore I
>don't like to see markup for quotation marks, just as I don't like the
>idea to have to markup conditional clauses, sentences, and perhaps even
>paragraphs (not sure about the last one though).
>
>> 2. The situation for Hyphen-Minus is quite similar.
> >Agreed, it is equally confusing and keyboard entry conventions should be >carefully standardized here as well. > >Mark Davis wrote on 1999-07-18 17:47 UTC: >> There seems to be some misunderstanding. "The Unicode Standard, Version >> 2.1" gives the following text (see >> http://www.unicode.org/unicode/reports/tr8.html#3.6 Apostrophe Semantics >> Errata): >> >> U+02BC MODIFIER LETTER APOSTROPHE is preferred where the character >> is to represent a modifier letter (for example, in transliterations >> to indicate a glottal stop.) In the latter case, it is also referred >> to as a letter apostrophe. >> >> U+2019 RIGHT SINGLE QUOTATION MARK is preferred where the character is to >> represent a punctuation mark, as in "We've been here before." In the >> latter case, U+2019 is also referred to as a punctuation apostrophe. > >Excellent! I missed that 2.1 correction, and I am delighted to see that >this was already fixed nicely. So U+02BC is one thing less to worry >about and the Microsoft Word practice actually does conform to the >standard. Thanks for the reply. > >So the rest is really up to the keyboard standards community to fix. > >Markus > >-- >Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK >Email: mkuhn at acm.org, WWW: > 22-Aug-99 19:27:34-GMT,5598;000000000001 Return-Path: Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id PAA28045 for fdc; Sun, 22 Aug 1999 15:27:34 -0400 (EDT) Date: Sun, 22 Aug 1999 15:27:34 -0400 (EDT) From: Frank da Cruz Message-Id: <199908221927.PAA28045@watsun.cc.columbia.edu> To: fdc@watsun.cc.columbia.edu Path: newsmaster.cc.columbia.edu!panix!howland.erols.net!news.maxwell.syr.edu!nntp.ntr.net!remarQ60!rQdQ!supernews.com!remarQ.com!corp.supernews.com!not-for-mail From: "John E. 
Malmberg"
Newsgroups: comp.os.vms
Subject: Re: TEXT is the format for comp.os.vms (was: Help - crashing)
Date: Sun, 22 Aug 1999 11:42:08 -0500
Organization: Posted via Supernews, http://www.supernews.com
Lines: 95
Message-ID:
References: <1c2901beec05$b19387e0$020a0a0a@wizard.xile.realm> <37BF93E6.BDA69B0@hct.ac.ae> <37BFB95C.77B1C1F7@hct.ac.ae> <1999Aug22.092016.1@eisner>
X-Complaints-To: newsabuse@supernews.com
X-Newsreader: Microsoft Outlook Express 4.72.3110.5
X-MimeOLE: Produced By Microsoft MimeOLE V4.72.3110.3
Xref: newsmaster.cc.columbia.edu comp.os.vms:217594

Text-Only is the format of most newsgroups.

To make things perfectly clear, MIME and its friends are not wanted in newsgroups for the following reasons. And it has nothing to do with OpenVMS. It is for them to be of the most use, and that includes accommodating those users with "backward" technology.

Posters that violate this Netiquette will either be politely reminded of the conventions or will be ignored. This is not a moderated newsgroup/mailing list, so that is the only means of enforcing the convention. In the past, most posters have taken the hint from one polite reminder, and a few needed to learn how to adjust their mail/news sending program.

And as typical in most societies, many people will ignore RUDE behavior, hoping that the person will realize their gaffe, or that someone else will explain things.

1. The contents of the newsgroups are automatically collected and archived for later searches. In many cases the archiver simply collects the data and jams it into one file based on a size limit or an age limit. There are multiple archives, some public ones, and some private ones that force you to view advertisements. Most of these were put in place before MIME messages were considered, or any type of attachment for that matter. Some of these archivers ignore ALL attachments, but most of the ones I have seen simply jam the attachment to the end of the message.
Since it is good Netiquette to search these archives before posting a question, it becomes a royal pain to open an archive to get the information the search engine says is in it, and find the pages of hexdumps that typically follow some filename xxxxxx.VCF in it. The HTML formatted stuff is also hard to read. Since it is stuffed between the normal plain text stuff, even a MIME enabled reader will not translate it. And there is another format that puts "=20" at the ends of all the lines along with some other random stuff.

2. Many users of Usenet do not have access to a newsreader, often because this access is blocked at the corporate firewall. They also do not want to filter out their important messages from the volume of messages that a newsgroup can generate. So they set their mail delivery to DIGEST mode. In DIGEST mode, all MIME stuff gets delivered as described in point 1.

3. Many Corporate E-Mail systems still can not handle MIME. The system that I use at work just got the capability early this year. Prior to that, a MIME message was treated as follows. First I would receive a message with a title and a blank body. Then I would receive a message with no title, saying that a MIME encoded message had been received. A bit later, each attachment would show up in a separate message with no title. Other messages can be randomly interposed between them.

4. After the Melissa adventure, many corporate sites are putting in E-mail filters that will block bad messages. The first pass was to stop all messages with the indicated title. That of course is not sufficient for the long term. There are reports in the trade press that some companies are returning to sender any HTML formatted document as a precaution. The intended receiver may get a notice of the rejection just in case the E-Mail is important.

5.
Given the state that E-Mail is in today, especially in a corporate environment, it would not look good to send MIME stuff to a person you want to be in a business relationship with, if you do not absolutely know that their mail software can handle it. There are still many users of IBM OFFICE-VISION getting their mail on 3270 terminals. This type of behavior can damage a business relationship. Especially if the customer is running OUTLOOK and a sales / marketing critter mails a HTML document that contains a virus. The bottom line is that it is RUDE of the sender to assume that the recipient can receive anything but plain text. The mime stuff is great, once you have established that the recipient can handle it. The VCF stuff to verify a sender's identity should be reserved for those receivers that request it. The ones that do not request it can not use it, and it is useless garbage to them. It is just wasting bandwidth and storage space on mail and news servers. -John By the way, MIME can include PostScript, REGIS, and SIXEL. My OpenVMS systems can handle these but I know that most M$soft can not, and most UN*X can not handle all three. 
23-Aug-99 16:13:27-GMT,4147;000000000001
Return-Path:
Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA00633 for ; Mon, 23 Aug 1999 12:13:26 -0400 (EDT)
Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA250042 ; Mon, 23 Aug 1999 09:08:52 -0700
Received: by unicode.org (NX5.67g/NX3.0S) id AA02285; Mon, 23 Aug 99 09:00:07 -0700
Message-Id: <9908231600.AA02285@unicode.org>
Errors-To: uni-bounce@unicode.org
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Uml-Sequence: 9372 (1999-08-23 15:59:56 GMT)
From: peter_constable@sil.org
To: Unicode List
Date: Mon, 23 Aug 1999 08:59:55 -0700 (PDT)
Subject: Re: New phonemic writing system and IPA usage
Content-Transfer-Encoding: 7bit

>>>The reason English is interesting to learn is not any fundamental property
>of English, more that there is a huge amount of _written_ information and
>literate people, in English. Changing English orthography would break
>that.
> Changing English orthography would break
>that.
>
>
>(M.E.) regularizing it with minor corrections
> according to Wijk's very sensible scheme would not.
>
> (Peter) But making as drastic a change as to adopt CC would!

(JM) That's a matter of opinion!

JoAnne, how can you say that this is a matter of opinion? As soon as a generation grows up learning to read and write English using only CC, the majority will only have access to recent documents; documents in the old orthography won't spontaneously transform themselves. Humanity has a *very, very, very huge* investment in published and unpublished documents in English using the existing orthography, and there is a probability of 0.00 +/- 0% that we want to throw that away, or that we want to limit access to that information to a minority that chose to learn the old orthography in addition to the new CC-based standard.
I don't think even you can disagree with that. And if we will want to continue to teach our children to read the old orthography, why would we ever consider putting ourselves through the trauma of replacing bad, old Roman script-based English orthography with CC? There is a recent case of a language community changing their orthography from one script to an unrelated script: Turkish was written in Arabic until the early part of this century, and since then in Roman. This was possible because: - the language community was pretty well limited to one nation, - the literacy rate was not that high, - the old script was not that well suited for representing the phonology of the language, - the new script was much better suited to represent the phonology of the language, - there was not a really large corpus of books existing that used the old script, and - there was an authoritarian government that was able to impose the reform on the entire language community. *None* of these are true of English. I have now contributed comments that relate to semiotic issues, to issues of the psychology and physiology of reading and writing, and to sociolinguistic issues of attitude and usage. I also threw in various comments on historical linguistic issues along the way. So, there shouldn't be any doubt of my opinion of introducing CC as a replacement for existing English orthography. While some of us may want to pursue the idea of writing English using CC as one of personal interest, we should not for a moment fool ourselves into thinking that CC could possibly become the conventional way of writing English, or even a conventional way of writing English. End of diatribe. 
Peter 27-Aug-99 19:17:45-GMT,2634;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA25478 for ; Fri, 27 Aug 1999 15:17:44 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA323172 ; Fri, 27 Aug 1999 12:06:04 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA06917; Fri, 27 Aug 99 11:49:34 -0700 Message-Id: <9908271849.AA06917@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 (generated by tm-edit 7.104) Content-Type: text/plain; charset=US-ASCII X-Uml-Sequence: 9465 (1999-08-27 18:49:22 GMT) From: Juliusz Chroboczek To: Unicode List Date: Fri, 27 Aug 1999 11:49:19 -0700 (PDT) Subject: Re: Normalization Form KC for Linux Rick McGowan : >> More formally, the preferred way of encoding text in Unicode under >> Linux should be Normalization Form KC as defined in Unicode >> Technical Report #15 RM> Gosh, I don't approve. And I've been using Unix systems for many RM> years. The most flexible kind of implementation would prefer RM> decomposed sequences. In any case, enlightened systems would RM> accept anything and massage as needed to fit the particular RM> application instead of forcing (or "suggesting") the user to run RM> everything through the meat grinder first... As I understand it, Markus was speaking about the interchange formats, including, but not limited to, file formats and IPC formats. It is expected that simple applications will only be able to accept precomposed forms, while enlightened ones (I like the term) will accept anything. Therefore, requesting that applications *write* precomposed forms in preference to combining characters maximises the chances of interchange between simple and complex applications. Complex applications are still expected to accept arbitrary combining characters; they just should avoid producing them whenever possible. 
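[Editorial sketch, not part of the original message.] The policy described above, accept any form on input but write precomposed characters on output, can be illustrated with Python's standard `unicodedata` module. The helper name `for_interchange` is hypothetical. The default of Normalization Form C is an assumption made for this sketch; the thread's subject proposes Form KC, and the choice between the two is left open here via the `form` parameter.

```python
import unicodedata


def for_interchange(text: str, form: str = "NFC") -> str:
    """Normalize text before writing it to a file or another process.

    Input may use precomposed characters, combining sequences, or a mix;
    output prefers precomposed characters wherever Unicode defines them.
    """
    return unicodedata.normalize(form, text)


decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

# Both spellings come out as the single precomposed character.
print(for_interchange(decomposed) == precomposed)    # True
print(for_interchange(precomposed) == precomposed)   # True

# Form KC additionally folds compatibility characters such as the fi ligature.
print(for_interchange("\ufb01", "NFKC"))             # fi
```

A simple application that only understands precomposed characters can read this output directly, while a more capable one loses nothing by also accepting combining sequences on input.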
(The question of unification of compatibility forms -- C vs. KC -- is a different issue altogether; not one I would dare to claim that I am even vaguely not totally incompetent to have an opinion on.) RM> In any case, I think Unix community tends in general to be very RM> very confused about the distinction between how data exists in RM> storage and what appears on one's screen/window/emulator. While to a certain extent true of the Unix-like community in general, this is not a fair assessment of Markus' work. J. 27-Aug-99 21:23:38-GMT,4459;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id RAA25860 for ; Fri, 27 Aug 1999 17:23:37 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id OAA253990 ; Fri, 27 Aug 1999 14:11:44 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA07561; Fri, 27 Aug 99 14:02:19 -0700 Message-Id: <9908272102.AA07561@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 9468 (1999-08-27 21:02:10 GMT) From: Rick McGowan To: Unicode List Date: Fri, 27 Aug 1999 14:02:09 -0700 (PDT) Subject: Re: Normalization Form KC for Linux Juliusz Chroboczek ...: > It is expected that simple applications will only be able to accept > precomposed forms I'd have to ask Why? > Complex applications are still expected to accept arbitrary combining > characters; they just should avoid producing them whenever possible. Why? That's about the opposite of what I'd argue. In my experience most of the drudgery and complexity of display processing for Unicode is in dealing with the multiple spellings; not with just decomposed or composed sequences. I guess maybe I should just shut up because my argument is really about something different than normalization itself, it's about architectures that require applications to care about particular details of data normalization. 
What really appears to be going on in the world of Unix is that generally in these systems the "legacy" or existing methods of string & character handling are being bolstered to deal with this new kind of data for which they are an inappropriate level of API. Instead of architecting them to remove the need for application programmers to worry about all this detail, the detail is being exported to the programmer in the same way that it was when the encodings were "simpler". I think it's the wrong way to go about the architecture.

As I see it, systems that require "all" applications to mess around with the low-level details of what is or is not stored as a combining sequence in some string that's passing through some process are mis-architected from the start. Only the lowest level of data-streaming and I/O of file formats should be dealing with that.

GUI & UI systems that sit on top of Unix foundations appear in general to be architected in ways that expose details, like composition/decomposition normalization of the data, excruciating details of codesets and data formats, that should be of no concern to "applications" written on those platforms. Unfortunately, the architects tend to get hung up on how to expose these details by extensive APIs, and argue a lot about details that should be of as little concern to "application programs" as assembly language is to Java programs.

If one is going to re-write the set of typical Unix foundation-level tools, I think there are better ways to write them and different kinds of API that are more appropriate for better abstraction away from the minutiae of character encodings and normalization. That would free the application programmer from such details, instead of causing the application programmer to be acutely concerned with such details.

So when I see something like this:

> One day, combining characters will surely be supported under Linux,
>...
>> More formally, the preferred way of encoding text in Unicode under >> Linux should be Normalization Form KC as defined in Unicode >> Technical Report #15 It makes me cringe. This is saying that for everything written on this entire OS -- all the UI, the tools, protocols, applications, etc. that should be the "preferred" way of encoding simply because the display model is broken and the architects have been going in the wrong direction for years and wish to continue down that path because of the overwhelming weight of their legacy code. I think it's more appropriate to leave the specification of normalization requirements up to particular protocols or functional groups, not "Unicode under Linux" as a whole. In the long run, Linux would be much better off going the opposite direction for most string & display handling. In my opinion. Enough ranting for the day... Rick 30-Aug-99 19:42:09-GMT,5878;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA21417 for ; Mon, 30 Aug 1999 15:42:07 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id MAA251190 ; Mon, 30 Aug 1999 12:29:43 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA26449; Mon, 30 Aug 99 12:17:24 -0700 Message-Id: <9908301917.AA26449@unicode.org> Errors-To: uni-bounce@unicode.org X-Uml-Sequence: 9519 (1999-08-30 19:17:14 GMT) From: Frank da Cruz To: Unicode List Cc: Unicode List Date: Mon, 30 Aug 1999 12:17:12 -0700 (PDT) Subject: Re: Normalization Form KC for Linux > Maybe I really should shut up... I guess I'm bitterly disappointed in how > the Unix and Posix community has not grasped the Unicode textual concepts > and progressed or led the way in all of this. The community seems so > insular and fossilized, when there are so many good things about Unix that > have been poorly imitated by other popular platforms. 
These days the > industry is moving right along doing all kinds of interesting display and > many scripts & languages, while the academic Unix (and Posix) folks are > complaining that it's too hard or can't be done at all.[1] > I think this is reflective of the overall situation with computing today. You can only change what you can control. In a monolithic environment like Windows or the Macintosh, a single company has control and can do what it likes, but perhaps more to the point, these are closed boxes in which the application has more or less direct access to the keyboard, screen, fonts, and font info -- all the pieces of the puzzle. Contrast this with Unix. First of all (obviously) there is not just one Unix, but many of them (the UNIX C-Kermit makefile alone currently contains about 500 targets). Nobody controls all this. Each vendor goes their own way at their own pace. The many well-known utilities (command-line or "video") have long since "forked". The existing code base is staggering, and most of it is nondisclosed (Linux, *BSD, etc, are the exception (to "nondisclosed", if not to "forked")). Makers of third-party applications for Unix (and VMS, etc), if they want to move forward, can't (in most cases) depend on the underlying platforms for assistance. Even when they can, such assistance is inconsistent, forcing them to develop their own portable tools and libraries, which tend to meet their immediate needs but fall short of Nirvana. Perhaps more to the point, however, is the fact that Unix (and VMS and other "traditional" platforms) are open to many kinds of access: the workstation console, usually some sort of GUI (also on the console), X (on the console or from a remote X server), and then plain old character-mode remote access via modem, Telnet, Rlogin, X.25 PAD, and the like. 
The latter mode, which is branded "legacy" as if it had no value or place in the modern world, is (I like to maintain, and I think with good reason) seeing wider use now than ever before and although many wish it would go away, others would like to stay active in this area and serve the people who depend on it, not only for old times' sake, but also because it is a legitimate, viable, and open form of access that everybody should be able to fall back upon as the more advanced and "interesting" forms change out from under them with bewildering speed. When access is this open -- which is a *good* thing -- no particular entity has control over the user interface. It is a matter of coordinating the behavior of intrinsically unrelated processes. So questions come up here that never bother us when we are writing (say) a word processor. Which end handles bidirectionality of Hebrew? Which end is responsible for the detailed appearance of the screen? And now the questions of pre- and de- composition. Makers of third party applications only control one piece. The underlying platform is likely not to have any Unicode support at all (VMS, most UNIXes, IBM mainframes, etc), so the extent to which we support Unicode in our applications depends on the hosts that we access with them. In the case of terminal emulation (xterm, Kermit, etc), if the host is not executing any form of BIDI algorithm, or ensuring some canonical form for composed characters, etc (since it is totally ignorant of such matters), it does not necessarily follow that the terminal must compensate, since for applications where the screen is treated as a matrix of boxes in which the location of different items must be known and fixed (and this can include dumb scrolling applications that display text in columns), the host and terminal must cooperate. ISO 10646 includes the concepts of levels of compliance, including Implementation Level 1 in which combining characters are not allowed. 
Unicode Normalization Form C tends to amount to the same thing. If these "subsets" were not to be used, they shouldn't have been defined. But in fact, I believe they are useful in open-access environments where control is distributed among "loosely cooperating" processes. Perhaps there is indeed a tradeoff between open access and the ability to support complex scripts -- if not in theory, then almost certainly in practice. Of course, we do have one example of Unix taken to the next level: Plan 9. But even there -- where all text, even internally, is UTF-8 -- we still see no provision for BIDI or combining sequences: Implementation Level 1 in action. Everyone agrees it would be better to have no restrictions, but so far I don't think anybody has considered the plain-text terminal-host access model sufficiently to find a way around them. - Frank 10-Sep-99 16:30:02-GMT,3828;000000000001 Return-Path: Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA11277 for ; Fri, 10 Sep 1999 12:29:59 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with SMTP id JAA31860 ; Fri, 10 Sep 1999 09:19:15 -0700 Received: by unicode.org (NX5.67g/NX3.0S) id AA15929; Fri, 10 Sep 99 08:48:25 -0700 Message-Id: <9909101548.AA15929@unicode.org> Errors-To: uni-bounce@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-Uml-Sequence: 9618 (1999-09-10 15:48:15 GMT) From: peter_constable@sil.org To: Unicode List Date: Fri, 10 Sep 1999 08:48:14 -0700 (PDT) Subject: Re: IPA a vowels Content-Transfer-Encoding: 7bit >The other side of this issue is coding ambiguity. Say you have some African language which uses an IPA-influenced orthography, will you use LATIN SMALL LETTER A or your new homoglyph LATIN SMALL LETTER A WITH HOOK here? 
>I believe, the conclusion is that we should not think in terms of being able to add IPA highly consistently to every font there is. Only a few font styles are really useful for being extended into good IPA fonts, so if you write dictionaries, linguistic textbooks, etc., you should make sure you use one of these font styles. Do not expect that every Unicode font will contain every Unicode character in high quality. Unicode should be more seen as a scheme to encode characters, not as a repertoire that from now on every font has to cover entirely. I agree that we probably don't want every font to be used for IPA. But there still is an issue of encoding ambiguity when dealing with plain text. Perhaps the answer, though, is that, strictly speaking, plain text is effectively meaningless. Knowing the encoding tells you how to get one level of semantics, i.e. how to translate the bytes into abstract characters, but you still don't know what the sequence of characters mean in terms of any human language until the language is identified. If you get a plaintext file and it contains "See Dick run." Then you'll make an assumption about the intended language, and that assumption will probably be valid. But it's an assumption nonetheless. When there is real potential ambiguity, there is no recourse but to provide some markup: See Dick run. (undoubtedly means something derogatory about the listener's grandmother). If the plaintext happens to mix text in IPA and text in a language that uses U+0061, then if there is confusion it may be necessary to have markup along the lines of The Blahurg word for ... pronounced, " ...a... ", and means ... upset. Of course, I probably wouldn't complain if there was a separate character LATIN IPA SMALL LETTER A that disambiguated this for plain text. (Nobody should be confused about the purpose of a character with such a name.) Ditto for other cases. >For every font style, there are Unicode characters that will not go well with it. 
High-quality fonts will therefore always be Unicode subsets only, and applications such as Web browsers who can prevent certain characters from being used in certain style contexts will brutally fall-back to other styles (e.g., pick math operators from the upright font even inside italic text). So let it be written; so let it be done. Peter 18-Sep-99 9:42:10-GMT,4286;000000000001 Return-Path: Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id FAA27963 for ; Sat, 18 Sep 1999 05:42:09 -0400 (EDT) Received: by humbolt.nl.linux.org id ; Sat, 18 Sep 1999 11:41:51 +0200 Received: from deimos.worldonline.nl ([195.241.48.136]:57730 "EHLO deimos.worldonline.nl" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Sat, 18 Sep 1999 11:41:14 +0200 Received: from moolenaar.net (vp208-34.worldonline.nl [195.241.208.34]) by deimos.worldonline.nl (8.8.5/8.8.5) with ESMTP id LAA11606; Sat, 18 Sep 1999 11:41:11 +0200 (MET DST) Received: from masaka.moolenaar.net (localhost.moolenaar.net [127.0.0.1]) by moolenaar.net (8.9.1/8.9.1) with ESMTP id LAA00348; Sat, 18 Sep 1999 11:58:07 +0200 (CEST) Message-Id: <199909180958.LAA00348@moolenaar.net> To: Markus Kuhn Cc: linux-utf8@humbolt.geo.uu.nl Subject: Re: UTF-8 line feeds versus LS/PS In-Reply-To: From: Bram Moolenaar Date: Sat, 18 Sep 1999 11:58:07 +0200 X-Orcpt: rfc822;linux-utf8@humbolt.geo.uu.nl Sender: owner-linux-utf8@humbolt.geo.uu.nl Precedence: bulk Reply-To: linux-utf8@humbolt.geo.uu.nl Markus Kuhn wrote: > Side remark: > > It would indeed be nice to also introduce under Unix a text format, > where paragraphs are formatted at display time (like Word does), and > where soft linebreaks inside paragraphs are not saved to the file. 
The > main advantage here is that diffs become significantly compacter > (assuming they would operate on byte ranges, not on lines), because > changing a few words followed by reformatting a paragraph moves around > all these LF bytes that then the revision control system has to take > track of, which is not very elegant at the moment. > > It would indeed be very helpful, if emacs, vim, less, etc. had a mode > similar to the Windows notepad and Word, where paragraphs are > essentially long lines without any LF in them. LF-free paragraphs would > especially be convenient for editing plaintext-files that will later be > reformatted anyway and where line length doesn't matter at all, e.g. > HTML and TeX. This is true. The reason Vim doesn't support automatic paragraph formatting is that there is no "soft" line separator. I'm glad there is something we can agree on! You can work with single-line paragraphs in Vim by setting the 'linebreak' option. This might be the mode you are looking for. See ":help 'linebreak'" for more information. One disadvantage is that the width of the wrapped lines depends on the width of the terminal. If you view the file on a different terminal it may look different. It might be different again when you print it. That might not always be what you want. Wordstar (do you remember that?) had a soft linebreak character for this (CR with the 8th bit set). But only Wordstar supported it, thus it wasn't very useful. You always had to print the file from Wordstar. > However, all this is again *completely* independent and orthogonal to > Unicode. Unformatted plain-text files would also be nice with just > ASCII, and LF is as good a paragraph separator as Unicode's PS. I'd > rather not use LS and PS at all on POSIX systems, because it would break > a tremendous amount of software, even though I do appreciate that the > clearly-defined LS/PS semantics does have its attractions and is much > nicer in UCS-2 files than the historic CR/LF/NL mess. 
Just using NL should work fine. As far as I know LF is just another name for NL, it's the same character (hex 0x0A). A paragraph could be ended by an empty line (in the file that's a double NL). We could even recommend this. Perhaps we should add a note about this in appropriate places? -- hundred-and-one symptoms of being an internet addict: 102. When filling out your driver's license application, you give your IP address. --/-/---- Bram Moolenaar ---- Bram@moolenaar.net ---- Bram@vim.org ---\-\-- \ \ www.vim.org/iccf www.moolenaar.net www.vim.org / / - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 18-Sep-99 10:58:57-GMT,2849;000000000001 Return-Path: Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id GAA08685 for ; Sat, 18 Sep 1999 06:58:56 -0400 (EDT) Received: by humbolt.nl.linux.org id ; Sat, 18 Sep 1999 12:58:37 +0200 Received: from khms.westfalen.de ([193.174.5.20]:783 "EHLO khms.westfalen.de" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Sat, 18 Sep 1999 12:58:13 +0200 Received: from root by khms.westfalen.de with local-bsmtp (Exim 3.03 #1) id 11SIBl-000423-00 (Debian); Sat, 18 Sep 1999 12:57:53 +0200 Received: by khms.westfalen.de (CrossPoint v3.11 R/C435); 18 Sep 1999 12:57:04 +0200 Date: 18 Sep 1999 10:56:00 +0200 From: kaih@khms.westfalen.de (Kai Henningsen) To: linux-utf8@humbolt.geo.uu.nl Message-ID: <7P6x5KJmw-B@khms.westfalen.de> In-Reply-To: Subject: Re: UTF-8 keyboard mode X-Mailer: CrossPoint v3.11 R/C435 MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit Organization: Organisation? Me?! Are you kidding? References: X-No-Junk-Mail: I do not want to get *any* junk mail. Comment: Unsolicited commercial mail will incur an US$100 handling fee per received mail. 
X-Fix-Your-Modem: +++ATS2=255&WO1 X-Orcpt: rfc822;linux-utf8@humbolt.geo.uu.nl Sender: owner-linux-utf8@humbolt.geo.uu.nl Precedence: bulk Reply-To: linux-utf8@humbolt.geo.uu.nl Andries.Brouwer@cwi.nl wrote on 18.09.99 in : > >From kaih: Mac example. > > Yes. For us this would be a bit more complicated, because people > really use the power of the keyboard handler. > Any key can be a modifier key, and people use for example F12 > to switch between a dvorak and a qwerty layout by loading > a large keymap where F12 is a locking shift. > Similar things are done by Greeks, Russians etc to switch between > character sets. I don't see why that would create any problem. The kernel knows what a modifier key is, right? IIRC, it already has a bitmap-type interface via IOCTLs. > I thought of having /dev/kbd with packets for the past 256 keystrokes > or so, where these packets are thrown away if no-one reads them. > You really want these bytes in the normal input stream? That's the only way it'll work over a telnet connection. > Sounds like a new keyboard state, and again difficult to get out of > if this program that understands the stream crashes. You could define a key sequence that restores the keyboard to normal mode. Something like Alt-SysReq-R ... uh, we already have that one. 
MfG Kai - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 18-Sep-99 16:45:52-GMT,6832;000000000001 Return-Path: Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA08272 for ; Sat, 18 Sep 1999 12:45:51 -0400 (EDT) Received: by humbolt.nl.linux.org id ; Sat, 18 Sep 1999 18:45:34 +0200 Received: from heaton.cl.cam.ac.uk ([128.232.32.11]:42246 "EHLO heaton.cl.cam.ac.uk" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Sat, 18 Sep 1999 18:45:08 +0200 Received: from trillium.cl.cam.ac.uk ([128.232.8.5] helo=cl.cam.ac.uk ident=mgk25) by heaton.cl.cam.ac.uk with esmtp (Exim 3.01 #1) id 11SNbm-0004EV-00 for linux-utf8@humbolt.geo.uu.nl; Sat, 18 Sep 1999 17:45:06 +0100 X-Mailer: exmh version 2.0.2+CL 2/24/98 To: linux-utf8@humbolt.geo.uu.nl Subject: Re: Character set tagging considered harmful In-reply-to: Your message of "Sat, 18 Sep 1999 14:23:48 +0200." <199909181223.OAA00813@moolenaar.net> X-URL: http://www.cl.cam.ac.uk/~mgk25/ Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Sat, 18 Sep 1999 17:45:04 +0100 From: Markus Kuhn Message-Id: X-Orcpt: rfc822;linux-utf8@humbolt.geo.uu.nl Sender: owner-linux-utf8@humbolt.geo.uu.nl Precedence: bulk Reply-To: linux-utf8@humbolt.geo.uu.nl Bram Moolenaar wrote on 1999-09-18 12:23 UTC: > I wonder, is UCS-4 the maximum that is in use today? More than that. The UCS-2 range is the maximum in use today. There are no characters yet defined outside the range U+0000 to U+FFFD, which is known as "Plane 0" (except the so-called Plane-14 tags, which are not really part of Unicode). A plane is a 16-bit range with 2**16 code points. However, there do exist plans to fill Plane 1 with scripts that are of historic, cultural, hobbyist and scientific interest (Hieroglyphics, Tengwar, Klingon, Blissymbolics, very exotic mathematical symbols, etc.). 
These are characters that are not urgently needed (there exists very little practice in encoding them on computers today if any at all), but it is nice to have them covered at least in theory as well. There are also plans to fill Plane 2 with thousands of historic CJK characters, to cover all characters found in some very comprehensive Asian dictionaries (again, also characters not used on computers today). So it is good to be prepared for more than UCS-2. UTF-16 is an extension of UCS-2 that uses a pair of 16-bit characters from a high and low surrogate area in UCS-2 to represent characters in planes 1 to 16 (U+010000 to U+10FFFF). UTF-16 can cover a bit over 1 million characters. It has been agreed between the Unicode consortium and ISO that they will never standardize a character with a code > U+10FFFF. So UTF-16 will be able to encode everything that will come in the future. A code range of 1 million is commonly considered to be more than good enough. Plenty of room for contact with extraterrestrials ... ;-) > I need to reserve space for each character, thus I > would like to know if 4 bytes is enough. 4-bytes per character is *more* than enough per character. UCS is just a 31-bit character set after all, so a signed 32-bit int (that is what glibc's wchar_t is) will more than do. Even 3 bytes will last forever and 2-bytes would be OK so far if you are prepared to handle pairs of UTF-16 surrogate values as single characters. > The UTF-8 encoding might be longer, of course. No. Better have another careful look at how UTF-8 really works: http://www.cl.cam.ac.uk/~mgk25/unicode.html UTF-8 has no way of encoding characters more than 31-bit long. A 32-bit integer will be able to hold the value of any legal UTF-8 sequence. XFree86 xterm is restricted to the UCS-2 range by the way, as is the X11 font mechanism. My advice would be to try and keep UTF-8 as the in-memory encoding. Do not convert to a fixed-width encoding unless really necessary for table-lookups, etc. 
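A minimal sketch of why UTF-8 is workable as an in-memory encoding, assuming Python and a hypothetical helper named `char_start` (neither appears in the thread): continuation bytes always have the bit pattern 10xxxxxx, so the start of the character containing any byte offset can be found by stepping backwards, without decoding the whole buffer.

```python
def char_start(buf: bytes, i: int) -> int:
    """Return the offset of the first byte of the UTF-8 sequence that
    contains byte offset i (hypothetical helper, for illustration only)."""
    # Continuation bytes look like 0b10xxxxxx; lead bytes and ASCII do not.
    while i > 0 and (buf[i] & 0xC0) == 0x80:
        i -= 1
    return i


# "a" is 1 byte, "é" is 2 bytes, "€" is 3 bytes in UTF-8.
sample = "a\u00e9\u20ac".encode("utf-8")
```

For example, `char_start(sample, 5)` lands on offset 3, the lead byte of the three-byte euro sign, while offsets that already point at a lead byte or an ASCII byte are returned unchanged.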
The self-synchronizing properties of UTF-8 make this very feasible. You can even preserve illegal UTF-8 sequences this way such that you lose no information if you load and save a binary file accidentally in UTF-8 mode. Mined98 is doing this nicely, as are a number of other existing UTF-8 editors. The plan for emacs is also to keep UTF-8 as the in-memory representation, in the interest of binary transparency. > Are you saying that it's not possible to detect UTF-8 encoding reliably? > Well, that's something that needs to be worked on! LC_CTYPE is the best detector you will ever get. It allows us so far to distinguish ISO_8859-15 from JISX0208, and I see no reason why it should suddenly fail on UTF-8. Everything else is just a heuristic. The self-synchronizing properties of UTF-8 make it more feasible to write a > 95% heuristic for UTF-8 than for other encodings, but you should be careful to apply such autodetection ONLY when the user didn't tell you explicitly via LC_CTYPE what the intended encoding is. The user must be able to reliably enforce interpretation of the file as UTF-8 for mission-critical applications, where the remaining risk of autodetection or tagging is not acceptable. I assure you, that UTF-8 files will not be tagged in any special way on POSIX systems. Just like ASCII and ISO 646-Swedish files were never tagged in any special way. Typed files are simply not the Unix way, for very good reasons. There will be no BOM or ESC 2022 announcer, and if there is one occasionally, it will either cause trouble or be lost after the next cut & paste, grep, tail, conversion, etc. This stuff is not robust in general. It might work in special restricted applications, but not more. The world is already full of UTF-8 files. Search for UTF-8 on dejanews, and you'll hit a hundred thousand postings, because Asian versions of Netscape and IE have been sending out UTF-8 files for years. > > We just want a toggle, between Mess and UTF-8. 
> > And we need to help the people that have to toggle all the time. Exactly, by offering them an option to leave the error-prone toggling and character-set guessing domain. > Switching to a single encoding is not an option for most people at this time, > since many files are Latin-1 encoded. The files are really not the problem. Files are very easily converted without loss of information. The problem are applications that can structurally not yet deal with files that can contain a million different characters. Most applications believe that there exist not more than 256 characters. That is the real problem. Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 21-Sep-99 14:01:11-GMT,3549;000000000005 Return-Path: Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA19778 for ; Tue, 21 Sep 1999 10:01:09 -0400 (EDT) Received: by humbolt.nl.linux.org id ; Tue, 21 Sep 1999 16:00:38 +0200 Received: from heaton.cl.cam.ac.uk ([128.232.32.11]:6151 "EHLO heaton.cl.cam.ac.uk" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Tue, 21 Sep 1999 16:00:12 +0200 Received: from trillium.cl.cam.ac.uk ([128.232.8.5] helo=cl.cam.ac.uk ident=mgk25) by heaton.cl.cam.ac.uk with esmtp (Exim 3.01 #1) id 11TQSl-0006JZ-00 for linux-utf8@humbolt.geo.uu.nl; Tue, 21 Sep 1999 15:00:07 +0100 X-Mailer: exmh version 2.0.2+CL 2/24/98 To: linux-utf8@humbolt.geo.uu.nl Subject: Re: Character set tagging considered harmful In-reply-to: Your message of "Tue, 21 Sep 1999 15:08:51 +0200." 
<199909211308.PAA26748@mail.sietec.de> X-URL: http://www.cl.cam.ac.uk/~mgk25/ Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 21 Sep 1999 15:00:04 +0100 From: Markus Kuhn Message-Id: X-Orcpt: rfc822;linux-utf8@humbolt.geo.uu.nl Sender: owner-linux-utf8@humbolt.geo.uu.nl Precedence: bulk Reply-To: linux-utf8@humbolt.geo.uu.nl towo@computer.org wrote on 1999-09-21 13:08 UTC: > I think there is some confusion here. Auto-detection applies to text, > i.e. file contents, while I would assume LC_CTYPE to describe the > environment that we're running in, especially the terminal mode. > This doesn't need to be the same and if LC_CTYPE is used to define one > thing it should perhaps rather not be used to derive the other information > which is usually quite unrelated. I really think, they are the same, they were intended to be the same and in my opinion they really should be the same. I like cat file.txt to continue to work in our notion of plaintext also in the future, therefore we should always aim towards keeping the content of plain-text something that can be sent directly byte-for-byte to the terminal. Much of the current simplicity, elegance and power of the Unix plaintext world fundamentally depends on this. It won't be Unix any more if we start to introduce plaintext file types. (By the way, we had this exact same discussion already back in 1995 on comp.std.internat, should still be in dejanews.) How far do you want to implement autodetection? Do you want "ls" to autodetect, whether a filename is in Latin-2, Latin-15, JIS X0208 or UTF-8 and convert automatically accordingly? Character set autodetection, if it really became common-place under Unix, would mean that practically every application would have to be equipped with a full-fledged any-to-any conversion package. Horrible prospect. No, I really really think that separating the plain-text and terminal encoding is a rather dangerous route, that I most certainly will not support in any way. 
All this also has nothing to do with UTF-8, which is just yet another encoding and should be treated just as such. The entire autodetection or tagging business sounds to me very much like reinventing ISO 2022 with all its consequences. Markus -- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 28-Sep-99 14:15:08-GMT,3275;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id KAA15969 for ; Tue, 28 Sep 1999 10:15:07 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA26042 for ; Tue, 28 Sep 1999 10:15:06 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id HAA25768 ; Tue, 28 Sep 1999 07:12:05 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id HAA03136; Tue, 28 Sep 1999 07:10:29 -0700 (PDT) Message-Id: <199909281410.HAA03136@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 9872 (1999-09-28 14:10:15 GMT) From: Frank da Cruz To: "Unicode List" Cc: unicode@unicode.org Date: Tue, 28 Sep 1999 07:10:14 -0700 (PDT) Subject: Re: A basic question on encoding Latin characters > Um, at that time the normalization hadn't been done. So at that time there > weren't _technical_ reasons for drawing a line at the normalization > border. The line was drawn after that time. It could have been > before. But it has been drawn and there had better be really good reasons > offered if we are not to respect it. > In interactive telecommunications, we have the following situation: 1. Host sends "login:" (or any other prompt). 2. User is supposed to type her ID (or any other response). 
When using Unicode, the terminal emulator may not print the final character of the prompt because it doesn't know yet whether any combining characters will follow. So the user doesn't know whether the host is ready to receive a response and therefore should not reply since in some cases (e.g. at the UNIX "Password:" prompt) an early response is discarded. If the process is being executed by a script, the script sits and waits; "waitfor 'login:'" will not succeed, since it can not be known whether 'login:' has arrived until the next base character after ':' comes, but no such character is coming (I realize it is silly to expect a colon to have an accent but those are the rules -- and not all prompts end with colon). There is no escape from this situation other than introduction of a "higher level protocol" to signal "ok, I'm finished transmitting, now it's your turn", just like in the old half-duplex days. This is the kind of reason that telecommunications-oriented applications seem to be steering away from the Normalization Form D model, however appropriate it might be in other areas, and embracing Normalization Form C (ISO 10646 Level 1) and, by extension, precomposed characters, as we have seen in Plan 9 and now, it seems, Linux. I don't think this indicates recalcitrance or West European bias in UNIX culture as much as a desire to preserve telecommunications and the terminal/host model as a viable interface between human and machine in the Unicode age, as it has been since the beginning of the computer age. I also think it's no accident that Unicode is best supported on those platforms that have eschewed the terminal/host access model. 
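To make the Form D versus Form C distinction concrete, here is a minimal sketch using the `unicodedata` module from Python's standard library (Python and the helper name `code_points` are assumptions of this example, not something the thread uses). Under Form D an accented letter arrives as a base character followed by a combining mark; under Form C it arrives as one precomposed code point, but only where such a code point exists.

```python
import unicodedata


def code_points(text):
    """List the code points of text in U+XXXX notation (illustrative helper)."""
    return [f"U+{ord(ch):04X}" for ch in text]


# LATIN SMALL LETTER E WITH ACUTE, decomposed and then recomposed.
nfd = unicodedata.normalize("NFD", "\u00e9")
nfc = unicodedata.normalize("NFC", nfd)

# q + COMBINING TILDE has no precomposed form, so NFC leaves it as a sequence.
q_tilde = unicodedata.normalize("NFC", "q\u0303")
```

In this sketch `code_points(nfd)` is a two-element sequence (U+0065, U+0301) and `code_points(nfc)` is the single code point U+00E9, while `q_tilde` still contains the combining tilde after NFC normalization.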
- Frank 28-Sep-99 18:24:11-GMT,7794;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id OAA03161 for ; Tue, 28 Sep 1999 14:24:09 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA19282 for ; Tue, 28 Sep 1999 14:24:08 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id LAA48012 ; Tue, 28 Sep 1999 11:22:08 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id LAA06555; Tue, 28 Sep 1999 11:18:45 -0700 (PDT) Message-Id: <199909281818.LAA06555@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 9885 (1999-09-28 18:18:30 GMT) From: Frank da Cruz To: "Unicode List" Date: Tue, 28 Sep 1999 11:18:29 -0700 (PDT) Subject: RE: A basic question on encoding Latin characters Marco.Cimarosti@icl.com wrote: > I am not sure if I understood very well, but seems to me that you are > basing your observation on the very peculiar behavior of your application. > Not peculiar -- this is how open and shared access to computers has worked since the 1960s: the interactive dialog model, prompt and command. > I understand that your hypothetical terminal software is trying to render > Unicode text as soon as it arrives, CHARACTER BY CHARACTER. > That's how terminals work. If the host sends a character, the user should see it on the screen immediately. As any maker of terminal emulation software can tell you, users are surprisingly intolerant of delays, even very small ones. The acid test is echoing in the full-duplex environment. I press the 'A' key, the code for 'A' goes to the host and then comes back to be displayed on the screen as an 'A'. This must be instantaneous. Or, to put it another way, a terminal is not a Web browser. 
> But there is no need of exotic alphabets or combining accents to screw up > your design: sticking to good old ASCII, what would your modem script do > if the prompt "login:" was translated in the Italian "codice d'accesso:"? > It would wait, I think, until the Italian government changes the > constitution to drop Italian and adopt English as the official language. > True, but the fact remains that a very large number of scripting applications exist and are used every day in the real world, and they are used in "mission-critical" applications too. It is "a way of doing business" in a world where platforms such as UNIX, VMS, VOS, VM/CMS, MVS/TSO, and OS/400 still exist and may be accessed openly. Modems themselves are controlled almost exclusively by scripts (how do you think your PPP dialer works?). The business of Unicode is not to promote certain styles of computing and obliterate others; it is to provide a universal character set that can be used in any application. > If such a medieval design cannot be avoided because of technical > constraints, it would be wiser, in my mind to do one of the following: > > - support Unicode only after login; > Login is just one example. A terminal session with a UNIX (VMS, VOS, etc) host is an arbitrary series of prompts and commands. > - impose that the prompt and the answer be on separate lines: in this > case, the line terminator character(s) would act as the "higher level > protocol" to signal "ok, I'm finished transmitting, now it's your turn" > that you suggested; > A proposal to change all of the world's hosts is not practical. Even if this were done, it would break all the world's scripts :-) > - re-engineer entirely the login and terminal software using more > up-to-date techniques. > Of course many people believe the answer is to modernize everything. But today this means replacement of simple, proven, and open means of access with proprietary and unstable ones. 
François Yergeau wrote: > There is no good reason for the terminal not to print the final character > when received. If a combining character comes later, the terminal simply > has to redisplay the combination over the previous glyph. This is what our > Arabic terminals and emulators have been doing for years (e.g. receive an > Arabic letter and display it in final form; receive another letter, > redisplay the previous one in middle form and the new one in final form). > Yes, we discussed this here before; there are complications with line wrapping, scrolling regions, etc, but to overcome them is a "mere matter of programming". > >There is no escape from this situation other than introduction of a "higher > >level protocol" to signal "ok, I'm finished transmitting, now it's your > >turn", just like in the old half-duplex days. > > Well, it seems to me that the login protocol *is* a higher level protocol > w/r Unicode. > Again, the login process is only one element of a session consisting of an arbitrary sequence of prompts and responses. > If the protocol says that "login:" is to be acted upon, I don't see why > the terminal-side script couldn't act on it without waiting for eventual > combining characters that won't be coming. There's no use in waiting for > the next base character, the triggering string has been received. > But then is the application "Unicode compliant"? But more to the point (bearing in mind that we are speaking not just of logging in, but any prompt and response), if we ignore the possibility that combining characters might follow the trigger string, then we can have "false positives", or for that matter also false negatives. "Mark E. Davis" wrote: > We should make it very clear that Normalization Form C does *not* > eliminate combining characters. It does precompose them where possible, > but for many scripts and characters it is not possible, or desirable. > Yes, this is spelled out very clearly in the technical report. 
In this way Unicode Normalization Form C differs from ISO 10646 Implementation Level 1, in which "a CC element shall not contain coded representations of combining characters". I think this more accurately represents the position taken by the authors of Plan 9 and (correct me if I'm wrong) those working on the Linux console and UTF-8 xterm. > Exactly the same problem that you discuss occurs with any script that > requires shaping. When I type an Arabic character, the previous character > needs to change shape. What the terminal needs to do is replace the glyph > on the screen with a different form. As I recall from my terminal days, > the controls for doing this are available. The same technique can be used > for accents. Type an A, see an A. Then type an umlaut, and the host picks > it up, decides that it needs a composed presentation form, and replaces > the A by Ä on the screen. Of course, the display on the terminal still > depends on the "font" that it has, which may or may not allow dynamic > composition, but fundamentally I don't see the problem. > The real problem comes in scripting. Scripts are a method of forcing intrinsically noncooperating processes to cooperate. Suppose a script is looking for "ABC", and ABC comes. If the next character will be a combining cedilla, this would not be a match. But if no more characters are coming (e.g. until there is some kind of response) then it would be, but how can the script know? The best we can do is set a timeout period that is long enough to allow for the longest possible intercharacter spacing on the busiest day of the Internet and hope we haven't guessed wrong. And even if we haven't, this technique would cause every match to consume the entire timeout interval. 
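The matching dilemma described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular scripting tool works: the function name classify and its three return values are hypothetical, and "combining character" is approximated as any character whose Unicode general category starts with "M".

```python
import unicodedata


def classify(buffer: str, trigger: str) -> str:
    """Classify the received text against a trigger string.

    Returns "match" when the trigger is followed by a non-combining
    character, "undecided" when the trigger is the last thing received
    (a combining mark might still arrive), and "no match" otherwise.
    The name and the return values are hypothetical.
    """
    start = 0
    while True:
        idx = buffer.find(trigger, start)
        if idx < 0:
            return "no match"
        end = idx + len(trigger)
        if end == len(buffer):
            # Nothing has arrived after the trigger yet, so the script
            # cannot tell whether its last character is complete.
            return "undecided"
        if not unicodedata.category(buffer[end]).startswith("M"):
            return "match"
        # A combining mark modified the trigger's last character;
        # keep looking for a later occurrence.
        start = idx + 1


print(classify("ABC", "ABC"))        # trigger is last: undecided
print(classify("ABC\u0327", "ABC"))  # combining cedilla follows
print(classify("ABC ", "ABC"))       # a base character follows
```

The "undecided" case is the one that can only be resolved by waiting: either another character arrives, or a timeout expires.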
- Frank 28-Sep-99 19:12:24-GMT,6409;000000000011 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id PAA13696 for ; Tue, 28 Sep 1999 15:12:22 -0400 (EDT) Received: from halon.sybase.com (halon.sybase.com [192.138.151.33]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA06997 for ; Tue, 28 Sep 1999 15:12:19 -0400 (EDT) Received: from smtp1.sybase.com (sybgate.sybase.com [130.214.220.35]) by halon.sybase.com with ESMTP id MAA25099; Tue, 28 Sep 1999 12:11:20 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [130.214.140.3]) by smtp1.sybase.com with SMTP id MAA19920; Tue, 28 Sep 1999 12:12:14 -0700 (PDT) Received: by birdie.sybase.com (5.x/SMI-SVR4/SybEC3.5) id AA12713; Tue, 28 Sep 1999 12:12:13 -0700 Date: Tue, 28 Sep 1999 12:12:13 -0700 From: kenw@sybase.com (Kenneth Whistler) Message-Id: <9909281912.AA12713@birdie.sybase.com> To: fdc@watsun.cc.columbia.edu Subject: RE: A basic question on encoding Latin characters Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: ISO-8859-1 Frank continued this discussion: > > > If the protocol says that "login:" is to be acted upon, I don't see why > > the terminal-side script couldn't act on it without waiting for eventual > > combining characters that won't be coming. There's no use in waiting for > > the next base character, the triggering string has been received. > > > But then is the application "Unicode compliant"? Of course it is. If the application is waiting for "login:", it is not waiting for "login:" with an acute accent on the colon. It is interpreting what it is supposed to, given the characters encoded at the code values they have. If the communicator then sends a combining acute accent, that is a *protocol* error, not a Unicode compliance problem. 
> But more to the point > (bearing in mind that we are speaking not just of logging in, but any prompt > and response), if we ignore the possibility that combining characters might > follow the trigger string, then we can have "false positives", or for that > matter also false negatives. Once again, this would be a *protocol* error. If the communication protocol is waiting for "xxxxá", then it should act when it receives the final "á" as a unit, or if it has received an "a", then it should act when it receives the final combining acute accent. And ordinarily the communication protocol should specify a normalized form, so it doesn't have to deal with alternative forms as equivalent for these purposes. And many of these call/response protocols wait for a control code as the trigger anyway, right? Very often the EOL. Otherwise they are rather badly behaved, for interactive work anyway, since a host would then always be sending bad typists irrelevant error messages without letting them backspace and correct their errors before committing to send a chunk for interpretation as a response/command/whatever. > > "Mark E. Davis" wrote: > > We should make it very clear that Normalization Form C does *not* > > eliminate combining characters. It does precompose them where possible, > > but for many scripts and characters it is not possible, or desirable. > > > Yes, this is spelled out very clearly in the technical report. In this way > Unicode Normalization Form C differs from ISO 10646 Implementation Level 1, > in which "a CC element shall not contain coded representations of combining > characters". I think this more accurately represents the position taken by > the authors of Plan 9 and (correct me if I'm wrong) those working on the > Linux console and UTF-8 xterm. And as the Unicoders have continually pointed out, Implementation Level 1 is a crutch for brain-damaged implementations that cannot handle anything complex. 
It rules out support for all of the complex scripts of the world. It does, however, do a reasonable job of covering Europe and East Asia, aside from some minority languages. Hmmm. Sound like a recipe for maintaining the computing access status quo to anyone? > > > Exactly the same problem that you discuss occurs with any script that > > requires shaping. When I type an Arabic character, the previous character > > needs to change shape. What the terminal needs to do is replace the glyph > > on the screen with a different form. As I recall from my terminal days, > > the controls for doing this are available. The same technique can be used > > for accents. Type an A, see an A. Then type an umlaut, and the host picks > > it up, decides that it needs a composed presentation form, and replaces > > the A by Ä on the screen. Of course, the display on the terminal still > > depends on the "font" that it has, which may or may not allow dynamic > > composition, but fundamentally I don't see the problem. > > > The real problem comes in scripting. Scripts are a method of forcing > intrinsically noncooperating processes to cooperate. Suppose a script is > looking for "ABC", and ABC comes. If the next character will be a combining > cedilla, this would not be a match. But if no more characters are coming > (e.g. until there is some kind of response) then it would be, but how can > the script know? By the EOL or other end-of-content marking built into the protocol. How many of these script protocols can you point to that really are sitting poised hair-triggered forever waiting for the right (character) byte to come down the wire? Or if they are, isn't the triggering character usually a control delimiter of some sort? If you are worried about false positives for some string followed by a combining character, why not that same string followed by *ANY* character? 
You would have to guarantee that no long response has any prefix that could be misinterpreted (before the response was completely received) as a shorter response. > The best we can do is set a timeout period that is long > enough to allow for the longest possible intercharacter spacing on the > busiest day of the Internet and hope we haven't guessed wrong. Why isn't this exactly the same problem for any prefix of any response, even without combining characters? > And even if > we haven't, this technique would cause every match to consume the entire > timeout interval. Sounds like a purty flimsy strawman to me. --Ken > > - Frank > 28-Sep-99 20:24:53-GMT,4492;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id QAA03196 for ; Tue, 28 Sep 1999 16:24:52 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id QAA24140 for ; Tue, 28 Sep 1999 16:24:49 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id NAA68116 ; Tue, 28 Sep 1999 13:20:25 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id NAA08207; Tue, 28 Sep 1999 13:16:17 -0700 (PDT) Message-Id: <199909282016.NAA08207@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 9892 (1999-09-28 20:16:05 GMT) From: Frank da Cruz To: "Unicode List" Cc: unicode@unicode.org Date: Tue, 28 Sep 1999 13:16:04 -0700 (PDT) Subject: RE: A basic question on encoding Latin characters Ken wrote: > Sounds like a purty flimsy strawman to me. > It might well be. > > But if no more characters are coming > > (e.g. until there is some kind of response) then it would be [a match], > > but how can the script know? > > By the EOL or other end-of-content marking built into the protocol. > But there is no protocol. Most prompts do not end with an EOL. 
A script is by nature an attempt to codify human behavior in a stimulus-response situation. The stimuli are designed for people, not protocols, and in any case are usually not changeable (maybe you can change them, but as soon as you do be prepared for screams of agony to go up from the masses who, unbeknownst to you, depend for their livelihood on the prompts not changing). Thus the script must adapt to whatever is on the other end of the connection. If the prompt is "login:" with no EOL, we can't force an EOL to come; ditto for other dialog situations in which the prompt is more likely to end with some character that might reasonably be followed by a combining character (or not). > ... ordinarily the communication > protocol should specify a normalized form, so it doesn't have to deal > with alternative forms as equivalent for these purposes. > I believe this is what telecommunications-oriented platforms and/or applications are doing when they avoid the issue of combining forms by saying they don't support them. > ... as the Unicoders have continually pointed out, Implementation Level 1 > is a crutch for brain-damaged implementations that cannot handle anything > complex. It rules out support for all of the complex scripts of the world. > Meaning Indic, Arabic, etc... Of course this is true, and yet Level 1 exists and developers will use it. We have in UTF-8 a vigorous attempt to embrace the "legacy" terminal/host world and existing applications to promote easy migration from ASCII to Unicode (and somewhat less easy from 8-bit character sets). But these very platforms are accessed in a simple and open manner which does not mesh well with complex scripts. We might wish to wipe away the legacy of fifty years of computing and start over (in more ways than one!) but I fear there will never be a replacement for the simple and open terminal/host access method that will support complex scripts and still be as open and vendor-neutral as the terminal/host model. 
We are suffering already from the lack of open (e.g. Telnet) access to Macintosh and Windows platforms. I'm not saying I know what to do, only that "throw away your medieval tools and enter the modern age" is as likely to result in a new Tower of Babel as it is to promote universal communication. But this time the Babel is not in character sets but in the profusion of ever-changing and incompatible vendor- and application-specific protocols and data formats. Perhaps it's all a tempest in a teapot. For some time to come we will have all possible combinations of "legacy" and Unicode-aware hosts and clients, and we have to allow for each combination. Different problems will come up in each configuration, and we'll see how to deal with them. My hope is that it will not be by inventing a neverending stream of Three-Letter Acronyms to "comply" with, on top of Unicode itself, just to get text from point A to point B. If you thought you hated ISO 2022, just think of the standards nightmare that will grow out of that! 
- Frank 30-Sep-99 15:24:27-GMT,2366;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id LAA18471 for ; Thu, 30 Sep 1999 11:24:26 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA03290 for ; Thu, 30 Sep 1999 11:24:25 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id IAA59704 ; Thu, 30 Sep 1999 08:20:30 -0700 Received: (from agent@localhost) by unicode.org (8.9.3/8.9.3) id IAA28518; Thu, 30 Sep 1999 08:17:53 -0700 (PDT) Message-Id: <199909301517.IAA28518@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 9977 (1999-09-30 15:17:17 GMT) From: Frank da Cruz To: "Unicode List" Cc: unicode@unicode.org Date: Thu, 30 Sep 1999 08:17:12 -0700 (PDT) Subject: RE: A basic question on encoding Latin characters Karlsson Kent - keka wrote: > Frank wrote: > > or word processor (etc), is the fixed-width aspect. I can send you > > email (as I am doing now, with my medieval text-based email client) > > with every expectation that it will look the same to you as it does > > to me, even if it includes tables, source code, or anything else > > For heavens sake don't assume that! My default view of emails is via > a proportional font. And so it is for many others too. And even if > I do something to view a message via a "fixed width" font, the tab > positions are not where you had them. And I'm not too inclined to > fiddle with the tab positions, unless it is a VERY important e-mail. > This is a topic that was discussed at great length in May-July 1997 and then again in July-August 1999. 
The upshot is, I need to write a draft Unicode technical report to clarify what is meant by "plain text", and to propose guidelines for vendor- and application-independent self-contained preformatted Unicode plain text that can endure into the distant future and remain useful even as fads and fancies change. Anybody who would like to review the discussion so far should be able to find it in the Unicode mail archive: ftp://ftp.unicode.org/Public/MailArchive/ - Frank 20-Oct-99 5:06:39-GMT,1826;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id BAA04268 for ; Wed, 20 Oct 1999 01:06:38 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id BAA00311 for ; Wed, 20 Oct 1999 01:06:37 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id WAA54072 ; Tue, 19 Oct 1999 22:01:56 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id VAA05826; Tue, 19 Oct 1999 21:55:58 -0700 (PDT) Message-Id: <199910200455.VAA05826@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline X-UML-Sequence: 10336 (1999-10-20 04:55:46 GMT) From: Doug Ewell To: "Unicode List" Date: Tue, 19 Oct 1999 21:55:45 -0700 (PDT) Subject: Re: verification: RE: LATIN CAPITAL LETTER REVERSED K? Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by mailhub2.cc.columbia.edu id BAA04268 Gregg Reynolds wrote: > In a semi-serious vein: wouldn't box-score notation be a suitable > candidate for encoding? It's pretty standard, has a specific syntax, > and is spoken by millions. It's definitely a higher-level protocol. 
It's just like music notation: two-dimensional layout, uses symbols not otherwise found in plain text, and relies heavily on the relative positioning of these symbols. A computer encoding of baseball scoring notation would be cool, but it's not within the scope of Unicode. -Doug Ewell Placentia, California 20-Oct-99 7:36:40-GMT,3589;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id DAA25352 for ; Wed, 20 Oct 1999 03:36:40 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id DAA13489 for ; Wed, 20 Oct 1999 03:36:39 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id AAA21852 ; Wed, 20 Oct 1999 00:34:37 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id AAA06303; Wed, 20 Oct 1999 00:32:05 -0700 (PDT) Message-Id: <199910200732.AAA06303@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-UML-Sequence: 3934 (1999-10-20 07:31:27 GMT) From: peter_constable@sil.org Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 00:31:25 -0700 (PDT) Subject: Re: Regarding the proposal for Mathmatical alphabets Content-Transfer-Encoding: 7bit >>Unicode does not, and has never maintained that with no other external >information all text is or should be legible. >Really? What is plain text then? It is a fallacy to think that plain text *by itself* is ever fully semantically specified. If you receive a plain text document that consists of I seem to be having problems with my lifestyle. You might assume you know what the intended meaning is, i.e. that it is an English sentence, but you may be wrong. E.g., it could be a curse in the Blahurg language. 
The likelihood is that you'll be safe with your assumption in this case, but the possibility does exist that you'll be wrong. This is a rather contrived example, but it need not be: chat What is the meaning of this text? Is it the English word with a meaning related to 'discuss', is it a French word with the meaning 'cat', or is it something else? Plain text is defined, essentially, as a string of unadorned, abstract characters. There is nothing in the definition of plain text that says anything about the interpretation of the text being unambiguous. In the general case, plain text requires several items of additional information in order for it to be correctly interpreted, including at least the following: - encoding - character set - language You might say that the presence of xFE xFF in the first two bytes is sufficient to identify the encoding and character set, but it is not strictly sufficient. It's possible that this is a non-text binary file that happened to start with these two bytes. Likewise, language cannot in *any* case be determined from plain text with *100%* certainty. Now, it may be that in a lot of situations, one can in practice manage to correctly determine the intended interpretation of plain text without this additional information; e.g. if you get a file from a colleague, they probably don't have to identify this information for you explicitly for you to know what they're meaning to convey in the plain text file. But as a general principle, Ken's point is entirely correct. 
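The byte order mark point above can be illustrated with a short Python sketch. It is a hedged illustration only: the function name guess_encoding_from_bom is hypothetical, and the return value is deliberately called a guess because, as the message says, a non-text binary file can begin with the same two bytes.

```python
import codecs
from typing import Optional


def guess_encoding_from_bom(data: bytes) -> Optional[str]:
    """Return a *guess* at the encoding based on the first two bytes.

    A leading FE FF suggests big-endian UTF-16 and FF FE suggests
    little-endian UTF-16, but neither proves that the data is text.
    The function name is hypothetical.
    """
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be"
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le"
    return None  # no mark: the encoding must come from elsewhere


text_file = codecs.BOM_UTF16_BE + "chat".encode("utf-16-be")
binary_file = b"\xfe\xff\x00\x01\x02\x03"  # not text, same first bytes

print(guess_encoding_from_bom(text_file))
print(guess_encoding_from_bom(binary_file))  # same guess, wrongly
print(guess_encoding_from_bom(b"chat"))
```

Even when the guess is right, it says nothing about language: the decoded string "chat" is the same whether it was meant as English or French.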
Peter 20-Oct-99 8:10:51-GMT,5781;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id EAA02003 for ; Wed, 20 Oct 1999 04:10:50 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id EAA15340 for ; Wed, 20 Oct 1999 04:10:50 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id BAA34392 ; Wed, 20 Oct 1999 01:08:16 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id BAA06466; Wed, 20 Oct 1999 01:05:47 -0700 (PDT) Message-Id: <199910200805.BAA06466@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit X-UML-Sequence: 3935 (1999-10-20 08:05:18 GMT) From: peter_constable@sil.org Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 01:05:17 -0700 (PDT) Subject: Re: Mathematical alphabets Content-Transfer-Encoding: 7bit >THEREFORE, I propose that all such math symbols be encoded in the BMP, not in plane 1, even if it is necessary to split them between blocks to squeeze them in. Can we find 2000-odd code points in BMP? Math symbols are *too important* to be relegated to plane 1. They are more important than CJKV extensions (only about 5000 han characters are in common use), and more important than the scripts mentioned above and more important than Mongolian or Tibetan. [I am not advocating removal of these scripts or han characters.] All of expanding human knowledge is developed by writing and discussing scientific papers, and this is done internationally, and as more discoveries are made, it is urgent and necessary to encode these papers in a manner in which they can be indexed, accessed, and read on computer. 
>Do you know what a mess it would be to have to sign every math symbol character as a surrogate? Surrogates are OK for an occasional character, but not for numerous equations. That would significantly bloat text files of scientific papers with mathematical content. And think of what it would do towards the goal of putting mathematical texts on the Internet. Math symbols in the BMP? Please, no. I don't disagree that Math is important. (Having a B. Math degree, I'm also certainly a fan.) I'm just more concerned for living scripts that might be bumped as a result. The scripts you mentioned may not be important to a lot of people, but they're extremely important to those who use them. (By the way, just wanted to clarify that Thaana and Runic are *not at all* in the same category: Thaana is very much a living script: phone books, newspapers, etc.) And there are more like them that really should go into the precious few spots remaining in the BMP. Why these in the BMP rather than math? - Math requires specialized software to handle it regardless of where it's located. If the software is only interpreting the semantics of a math string, the fact that plane 1 characters or surrogate pairs are involved is not a problem. If the software is presenting math strings, then it needs specialised code for layout of formulas anyway, so it isn't a huge burden to add the need to handle plane 1 characters. (The developers/user community in question appear to already be in agreement to using plane 1.) - Other living scripts that are potential candidates for the BMP do not require specialized software; in general, it should be possible to work with these scripts using *any* app that is designed to support BMP text, including your favourite simple, Unicode-enabled, plain text editor. 
Putting a living script into plane 1 introduces the likelihood that that script will be placed at a significant disadvantage for some time since it can't be used by *any* Unicode-enabled tool, but must be used with a smaller set of software. This will have a far worse impact on those language communities affected than would the math community be affected by putting the math stuff in plane 1. As for text size, this is really a non-issue. In terms of storage, there is no real concern (I doubt anybody has a database with millions of records of math formulas), and in terms of transmission, it is very likely that the majority of text in a file containing math symbols will be prose text and not math formulas. The impact on the size of the math strings will likely be minimal. As for putting mathematical texts on the internet, the use of surrogates shouldn't be an issue; at least, any current concerns will be temporary limitations only. Eventually, browsers will all be able to handle surrogates, probably sooner than later. (In a browser, the main concern is the ability to render. An extension to the TrueType spec has already been made to allow for rendering of surrogates. I'll be somewhat surprised if the next version of IE doesn't have the ability to render surrogates.) If we could fit math on the BMP without any risk to living scripts for spoken languages, I'd be entirely for it. I'm not sure that's a safe assumption at this point, however. 
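The text-size argument above can be checked with a small Python sketch. It is illustrative only and rests on assumptions: the sample prose is made up, the function name surrogate_overhead is hypothetical, and U+1D400 (MATHEMATICAL BOLD CAPITAL A, a plane 1 code point assigned in a later version of Unicode than the one under discussion) stands in for any plane 1 math symbol.

```python
def surrogate_overhead(text: str) -> tuple:
    """Return (total UTF-16 bytes, bytes attributable to surrogates).

    Every character outside the BMP is encoded as a surrogate pair,
    which costs 2 bytes more than a BMP character would.
    The function name is hypothetical.
    """
    total = len(text.encode("utf-16-be"))  # no byte order mark added
    extra = 2 * sum(1 for ch in text if ord(ch) > 0xFFFF)
    return total, extra


# Made-up sample: mostly prose, with a handful of plane 1 symbols.
prose = "Let the matrix below act on every vector of the space. " * 20
symbols = "\U0001D400" * 12  # twelve plane 1 math letters
document = prose + symbols

total, extra = surrogate_overhead(document)
print(f"{total} bytes in UTF-16, {extra} of them due to surrogates")
print(f"overhead: {100 * extra / total:.1f}%")
```

How large the overhead is in practice depends entirely on the proportion of math symbols to prose, which is the point made in the message.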
Peter 20-Oct-99 12:02:00-GMT,7395;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id IAA04424 for ; Wed, 20 Oct 1999 08:01:57 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id IAA03464 for ; Wed, 20 Oct 1999 08:01:56 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id EAA41754 ; Wed, 20 Oct 1999 04:59:44 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id EAA07168; Wed, 20 Oct 1999 04:57:09 -0700 (PDT) Message-Id: <199910201157.EAA07168@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-UML-Sequence: 3942 (1999-10-20 11:56:40 GMT) From: Michael Everson Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 04:56:39 -0700 (PDT) Subject: RE: I give up - Ballot document L2/99-330 is now plain text Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by mailhub3.cc.columbia.edu id IAA04424 Ar 11:01 -0700 1999-10-19, scríobh Michel Suignard: >Michael, thanks as usual for your constructive comments. You're welcome. I am always happy to be of service. >The XML crap as you name it is there to provide round trip capability >for the product that created the document (Word 2000). Oh, what a good idea. Never mind the rest of us, who don't have that particular product, eh? You know, if Microsoft didn't ship stuff that screws up everything for others, I would be happy to sing its praises. I would love to be able to say "This is a really good thing, enhancing interoperability for everyone". Instead, every time Microsoft comes out with a new product or document format it seems like it chokes everyone who "lags behind" the "cutting edge". 
If you think this isn't true, try using a nice reliable platform with trusty software and then live in a world where people are (regularly) forced to (pay money to upgrade) to the cutting edge just to _read_ a simple document. >I inspected the source and I saw that it >was created to be read by IE4 and up level (that includes Netscape 4.x >version as well). It crashed Netscape 4.05 for the Macintosh, which I am using and have been using for some time. >I tried on both IE5 and Netscape 4.61 and both read the >info fine. If a document has to be read by down level browsers it is a good >idea to generate it with a lower level (like the version 3 of the browsers). >Doing this is an option offered by Word 2000. "Down level browsers"? That's rather arrogant. There was nothing wrong with my browser, until suddenly Microsoft's new product started spouting all kinds of junk into HTML documents which those browsers (which use standard HTML) were not designed to read. I am sure that Arnold, expert user as he is, will be comforted to know that he can produce documents formatted in a way acceptable to most of us. >If I could get in which context the problems occurred (which browser and >version numbers) maybe we can look at it, as stated today there isn't much I >can do. Netscape 4.05 for the Macintosh. And apparently other people had problems as well. >Also, there is a downloadable tool in the Microsoft web site that will save >Michael some time as it does already what he wants to do. It is at: >http://officeupdate.microsoft.com/2000/downloadDetails/Msohtmf2.htm >Basically the tools strip out all the information that is not used by the >browsers. I will look at this site, and am curious to know whether it supplies a Macintosh version. However, here is the logic. Arnold sends out a document. I don't know what it is about. I look at it. It crashes Netscape. I fire up this new tool to strip out the crap from it. 
Finally I can read the document and find out (as I have in this case) that it is irrelevant to me. What a colossal time-waster. >Finally the 'mso-bidi-font-family:Arial' property (not Ariel as mentioned by >Michael) ... I was always satisfied with "Helvetica" but note in passing that Ariel is a character in Shakespeare's Tempest. I didn't realize that "Arial" was something else. >... is there to indicate to use Arial for Bidi text in the context >used by the document author (ignored by browsers). This is the mechanism >used by Word to create font associations. The Arial font in its recent >updates support Bidi text, so I don't see what is unbelievable on that >syntax. The more you overthink the plumbing, the easier it is to stop up the drain. Here, dear colleagues, is an analysis of what Word 2000 is doing in this instance. Statistics are taken from ClarisWorks' word-count feature.

                         Plain text    HTML   Crap ratio
Number of characters:          1279   19772    6% content, 94% crap
Number of words:                192    1085   18% content, 82% crap
Number of lines:                 53     636    8% content, 92% crap
Number of paragraphs:            48     572    8% content, 92% crap
Number of pages:                  2      15   13% content, 87% crap

In order to be fair to Word 2000, and considering for the sake of argument _all_ markup to be crap, I set Arnold's document as an ordinary HTML document with PageSpinner in the way I normally do. Compare the results with the above.

                         Plain text    HTML   Crap ratio
Number of characters:          1279    1667   77% content, 23% crap
Number of words:                192     217   88% content, 12% crap
Number of lines:                 53      75   71% content, 29% crap
Number of paragraphs:            48      67   72% content, 28% crap
Number of pages:                  2       2   100% content, 0% crap

Gosh, Word 2000 does seem to add an awful lot of crap. Consider some mere mortals like my mother and my brother, as opposed to us highly-motivated experts. If we have these problems, what hope have the teeming millions? 
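The percentages in the two tables follow from the counts alone, so they can be recomputed with a short Python sketch. This is an editorial illustration, not part of the original analysis: the function name content_ratio is hypothetical, and the only data used are the plain-text and HTML counts quoted in the tables.

```python
def content_ratio(plain: int, html: int) -> tuple:
    """Return (percent content, percent markup), rounded to whole numbers.

    "Content" is the plain-text count as a share of the HTML count;
    the remainder is attributed to markup. The name is hypothetical.
    """
    content = round(100 * plain / html)
    return content, 100 - content


# (plain text count, Word 2000 HTML count, hand-set HTML count)
counts = {
    "characters": (1279, 19772, 1667),
    "words": (192, 1085, 217),
    "lines": (53, 636, 75),
    "paragraphs": (48, 572, 67),
    "pages": (2, 15, 2),
}

for unit, (plain, word_html, hand_html) in counts.items():
    w_content, w_markup = content_ratio(plain, word_html)
    h_content, h_markup = content_ratio(plain, hand_html)
    print(f"{unit:11} Word 2000: {w_content}% content, {w_markup}% markup; "
          f"hand-set: {h_content}% content, {h_markup}% markup")
```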
The "unbelievability" of Arial being described as "mso-bidi-font-family:Arial" is an indication of the total waste to be incurred on my poor mother's hard disk, on the bandwidth carrying the message, and on my poor brother's hard disk, when all that my mother was trying to send was a 192-word message about visiting at Christmas. What is unbelievable is the crudeness of the hack. This 26-character string "mso-bidi-font-family:Arial" is repeated _45_ times in Arnold's document! That's 1170 characters, a mere 6% of the 94% of the crap in that document. And bidirectionality is totally irrelevant to my mother and my brother, isn't it? It took me about an hour and a half to deal with this situation including writing this e-mail. I'm of a mind to send this to the Unicode list, but I suppose I won't. Am I Microsoft bashing? I don't think I am. I think they've done something they should quickly undo, and with an apology besides. -- Michael Everson * Everson Gunn Teoranta * http://www.indigo.ie/egt 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement) 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire 20-Oct-99 13:13:47-GMT,1371;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id JAA16826 for ; Wed, 20 Oct 1999 09:13:40 -0400 (EDT) Received: (from fdc@localhost) by watsun.cc.columbia.edu (8.8.5/8.8.5) id JAA10333; Wed, 20 Oct 1999 09:12:18 -0400 (EDT) Date: Wed, 20 Oct 99 9:12:18 EDT From: Frank da Cruz To: unicore@unicode.org Subject: RE: Regarding the proposal for Mathmatical alphabets In-Reply-To: Your message of Tue, 19 Oct 1999 19:45:34 -0700 (PDT) Message-ID: > Perhaps this is a legacy of too much emphasis on legibility of plain text. > The world has progressed. HTML mail really is better than plain text mail. 
> Yes, there are systems and mail handlers that can't cope but if you keep > singing the plain text mantra they will never cope and the users are the > losers. > So you think plain text should be replaced by HTML? And then all the software on earth should be changed to be "HTML-compliant"? Which HTML? How often must all the software in the world be changed to keep up with it? What happens when HTML itself is overtaken by some new buzzword? Plain text has value. It's like air or water. Take it away and you'll see. - Frank 
20-Oct-99 13:44:47-GMT,2016;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id JAA07848 for ; Wed, 20 Oct 1999 09:44:46 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id JAA18257 for ; Wed, 20 Oct 1999 09:44:44 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id GAA27356 ; Wed, 20 Oct 1999 06:43:04 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id GAA07651; Wed, 20 Oct 1999 06:40:12 -0700 (PDT) Message-Id: <199910201340.GAA07651@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-UML-Sequence: 3949 (1999-10-20 13:39:41 GMT) From: "Walt Daniels" Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 06:39:40 -0700 (PDT) Subject: RE: Regarding the proposal for Mathmatical alphabets Content-Transfer-Encoding: 7bit >>HTML mail really is better than plain text mail. >Why? Just to take a simple example, italic and bold carry important distinctions which make meaning clearer. Both are missing from plain text email unless you consider ***bold*** to be a substitute. Or more importantly to me HTML mail reformats paragraphs to fit the available screen width. Most mail programs just wrap plain text in stupid ways or force you to scroll horizontally. I think I can read HTML mail at least 50% faster. Don't forget that we got at least that much speedup of reading when most people finally gave up all uppercase plain text mail. 
I don't have any proof but I think I understand written material better if I can read it as fast as I think without being slowed down for some artificial reason. 20-Oct-99 14:24:48-GMT,2642;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id KAA01992 for ; Wed, 20 Oct 1999 10:24:45 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id KAA26170 for ; Wed, 20 Oct 1999 10:24:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id HAA27546 ; Wed, 20 Oct 1999 07:23:10 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id HAA07808; Wed, 20 Oct 1999 07:20:26 -0700 (PDT) Message-Id: <199910201420.HAA07808@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-UML-Sequence: 3952 (1999-10-20 14:19:57 GMT) From: Mark Leisher Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 07:19:55 -0700 (PDT) Subject: Plain text vs. rich text [was RE: Regarding ....] Content-Transfer-Encoding: 7bit Walt> Perhaps this is a legacy of too much emphasis on legibility of plain Walt> text. The world has progessed. HTML mail really is better than Walt> plain text mail. Yes, there are systems and mail handlers that Walt> can't cope but if you keep singing the plain text mantra they will Walt> never cope and the users are the losers. You've got to be kidding! Of the thousands of web sites I have browsed, I can count the number I find legible on one hand. It is so bad now that I don't even bother reading web pages; I just look for URL's. And don't blame the systems and mail handlers for not dealing with markup because most of them do so in one way or another. 
The point is that those of us working in plain text are doing so by choice, not necessity. And why do we choose plain text? In my case, I find that text with markup noticeably slows my reading speed and comprehension. In short, if the text isn't "designed" well enough, I can't read it and usually just delete it. ----------------------------------------------------------------------------- Mark Leisher Computing Research Lab The first virtue is to restrain the tongue; New Mexico State University he approaches nearest to the gods who knows Box 30001, Dept. 3CRL how to be silent, even though he is in the Las Cruces, NM 88003 right. -- Cato the Younger (95-46 B.C.E) 20-Oct-99 17:36:39-GMT,2126;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id NAA07494 for ; Wed, 20 Oct 1999 13:36:39 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id NAA14786 for ; Wed, 20 Oct 1999 13:36:38 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id KAA41418 ; Wed, 20 Oct 1999 10:33:48 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id KAA09331; Wed, 20 Oct 1999 10:31:22 -0700 (PDT) Message-Id: <199910201731.KAA09331@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Transfer-Encoding: 7bit X-UML-Sequence: 3966 (1999-10-20 17:30:26 GMT) From: Mark Leisher Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 10:30:24 -0700 (PDT) Subject: Re: Plain text vs. rich text [was RE: Regarding ....] Content-Transfer-Encoding: 7bit Michael> I can't quite tell why you aren't using some kind of browser that Michael> can read HTML, Mark. That's different from HTML e-mail, though. 
I am one of those people who find most of the "rich text" out there in presentation form too distracting to really be useful. Call it "attention deficit disorder," "aesthetic elitism," "Luddism," or whatever, I just find the majority of documents on the web as seen through a browser unpalatable to the point of being unreadable. ----------------------------------------------------------------------------- Mark Leisher Computing Research Lab The first virtue is to restrain the tongue; New Mexico State University he approaches nearest to the gods who knows Box 30001, Dept. 3CRL how to be silent, even though he is in the Las Cruces, NM 88003 right. -- Cato the Younger (95-46 B.C.E) 20-Oct-99 18:09:07-GMT,2689;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id OAA15595 for ; Wed, 20 Oct 1999 14:09:04 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA21864 for ; Wed, 20 Oct 1999 14:09:03 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id LAA32464 ; Wed, 20 Oct 1999 11:05:57 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id LAA09613; Wed, 20 Oct 1999 11:03:26 -0700 (PDT) Message-Id: <199910201803.LAA09613@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 3972 (1999-10-20 18:02:49 GMT) From: Frank da Cruz Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 11:02:44 -0700 (PDT) Subject: Re: I give up - Ballot document L2/99-330 is now plain text > Not that this view will be listened to by anyone. But I don't think > it's an unreasonable view. > I agree with it wholeheartedly and greatly enjoyed your rant. I appreciate it especially because I use a plain-text non-MIME email client, so when people send *me* html, I see html. 
When they send me anything encoded in base64, I see base64. Now I can understand why they might want to do this for pictures or a sound clip, but for a few lines of text??? Why would I use a plain-text, non-MIME email client in this day and age? Because it does everything I want it to, it's stable, it doesn't infect my computer with viruses, and I have the source code and can fix it if I have to. And because I'm a fast touch-typer -- in the time it takes me to reach for the mouse and hunt for some tiny widget to click on, I can whiz through 20 email messages, deleting the 15 of them that are junk-mail (which, by the way, is almost always filled with Michael's famous "crap" :-) With email, I have the same feeling about plain text as I do about handwriting in postal mail. If a hand-addressed letter arrives, it gets top priority. If a glossy multicolored item with glaring headlines arrives, it goes directly into the trash. I think the Universal Character Set is best understood -- and in fact should ONLY be understood -- with reference to plain text. Of course it CAN be used in all kinds of GUI Web browsers, office suites, etc, but it must not depend on notions that only apply to GUIs, because all such notions are ephemeral. Plain text is forever. 
- Frank 20-Oct-99 19:10:06-GMT,1472;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id PAA23222 for ; Wed, 20 Oct 1999 15:10:05 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA06834 for ; Wed, 20 Oct 1999 15:10:04 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id MAA38140 ; Wed, 20 Oct 1999 12:07:07 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id MAA09931; Wed, 20 Oct 1999 12:04:18 -0700 (PDT) Message-Id: <199910201904.MAA09931@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 3978 (1999-10-20 19:03:43 GMT) From: Rick McGowan Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 12:03:41 -0700 (PDT) Subject: Re: I give up - Ballot document L2/99-330 is now plain text Frank said... > all such notions are ephemeral. > Plain text is forever. Hmmm. Speaking stylistically, "ephemeral" has too many syllables to be good poetry. If you're going to make a hummable tune for the refrain, try: All such notions be conceit, but Plain Text is for-e-ver... Kind of an "Ein feste Burg" for the Unicodification Church... 
Rick 20-Oct-99 19:35:47-GMT,2097;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id PAA12066 for ; Wed, 20 Oct 1999 15:35:46 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA14032 for ; Wed, 20 Oct 1999 15:35:45 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id MAA47028 ; Wed, 20 Oct 1999 12:33:22 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id MAA10118; Wed, 20 Oct 1999 12:30:44 -0700 (PDT) Message-Id: <199910201930.MAA10118@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 3981 (1999-10-20 19:30:12 GMT) From: kenw@sybase.com (Kenneth Whistler) Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Cc: kenw@sybase.com Date: Wed, 20 Oct 1999 12:30:10 -0700 (PDT) Subject: The Unic Ode Rick, Hmm. In addition to being misguided about mathematical truth ;-), now you are offending my sense of metrics. > > Frank said... > > > all such notions are ephemeral. > > Plain text is forever. > > Hmmm. Speaking stylistically, "ephemeral" has too many syllables to be good > poetry. If you're going to make a hummable tune for the refrain, try: > > All such notions be conceit, > but Plain Text is for-e-ver... > > Kind of an "Ein feste Burg" for the Unicodification Church... Hummed to Ein feste Burg, these lines require the addition of phantom syllables for the extra notes: All such no(uh)tions be(ee) conceit, but Plai(ai)n Text is for(or)-e-ver.. Whereas, Frank's text scans as a perfect iambic pentameter blank verse couplet: But all such notions are ephemeral, - ' - ' - ' - ' - ' And plain text is forever -- Frank da Cruz. - ' - ' - ' - ' - ' A suitable contribution to "The Unic Ode". 
--Ken > > > Rick > 20-Oct-99 19:49:50-GMT,1803;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id PAA15547 for ; Wed, 20 Oct 1999 15:49:50 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id PAA16760 for ; Wed, 20 Oct 1999 15:49:49 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id MAA52622 ; Wed, 20 Oct 1999 12:47:31 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id MAA10629; Wed, 20 Oct 1999 12:44:45 -0700 (PDT) Message-Id: <199910201944.MAA10629@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 (Apple Message framework v123.1) Content-Type: text/plain; charset=utf-8 X-UML-Sequence: 3982 (1999-10-20 19:44:12 GMT) From: Rick McGowan Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Wed, 20 Oct 1999 12:44:10 -0700 (PDT) Subject: Re: The Unic Ode > Hummed to Ein feste Burg, these lines require the addition of > phantom syllables for the extra notes: Ah, sorry to put you off the track... I was merely *comparing* this couplet to the grandeur of "Ein feste Burg" as an expression of lofty, noble, and eternal thought. The tune I actually had in mind was an unkempt English folksong whose title I can't recall at the moment... I'm sure it'll come to me after a few pints of grog... Whereas, without the addition of "Frank da Cruz" himself into the couplet, the prior result was a line of iambic pentameter followed by a three-legged iambic pentametrical wannabe... 
Rick 21-Oct-99 10:32:44-GMT,2292;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id GAA06069 for ; Thu, 21 Oct 1999 06:32:44 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id GAA28045 for ; Thu, 21 Oct 1999 06:32:43 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id DAA24904 ; Thu, 21 Oct 1999 03:28:50 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id DAA14640; Thu, 21 Oct 1999 03:23:49 -0700 (PDT) Message-Id: <199910211023.DAA14640@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-UML-Sequence: 4002 (1999-10-21 10:20:23 GMT) From: Michael Everson Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Date: Thu, 21 Oct 1999 03:20:13 -0700 (PDT) Subject: Re: Regarding the proposal for Mathematical alphabets Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by mailhub1.cc.columbia.edu id GAA06069 Ar 17:08 -0700 1999-10-20, scríobh peter_constable@sil.org: > >We are not encoding mathematics. > > >We are encoding the characters needed to represent (most) > current typographical > practice in international mathematical text. > > If all this is for is typography and nothing more, then fonts > and styles are sufficient, and the arguments I've presented > regarding symbolic computation are completely irrelevant. But it's not. It seems simple: mathematicians want to represent their data in plain text, and have shown that they can do so with this solution, and that they can't with other solutions. The question is, should they be facilitated in this, or should they not? 
-- Michael Everson * Everson Gunn Teoranta * http://www.indigo.ie/egt 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland Guthán: +353 1 478 2597 ** Facsa: +353 1 478 2597 (by arrangement) 27 Páirc an Fhéithlinn; Baile an Bhóthair; Co. Átha Cliath; Éire 21-Oct-99 16:00:00-GMT,3638;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id LAA04222 for ; Thu, 21 Oct 1999 11:59:56 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA19683 for ; Thu, 21 Oct 1999 11:59:55 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id IAA36152 ; Thu, 21 Oct 1999 08:57:53 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id IAA17412; Thu, 21 Oct 1999 08:55:19 -0700 (PDT) Message-Id: <199910211555.IAA17412@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-UML-Sequence: 4009 (1999-10-21 15:54:42 GMT) From: "Lee Collins" Reply-To: unicore@unicode.org To: "Multiple Recipients of Unicore" Cc: on@ams.org, bnb@ams.org, "'unicore@unicode.org'" Date: Thu, 21 Oct 1999 08:54:41 -0700 (PDT) Subject: Re: Regarding the proposal for Mathmatical alphabets Content-Transfer-Encoding: 7bit Lee's response to Murray, >> We are willing to drive Japan and other countries to adopt non-Unicode >> solutions because we have forced a model of text on them that they find >> inconvenient to implement. Why are mathematicians more important? >> >I find both the premise and conclusion to be invalid. In Unicode, we are >providing an architecture to fully support Han characters, many in the BMP >and many more in higher planes. 
Han characters are clearly extremely >important in Unicode and are one of the major reasons for Unicode's >phenomenal success in the computing industry. It appears that you are a new-comer to the history of Han characters in Unicode. The point here is that Unicode is unwilling to treat Han characters the same way that some mathematicians (my mathematician friend who uses serif / sans-serif was actually trying to point out that the distinctions mathematicians want are open-ended) want their characters treated. I am not trying to belittle mathematical usage. A Japanese user would like to see the actual Japanese forms of Han characters in plain text and sometimes be able to mix in other Han languages. They would like to search on Japanese text and not hit Chinese han characters. They believe their characters to be as different from Chinese as some of the mathematicians believe that bold forms of the Roman alphabet are different from plain forms. Unicode does not provide an architecture to support Han characters the way that most users of Han characters want them supported. Despite the many attempts to revise its history, Unicode in fact was never meant to support plain or simple (no layout required) text. If it had been meant to support plain text, we would have started thinking in terms of a full 32 bit encoding since even in 1988 it was clear that 16 bits would not be sufficient for a plain-text model. The solutions offered to Han characters users were always couched in terms of some form of attributed text. Over the years, Unicode has given in to smaller, more aggressive constituencies who argued the need for handling their favorite set of characters in plain text and who managed to find a champion in the UTC. The result is that some areas might be capable of being handled in plain text, but not the largest and most controversial sub-range, the Han. 
Lee 22-Oct-99 1:13:33-GMT,7665;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id VAA15513 for ; Thu, 21 Oct 1999 21:13:32 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA17669 for ; Thu, 21 Oct 1999 21:13:31 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id SAA34430 ; Thu, 21 Oct 1999 18:08:26 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id SAA23182; Thu, 21 Oct 1999 18:04:03 -0700 (PDT) Message-Id: <199910220104.SAA23182@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-UML-Sequence: 10403 (1999-10-22 01:03:47 GMT) From: "Reynolds, Gregg" To: "Unicode List" Date: Thu, 21 Oct 1999 18:03:42 -0700 (PDT) Subject: RE: character semiotics (was RE: Mixed up priorities) Hi Andrea, > -----Original Message----- > From: A. Vine [mailto:avine@eng.sun.com] > Sent: Thursday, October 21, 1999 6:16 PM > "Reynolds, Gregg" wrote: > But, Gregg, what is meaning, after all? Is 'f' a semiotic > unit to you? Is > 'I'? Does 'I' hold a greater significance than 'f' because > it has another > meaning? Why is 'f' encoded and not 'if'? If 'if' were what > the average > English speaker would identify as a single letter, is it > sufficient to say it's > encoded as 'i' + 'f'? > Good questions, for which I think we can come up with workable (if not theologically and cosmically "true") answers. By workable I mean something along the lines of "a system of terms, definitions, etc. 
that serves to closely model the 'real' semiotics of written language (and thus answer to the expectations of literate communities) for the purposes of formal language design and software specification (thus answering to the expectations of software vendors)". I think it is possible to agree on a fairly precise set of formal definitions for modeling written language by drawing on linguistics, semiotics, mathematical logic, etc. Stuff that's been around for quite some time, actually. The first thing I would note is that our ordinary means of discourse is incredibly impoverished when it comes to talking about written language. We have quote marks and various typographic conventions and that's about it. Very flexible, but not very precise. So take for example your question "is 'f' a semiotic unit": you could be referring to the graphical thingee, or the phonological thingee, or a third thingee, or maybe even the sign-function thingee that ties some or all of these other thingees together. Etc. (I would answer yes in each case.) Great fun, actually. And I'm not picking on your usage; examine almost any piece of writing by any specialist that discusses grammatology, and you'll find it shot through with informal usage that relies on the reader to figure out which register to use in interpreting things like 'f' (or should that be "'f'"?). Watch how often it happens on this list. In any case, to answer your questions, I would start by positing that we need to model two things at least, one being the visual aspect of written language (graphemes, visual syntax, etc.) (i.e. the signifiers), and the other being the things denoted by such forms. I think Unicode works on the former, not the latter. I don't have a good term for the latter yet, but for now let's call them "grammemes". ("Cultural unit" is a tad too general and would cover just about everything. I guess we could go for the TLA: GCU = grammatical cultural unit. Wheee!) Grammemes are not phonemes. 
Research has shown that reading does not necessarily involve phonological activity in the brain. (If you're interested I can supply the references). The set of grammemes associated with a particular written language amounts to a theory of language. They represent the cognitive categories literates use to think about language, and don't necessarily follow modern linguistic analysis. "Grammeme" because the line between basic units such as "letters" in the traditional sense and higher-level grammatical concepts is blurry in some languages. Arabic provides several examples, ta marbuta being the most obvious. Either a medial ta form or a final dotted heh form may represent ta marbuta in Arabic, but the name "ta marbuta" itself denotes a complex packaging of rules relating phonology, morphology, and syntax. It is not considered an element of the traditional Arabic alphabet, but it is definitely part of basic Arabic orthography and literacy - one should be able to search on it, for example. So it's a grammeme. I seem to have slipped into dissertation mode again. Sorry 'bout that. To get back to your questions, I would say that by 'f' we designate a pairing of graphic form and grammeme - a sign-function, in semiotic terms. 'I' is another; the fact that it can enter into other semiotic (lexical) relations can be disregarded, since our guide is the set of 'letter' grammemes associated with (pick your language.) 'if' is not encoded because the community of literates doesn't think of the graphic form as denoting a single irreducible grammeme - if it did, then it would merit a code point, as 'ch' in some languages surely does. This does not mean that the graphical form used to represent it cannot be analyzed into constituent parts that are themselves encoded. It would not be problematic to say that the grammeme 'if' may be represented visually by the sequence of two _graphemes_ 'i' and 'f'. 
But "grammeme i" plus "grammeme f" does not equal "grammeme if" though they might equal "lexeme if" - that would be for higher level protocols to decide. U+0BCA TAMIL VOWEL SIGN O, I am willing to bet, is considered by Tamil literates a single form denoting a single grammeme. But it would be entirely reasonable to analyze the form used to denote that grammeme into its constituent parts and encode them separately _qua graphic forms_ without a corresponding grammeme denotatum. > Is Unicode's lack of capturing the semiotics of written > language a by-product of > its philosophy of characters, I think so. Also of its notion of plain text, and the whole underlying notion of "script without language". It's not the worst idea in the world, but it comes at a cost, and I've never seen a real careful analysis of what we (well, not me and my pals but certainly others) give up by adopting Unicode's modeling strategy. > or a result of the restrictions > imposed on it by > existing computer systems and software? Must have had a lot to do with it. But on the other hand, I don't think a more balanced approach would necessarily mean software designs incompatible with today's software. If it were a question of standardizing widget interfaces it wouldn't matter much, but we're talking about standardizing a model of language, which is pretty close to home for everybody. Add another possible cause: specialization. Very few people are insane enough to try to master the disparate fields (computer science, mathematical logic, linguistics, textual theory, psycholinguistics, etc etc) that converge here. Most of the people in the humanities with whom I've discussed Unicode have almost no clue as to what plain text is, let alone how formal modeling works. I don't mean that the people involved are not qualified, only that the pool is pretty small. 
Cheers, Gregg 22-Oct-99 1:40:55-GMT,2435;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id VAA03943 for ; Thu, 21 Oct 1999 21:40:55 -0400 (EDT) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA21021 for ; Thu, 21 Oct 1999 21:40:54 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id SAA48014 ; Thu, 21 Oct 1999 18:29:46 -0700 Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id SAA23478; Thu, 21 Oct 1999 18:22:47 -0700 (PDT) Message-Id: <199910220122.SAA23478@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" X-UML-Sequence: 10404 (1999-10-22 01:22:34 GMT) From: "Reynolds, Gregg" To: "Unicode List" Date: Thu, 21 Oct 1999 18:22:31 -0700 (PDT) Subject: RE: Mixed up priorities > -----Original Message----- > From: John Hudson [mailto:tiro@tiro.com] > Sent: Thursday, October 21, 1999 7:19 PM > > At 04:49 PM 21-10-99 -0700, G. Adam Stanislav wrote: > > with them, but is it true that these sorting and hyphenation rules > _require_ encoding of these digraphs as precomposed characters? > > specific sorting and hyphenation rules. Are you suggesting > that each of > these sequences _needs_ to be encoded as a precomposed character? > > Again, is it _necessary_ for this behaviour to be controlled > by encoding > these letters as individual, precomposed characters? If there > Why is the burden of proof on the users of the language? I would turn the question around: is it really _necessary_ to leave slovak/czech "ch" out of Unicode? > Remember that Unicode is a standard for encoding _plain > text_. Unicode does > not contain sorting rules for individual languages, nor does > it contain > hyphenation rules for individual languages. 
Unicode provides I don't see what plaintext, sorting and hyphenation have to do with it. Slovak and Czech literates have this thing within their culture, and they use "ch" to denote it. So if plaintext doesn't accommodate "ch", then it must not be plain text for Slovaks and Czechs. Why do we need more information than that? Utterly perplexed, Gregg 2-Nov-99 4:50:10-GMT,8856;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id XAA09403 for ; Mon, 1 Nov 1999 23:50:07 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id XAA21631 for ; Mon, 1 Nov 1999 23:50:06 -0500 (EST) Received: by humbolt.nl.linux.org id ; Tue, 2 Nov 1999 05:34:17 +0100 Received: from kiev.wall.org ([205.178.11.135]:37094 "EHLO kiev.wall.org" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Tue, 2 Nov 1999 05:33:42 +0100 Received: by kiev.wall.org (8.9.3/8.9.3) id UAA27045; Mon, 1 Nov 1999 20:28:40 -0800 (PST) Date: Mon, 1 Nov 1999 20:28:40 -0800 (PST) From: Larry Wall Message-Id: <199911020428.UAA27045@kiev.wall.org> To: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Cc: perl-unicode@perl.org, linux-utf8@nl.linux.org Subject: Re: Correct use of UTF-8 under Unix In-Reply-To: (from Markus Kuhn on Fri, 29 Oct 1999 11:09:52 +0100) X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org Markus Kuhn writes: : I have just read through the list archive, and noted that a few people : might have some doubts about how UTF-8 is used under Unix. Well, I just read through your list archive, and I think you are more of an idealist than I can afford to be. You keep saying, "If Plan 9 can do a complete conversion, so can we." But you'll notice that people aren't in fact using Plan 9, by and large. Plan 9 is a research project. 
It doesn't have millions of installations or millions of interconnections with other installations. Don't get me wrong. Perl will work fine in your idealized world. But I intend it to work okay in the other world too. I simultaneously try to keep my head in the clouds and my feet on the ground. Sometimes it's a stretch, though. : They : apparently got confused by many of the features described in the Unicode : standard (BOM, line separator, etc.), and thereby completely forgot the : big UTF-8 prime directive under Unix: : : UTF-8 is ASCII compatible Sure, and Perl banks on that to a great extent, but much of the world is not ASCII compatible. : Not only the encoding, but also the use of it. Er, only until you actually start trying to use it for anything both useful and un-American, like sorting, or updating your screen... : So don't change anything : about how ASCII was used when introducing UTF-8, because only this means : that UTF-8 can truly substitute ASCII in a realistic way: To the extent possible, I agree with you. :-) : This means the following: : : - A UTF-8 Unix plain text file that contains only ASCII characters : (and this is the majority of files on Unix installations all over : the world) will *not* change a single bit. That may be true, but I don't think it's true enough. 49% of the files in the world could be in non-ASCII, and your statement would still be strictly true. But not terribly useful. The problem is not so much files as it is interfaces. What percentage of the text you use comes from the system you're on? How is that percentage changing over time? What about if you're running a Linux set-top box that doesn't even have a disk? Or closer to current reality, did that tar file you just unpacked come from a UTF-8 only system? Will your browser convert text to UTF-8 when it saves it? What's coming down that socket you just opened? What's coming out of the file descriptor my process just inherited? 
Was it a pipe to a process on my machine, or was it a foreign port? I'm not suggesting there is an easy answer to this. In fact, I'm suggesting there isn't. And that any suggestion that there is isn't. : - This means that there is never a BOM at the start of a file. BOMs could : be ignored by special new Unicode programs, but they are definitely : not ignored by the many existing ASCII programs. Adding a : BOM would break a tremendous amount of things and would violate the : prime directive, as BOMs are definitely not ASCII compatible. I don't like BOMs either, in case you missed that. Of course, I loathe UTF-16 too, so that's not too terribly surprising. Surrogate characters are too pukey to contemplate. : - This means that lines in UTF-8 plaintext files are terminated : in one and only one way: 0x0a = LF. Neither U+2028 (line separator, : introduced for use inside *.doc-style word processing binary files) : nor overly long UTF-8 sequences for LF such as 0x80 0x8a must be accepted : as line terminators, otherwise we would get into the horrible : scenario that programs start to disagree what exactly a line is : (which a whole load of new security risks associated). Programs : such as "wc -l" must on UTF-8 files without any modification : whatsoever! There is no reason to change the Unix line semantics when : moving from ASCII to UTF-8. U+2028 is treated just like any other : character and has no special meaning in a Unix plaintext file. Fine by me, till someone asks to treat a file otherwise, in which case they should be let. What's more at issue is whether a *file* should be able to request being treated otherwise, if we give the user the right to request that files be given the right to request that they be so treated. Or some such. :-) : How do applications find out that files are now in UTF-8? Simple : applications such as cat and echo do not have to. For them UTF-8 is : just like ASCII. You oversimplify again. 
Even "cat -v" has to know how to treat bytes with the high bit set. And "echo -e" probably wants a way to interpolate characters larger than can be interpolated by \nnn. : However, programs which count characters, position : cursors, determine character classes, use regexp, etc. have to know : about the file encoding, and there are well-established mechanisms to do : that: they are told, preferably via established POSIX mechanisms : (LC_CTYPE, LANG), or via other command line switches. You have a major showstopper here as far as us Perl folks are concerned. Neither the environment nor the command line can be trusted in a setuid situation. The Perl community is for this reason particularly leary of anything having to do with locales. I noticed that you frequently invoke the name of POSIX on your mailing list, but that won't work here. Around here people will actually shudder if you say "POSIX". : Ideally, all that should be necessary to turn a Unix installation into a : pure UTF-8 system is the addition of the line : : export LC_CTYPE=UTF-8 : : in /etc/profile, plus conversion of the existing ISO 8859, JIS, KOI8, : etc. files and file names. No. It is not ideal. If you're going to have a kernel-wide switch, then ideally the kernel should tell the process. The environment simply cannot be trusted, any historical POSIX botches to the contrary notwithstanding. You've been arguing for LC_CTYPE for several months now. I hope you haven't argued for it for so long that you can't see its problems anymore. As for Perl, although it will ideally keep everything as UTF-8 internally, it'll still be assuming that it has to know on an interface-by-interface basis whether to expect UTF-8 or something else. Even on your idealized Linux, we'll still have to know what to do with the sockets connected to the real world. It is not so much more of a stretch for us to decide on a file-by-file basis, using the best available information. 
On your ideal system, the best available information might be that we should always guess files to be UTF-8. That's fine. But please don't use the environment to convey such important, system-wide information. : Editors and terminal emulators will then : activate their UTF-8 modes, email software will convert received : messages from the indicated MIME character set into UTF-8 before saving : them as a file, etc. We are not quite there yet, but that should be the : long-term goal. I would like that too. But Perl has always been about getting from here to there, and this is very much a getting-from-here-to-there problem. Nevertheless, I do appreciate idealists--at least as long as they're not collectivizing the peasants, some of whom were my third cousins living in the Ukraine before they were starved to death. So I feel I owe it to them to be able to distinguish Unicode from Russian. When the whole world joins your collective, I'll say I believed in it all along. :-) Larry - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 2-Nov-99 13:54:02-GMT,13157;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id IAA08748 for ; Tue, 2 Nov 1999 08:53:57 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id IAA17428 for ; Tue, 2 Nov 1999 08:53:56 -0500 (EST) Received: by humbolt.nl.linux.org id ; Tue, 2 Nov 1999 14:24:27 +0100 Received: from mailgw.imt.im.se ([195.100.17.67]:44542 "EHLO mail-gw.imt.im.se" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Tue, 2 Nov 1999 14:24:02 +0100 Received: from msxsth1.im.se (msxsth1.im.se [193.14.16.108]) by mail-gw.imt.im.se (8.9.3/8.9.3) with ESMTP id OAA24644; Tue, 2 Nov 1999 14:21:09 +0100 Received: by msxsth1 with Internet Mail Service (5.5.2650.21) id ; Tue, 2 Nov 1999 14:23:06 +0100 Message-ID: From: Karlsson 
Kent - keka To: "'linux-utf8@nl.linux.org'" Cc: perl-unicode@perl.org Subject: RE: Correct use of UTF-8 under Unix Date: Tue, 2 Nov 1999 14:21:40 +0100 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: multipart/alternative; boundary="----_=_NextPart_001_01BF2535.65AC0570" X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ------_=_NextPart_001_01BF2535.65AC0570 Content-Type: text/plain; charset="iso-8859-1" (Note: I don't subscribe to perl-unicode@perl.org, only to linux-utf8@nl.linux.org, and I don't have Markus's original message that is quoted below.) > : - This means that lines in UTF-8 plaintext files are terminated > : in one and only one way: 0x0a = LF. That is not true. "lines" in UTF-8 text files may be terminated by LINE FEED, CARRIAGE RETURN, CARRIAGE RETURN+LINE FEED, NEXT LINE, or end-of-file, or be separated by LINE SEPARATOR or PARAGRAPH SEPARATOR (which is in some sense 'stronger' than line separator). (I don't know what originally came before the "This means that" in Markus's message.) > Neither U+2028 (line separator, > : introduced for use inside *.doc-style word processing binary files) That is not true. LINE SEPARATOR and PARAGRAPH SEPARATOR were once introduced in the hope that they would "clear up the line ending mess". (Whether they are used in ".doc"-style documents is a separate issue.) That hope has not come to fruition yet, and it will take time before the "line ending mess" is overcome whatever way is used to overcome it. Unicode Technical Report 13, Unicode Newline Guidelines (http://www.unicode.org/unicode/reports/tr13/), gives some guidelines on how to increase the interoperability with regard to "new line function" (NLF) and LS/PS handling. 
Basically the recommendation is to accept all commonly occurring NLFs: CR, CR+LF, LF, the EBCDIC originated NL (NEXT LINE; U+0085; admittedly rare), as well as LS and PS (and allow EOF to 'terminate a line'). I think they should be accepted in any mixture. Most(?) C compilers already appear to handle at least both LF and CR+LF (mixed) fairly well. This makes it easier to handle C source files in a "mixed environment". Shell scripts, yacc/bison files, etc. are still problematic since their lexers still expect only LF. > : nor overly long UTF-8 sequences for LF such as 0x80 0x8a must be accepted True, unduly long UTF-8 encodings in general should be considered malformed. > : as line terminators, otherwise we would get into the horrible > : scenario that programs start to disagree what exactly a line is > : (which a whole load of new security risks associated). Programs > : such as "wc -l" must on UTF-8 files without any modification > : whatsoever! There is no reason to change the Unix line semantics when > : moving from ASCII to UTF-8. U+2028 is treated just like any other > : character and has no special meaning in a Unix plaintext file. U+2028 and U+2029 should be handled as just another way of indicating line separation/end (as should end-of-file) for the purposes of perl/C/lex/bison/Ada/etc. Neither of these need to distinguish between line and paragraph separation, and all of these ways of terminating/separating lines should be treated the same, for increased interoperability. Of course, to be able to detect NL, LS, and PS one needs to know the character encoding first, since they have different codes and are indeed not possible to represent in all encodings. But the same goes for NL and CR too really, if UTF-16 is allowed, which it should be in at least some circumstances. (No, I don't like little endianism nor "BOM".) Note that several programming languages, e.g. Java, Ada, and C, allow non-ASCII in identifiers, with identifier identity defined via the UCS. 
But they don't require a particular character encoding for the source files, so compilers for these programming languages MUST 'know' the character encoding of an individual source file (via a compiler flag, system/individual/folder default, or similar) in order to compile the source code correctly anyway. Similarly for XML and its tag and attribute names, but each XML file should self-declare which character encoding it is in. Which way of ending/terminating lines should be preferred on output? Might depend on a preference setting, or an editing change (like "turn all NLFs into LS"). Kind regards /Kent Karlsson ------_=_NextPart_001_01BF2535.65AC0570 Content-Type: text/html; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable RE: Correct use of UTF-8 under Unix

    (Note: I don't subscribe to perl-unicode@perl.org, only to
    linux-utf8@nl.linux.org, and I don't have Markus's original
    message that is quoted below.)


    > :   - This means that lines in UTF-8 plaintext files are terminated
    > :     in one and only one way: 0x0a = LF.

    That is not true.  "lines" in UTF-8 text files may be terminated by
    LINE FEED, CARRIAGE RETURN, CARRIAGE RETURN+LINE FEED, NEXT LINE,
    or end-of-file, or be separated by LINE SEPARATOR or PARAGRAPH SEPARATOR
    (which is in some sense 'stronger' than line separator).

    (I don't know what originally came before the "This means that" in
    Markus's message.)


    >                 Neither U+2028 (line separator,
    > :     introduced for use inside *.doc-style word processing binary files)

    That is not true.  LINE SEPARATOR and PARAGRAPH SEPARATOR were once
    introduced in the hope that they would "clear up the line ending mess".
    (Whether they are used in ".doc"-style documents is a separate issue.)
    That hope has not come to fruition yet, and it will take time before
    the "line ending mess" is overcome whatever way is used to overcome it.
    Unicode Technical Report 13, Unicode Newline Guidelines
    (http://www.unicode.org/unicode/reports/tr13/), gives some guidelines
    on how to increase the interoperability with regard to "new line
    function" (NLF) and LS/PS handling.  Basically the recommendation is
    to accept all commonly occurring NLFs: CR, CR+LF, LF, the EBCDIC
    originated NL (NEXT LINE; U+0085; admittedly rare), as well as
    LS and PS (and allow EOF to 'terminate a line').  I think they
    should be accepted in any mixture.
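
The recommendation above (accept CR, CR+LF, LF, NL, LS and PS in any mixture, and let end-of-file terminate the last line) might be sketched as follows. This is only an illustration in Python 3; the name `split_lines` is hypothetical and does not come from UTR 13 or from any tool discussed in this thread.

```python
import re

# CR+LF must be tried first so that it counts as one break, not two.
# \x85 is NEXT LINE, \u2028 is LINE SEPARATOR, \u2029 is PARAGRAPH SEPARATOR.
NLF_OR_SEPARATOR = re.compile(r"\r\n|[\n\r\x85\u2028\u2029]")


def split_lines(text):
    """Split decoded text on any of the line breaks listed above."""
    parts = NLF_OR_SEPARATOR.split(text)
    # End-of-file may terminate the last line, so a trailing break
    # does not create an extra empty line.
    if parts and parts[-1] == "":
        parts.pop()
    return parts
```

Note that the input must already be decoded text; the sketch treats LS and PS exactly like the other breaks and makes no attempt to preserve the line/paragraph distinction.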

    Most(?) C compilers already appear to handle at least both LF and CR+LF
    (mixed) fairly well.  This makes it easier to handle C source files in a
    "mixed environment".  Shell scripts, yacc/bison files, etc. are still
    problematic since their lexers still expect only LF.


    > :     nor overly long UTF-8 sequences for LF such as 0x80 0x8a must be accepted

    True, unduly long UTF-8 encodings in general should be considered malformed.
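
As a small illustration of treating overlong encodings as malformed: a two-byte overlong form of LF would be the bytes 0xC0 0x8A (the "0x80 0x8a" in the quoted text appears to be a slip), and a strict decoder refuses it. The sketch below assumes Python 3, whose built-in UTF-8 decoder rejects overlong sequences; `is_well_formed_utf8` is a hypothetical helper name.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True only if data decodes under strict UTF-8 rules."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True


PLAIN_LF = b"\x0a"          # the one valid UTF-8 encoding of LF
OVERLONG_LF = b"\xc0\x8a"   # overlong two-byte form, must be rejected
LINE_SEPARATOR = "\u2028".encode("utf-8")  # valid three-byte sequence
```

A reader that validates input this way never has to decide whether an overlong LF counts as a line terminator, because such input is rejected before any line splitting happens.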


    > :     as line terminators, otherwise we would get into the horrible
    > :     scenario that programs start to disagree what exactly a line is
    > :     (which a whole load of new security risks associated). Programs
    > :     such as "wc -l" must on UTF-8 files without any modification
    > :     whatsoever! There is no reason to change the Unix line semantics when
    > :     moving from ASCII to UTF-8. U+2028 is treated just like any other
    > :     character and has no special meaning in a Unix plaintext file.

    U+2028 and U+2029 should be handled as just another way of
    indicating line separation/end (as should end-of-file) for the
    purposes of perl/C/lex/bison/Ada/etc. Neither of these need to
    distinguish between line and paragraph separation, and all of
    these ways of terminating/separating lines should be treated
    the same, for increased interoperability.

    Of course, to be able to detect NL, LS, and PS one needs to know
    the character encoding first, since they have different codes and
    are indeed not possible to represent in all encodings.  But the
    same goes for NL and CR too really, if UTF-16 is allowed, which
    it should be in at least some circumstances. (No, I don't like
    little endianism nor "BOM".)
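
To make the encoding dependence concrete, here is a hedged sketch in Python 3 that shows how the same break characters come out as different bytes, or cannot be represented at all, depending on the encoding. The helper name `encoded_forms` and the choice of the three encodings are assumptions made for illustration only.

```python
def encoded_forms(ch, encodings=("utf-8", "utf-16-be", "latin-1")):
    """Map each encoding name to the bytes for ch, or None if unrepresentable."""
    forms = {}
    for enc in encodings:
        try:
            forms[enc] = ch.encode(enc)
        except UnicodeEncodeError:
            # e.g. LS and PS do not exist in ISO 8859-1
            forms[enc] = None
    return forms


LS_FORMS = encoded_forms("\u2028")  # LINE SEPARATOR
NL_FORMS = encoded_forms("\x85")    # NEXT LINE
CR_FORMS = encoded_forms("\r")      # CARRIAGE RETURN
```

A byte-oriented scanner looking for 0x0D will not find CR reliably in UTF-16 data, where it is the two-byte unit 0x00 0x0D, which is the point made about NL and CR above.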

    Note that several programming languages, e.g. Java, Ada, and C,
    allow non-ASCII in identifiers, with identifier identity defined
    via the UCS. But they don't require a particular character
    encoding for the source files, so compilers for these programming
    languages MUST 'know' the character encoding of an individual
    source file (via a compiler flag, system/individual/folder default,
    or similar) in order to compile the source code correctly anyway.
    Similarly for XML and its tag and attribute names, but each XML
    file should self-declare which character encoding it is in.

    Which way of ending/terminating lines should be preferred on output?
    Might depend on a preference setting, or an editing change (like
    "turn all NLFs into LS").
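
An editing change such as "turn all NLFs into LS" could look roughly like the sketch below. It assumes Python 3 and decoded text; `convert_breaks` and its `target` parameter are hypothetical names, and the default of LF is just one possible preference setting.

```python
import re

# CR+LF is listed first so the pair is replaced once, not twice.
_ANY_BREAK = re.compile(r"\r\n|[\n\r\x85\u2028\u2029]")


def convert_breaks(text, target="\n"):
    """Replace every recognised line break in text with target."""
    # A function is used as the replacement so that target is inserted
    # literally, without backslash-escape processing by re.sub.
    return _ANY_BREAK.sub(lambda match: target, text)
```

For example, `convert_breaks(text, "\u2028")` rewrites every break as LINE SEPARATOR, while the default call collapses them all to LF for tools that expect Unix line semantics. The conversion is lossy: PS and LS become indistinguishable afterwards.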

                    Kind regards
                    /Kent Karlsson

    ------_=_NextPart_001_01BF2535.65AC0570-- - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 2-Nov-99 14:47:11-GMT,7230;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id JAA24891 for ; Tue, 2 Nov 1999 09:47:08 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id JAA29433 for ; Tue, 2 Nov 1999 09:47:07 -0500 (EST) Received: by humbolt.nl.linux.org id ; Tue, 2 Nov 1999 15:04:34 +0100 Received: from heaton.cl.cam.ac.uk ([128.232.32.11]:37138 "EHLO heaton.cl.cam.ac.uk" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Tue, 2 Nov 1999 15:03:56 +0100 Received: from trillium.cl.cam.ac.uk ([128.232.8.5] helo=cl.cam.ac.uk ident=mgk25) by heaton.cl.cam.ac.uk with esmtp (Exim 3.01 #1) id 11ieXO-0004Xl-00; Tue, 02 Nov 1999 14:03:50 +0000 X-Mailer: exmh version 2.0.2+CL 2/24/98 To: "'linux-utf8@nl.linux.org'" , perl-unicode@perl.org Subject: Re: Correct use of UTF-8 under Unix In-reply-to: Your message of "Tue, 02 Nov 1999 14:21:40 +0100." X-URL: http://www.cl.cam.ac.uk/~mgk25/ Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Date: Tue, 02 Nov 1999 14:03:47 +0000 From: Markus Kuhn Message-Id: X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org Karlsson Kent - keka wrote on 1999-11-02 13:21 UTC: > (Note: I don't subscribe to perl-unicode@perl.org, only to > linux-utf8@nl.linux.org, and I don't have Markus's original > message that is quoted below.) > > > > : - This means that lines in UTF-8 plaintext files are terminated > > : in one and only one way: 0x0a = LF. > > That is not true. 
"lines" in UTF-8 text files may be terminated by > LINE FEED, CARRIAGE RETURN, CARRIAGE RETURN+LINE FEED, NEXT LINE, > or end-of-file, or be separated by LINE SEPARATOR or PARAGRAPH SEPARATOR > (which is in some sense 'stronger' than line separator). The crucial bit of my original message that you missed was: I have just read through the list archive, and noted that a few people might have some doubts about how UTF-8 is used under Unix. They apparently got confused by many of the features described in the Unicode standard (BOM, line separator, etc.), and thereby completely forgot the big UTF-8 prime directive under Unix: UTF-8 is ASCII compatible Not only the encoding, but also the use of it. So don't change anything about how ASCII was used when introducing UTF-8, because only this means that UTF-8 can truly substitute ASCII in a realistic way: This means the following: - A UTF-8 Unix plain text file that contains only ASCII characters (and this is the majority of files on Unix installations all over the world) will *not* change a single bit. [...] There are many nice ideas written up in the Unicode standard and the associated technical reports, however they are not a dogma and each idea has to be critically reviewed before you even consider introducing them into an existing environment. It should become very quickly clear to the alert reader of these documents that many of the mechanisms described there (most notably the byte-order-mark and the new-line semantics) are irrelevant for the use of UTF-8 as a backwards compatible migration path for ASCII plaintext files on Unix systems. Unix never had any new line ambiguity. It was always LF and only LF. It would be really foolish for us to introduce a brand new new-line ambiguity (via say the line separator) on Unix systems just because we read about shiny new alternative ways in a Unicode technical report. 
The original AT&T Bell Labs developers of Unix have already studied back in 1992, how ISO 10646 is best used on Unix-style systems. They concluded to replace ASCII completely by UTF-8 on their experimental Unix-successor system Plan9 and reported about the excellent practical experiences that they made in this process in a now legendary USENIX paper, which I am sure you all are well familiar with: ftp://ftp.informatik.uni-erlangen.de/pub/doc/ISO/charsets/UTF-8-Plan9-paper.ps.gz If the outside world does something different (they always have, you listed the three most popular other newline conventions CR, CRLF, and NL, yourself), then we will continue to convert, either automatically or manually, as appropriate. The handling of newline ambiguity by C under Unix has always been a NOP. C under Unix is to completely ignore the "b" mode option of fopen(). The "b" option is a hack for the rest of the world to allow it to handle its Unix incompatibilities. I have nothing against introducing besides the normal "plain text" also a new text file format that we could call "unformatted plain text". It would be a stream of characters interrupted by Unicode paragraph separator characters. The PS and LS characters would have exactly the same role as a
<P> and <BR> in HTML or a \par and \hfil\break in TeX. Such an additional file type notion would indeed be interesting to have available, but it would not be used for formatted plain text files such as - software source code - configuration files - shell scripts - everything sent to standard output etc. for obvious reasons of backwards compatibility. An unformatted text format (and a whole range of new tools or new modes of existing tools to support handling it) would however be very convenient for file types such as - HTML/SGML/XML - TeX - nroff where the formatting of the plain-text file is discarded anyway. It would save us having to press paragraph-reformat so frequently in editors, and it would make diff files smaller, because paragraphs would no longer contain any formatting indicators such as LF that have to be rearranged throughout the entire paragraph if you change just a single word. For normal "plain text" files, the process writing a paragraph has fixed the positions of the line breaks, for "unformatted plain text" files, the process reading the paragraphs is responsible to think about placing line breaks. Just as in TeX, HTML, etc. There is nothing wrong, with having these note-pad style unformatted plain text files as well supported under Unix, but it is important to make clear that this is an entirely new file type with no relationship to the existing plaintext notion. The distinction of the two file types is easy: If it contains at least one LF character, it is a normal plain text file, if it does not contain a single LF character (but zero or more PS and/or LS characters), then it is a new-style unformatted plaintext file. Either way, you'll find out soon enough when reading the file at the end of the first line (formatted) or paragraph (unformatted). Markus -- Markus G. 
Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 4-Nov-99 2:46:24-GMT,4746;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id VAA21148 for ; Wed, 3 Nov 1999 21:46:20 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA24083 for ; Wed, 3 Nov 1999 21:46:20 -0500 (EST) Received: by humbolt.nl.linux.org id ; Thu, 4 Nov 1999 02:34:42 +0100 Received: from kiev.wall.org ([205.178.11.135]:45804 "EHLO kiev.wall.org" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Thu, 4 Nov 1999 02:34:07 +0100 Received: by kiev.wall.org (8.9.3/8.9.3) id RAA10793; Wed, 3 Nov 1999 17:31:16 -0800 (PST) Date: Wed, 3 Nov 1999 17:31:16 -0800 (PST) From: Larry Wall Message-Id: <199911040131.RAA10793@kiev.wall.org> To: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Cc: "'linux-utf8@nl.linux.org'" , perl-unicode@perl.org Subject: Re: Correct use of UTF-8 under Unix In-Reply-To: (from Markus Kuhn on Tue, 02 Nov 1999 14:03:47 +0000) X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org Markus Kuhn writes: : There is nothing wrong, with having these note-pad style unformatted : plain text files as well supported under Unix, but it is important to : make clear that this is an entirely new file type with no relationship : to the existing plaintext notion. : : The distinction of the two file types is easy: If it contains at least : one LF character, it is a normal plain text file, if it does not contain : a single LF character (but zero or more PS and/or LS characters), then : is is a new/style unformatted plaintext file. 
Either way, you'll find : out soon enough when reading the file at the end of the first line : (formatted) or paragraph (unformatted). I have one quibble with your hard and fast distinction between the two file types here. And that is that Perl scripts themselves might want to be both types simultaneously! It's considered good style to put the documentation into the same file as the code it documents, and while the code certainly wants to be newline delimited, the documentation is in POD format, and it would be perfectly fine to treat POD text paragraphs as a word processor would. In fact, POD was specifically designed so that filled paragraphs could be distinguished from non-filled text on the basis of the first character of the paragraph. The only problem I see offhand with allowing both styles in the same file is that different tools might count lines differently. If Perl says there's a syntax error at line 582, it might mean it has seen 581 instances of /\012 | \015\012 | \015 | \X{2028} | \X{2029}/x before the error. (For folks listening in, that works out to Unix newline, Windows newline, Mac newline (!), Unicode line separator and Unicode paragraph separator.) If your "normal plain text" editor then counts only \012 (Unix newline), the programmer isn't going to be able to find the error. On the other hand, maybe Perl would just count newlines, and your editor counts it the other way. More likely, some editors count one way, and other editors count another. Maybe they count LS but not PS, just as Perl currently counts \n but not \f as a line transition. There are many possibilities. All I'm really arguing here is that it would be good to establish a line counting convention. But if that convention pretends there won't be files mixing the two line delimitation styles, that will have other ramifications, including possibly an adverse impact on portability. Counting line numbers right is already pretty complicated when you have NFS mounts from foreign systems. 
Adding in Unicode will only make things more complicated. There will be some pressure to use Unicode LS/PS in portable code, and I'm not sure you want to spend the rest of your life resisting that pressure. A lot of the "fixes" in Perl are only there because we got tired of people asking the same questions over and over. I think assuming that files will only be one style or the other will put us into that sort of a situation, and it would be nice to head it off early, for some definition of early. Just telling people by fiat that they can't mix the two styles is not likely to work in the absence of universal education. Unfortunately, the education of the illegitimi tends to result in carborundum. Larry - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 4-Nov-99 12:26:34-GMT,11587;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id HAA29245 for ; Thu, 4 Nov 1999 07:26:28 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id HAA11601 for ; Thu, 4 Nov 1999 07:26:27 -0500 (EST) Received: by humbolt.nl.linux.org id ; Thu, 4 Nov 1999 12:46:15 +0100 Received: from mailgw.imt.im.se ([195.100.17.67]:30090 "EHLO mail-gw.imt.im.se" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Thu, 4 Nov 1999 12:45:45 +0100 Received: from msxsth1.im.se (msxsth1.im.se [193.14.16.108]) by mail-gw.imt.im.se (8.9.3/8.9.3) with ESMTP id MAA22788; Thu, 4 Nov 1999 12:43:17 +0100 Received: by msxsth1 with Internet Mail Service (5.5.2650.21) id ; Thu, 4 Nov 1999 12:45:07 +0100 Message-ID: From: Karlsson Kent - keka To: "'linux-utf8@nl.linux.org'" Cc: perl-unicode@perl.org Subject: RE: Correct use of UTF-8 under Unix Date: Thu, 4 Nov 1999 12:43:40 +0100 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2650.21) Content-Type: multipart/alternative; 
boundary="----_=_NextPart_001_01BF26BA.0A66AFB0" X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org This message is in MIME format. Since your mail reader does not understand this format, some or all of this message may not be legible. ------_=_NextPart_001_01BF26BA.0A66AFB0 Content-Type: text/plain; charset="iso-8859-1" Hi! Larry is right in that there is (already, also under Unix) other ways of separating lines: namely form feed, but also vertical tab. I must admit that I have never used vertical tab, and very rarely form feed... Anyway C9x says: "\v (vertical tab) Moves the active position to the initial position of the next vertical tab position." And there is a similar statement about form feed. I assume that is not too far off from what other standards might say. So, the interoperable line (or 'stronger') separators in "plain text" are: \X{2028}|\X{2029}|\r\n|\n|\r|\f|\v|\X{85} (I'm probably mixing Perl and C (and flex) syntax here.) Some of them are "stronger" in some senses than line separation, but for the purposes of counting logical lines, and deciding logical line begin and logical line end, there should be no difference. A single logical line may be *dynamically* wrapped into several displayed lines, but that is a different matter. Note that there are some "legacy" encodings which do not have any or all of \f|\v|\X{85}. (I still think the idea of having two different kinds of "plain text" is a bad idea. I haven't heard anyone else entertain it either.) Kind regards /Kent K Larry Wall wrote: ... > The only problem I see offhand with allowing both styles in the same > file is that different tools might count lines differently. If Perl > says there's a syntax error at line 582, it might mean it has seen 581 > instances of /\012 | \015\012 | \015 | \X{2028} | \X{2029}/x > before the > error. 
(For folks listening in, that works out to Unix > newline, Windows > newline, Mac newline (!), Unicode line separator and Unicode paragraph > separator.) If your "normal plain text" editor then counts only \012 > (Unix newline), the programmer isn't going to be able to find > the error. > > On the other hand, maybe Perl would just count newlines, and your > editor counts it the other way. More likely, some editors count one > way, and other editors count another. Maybe they count LS but not PS, > just as Perl currently counts \n but not \f as a line transition. > There are many possiblities. > > All I'm really arguing here is that it would be good to establish a > line counting convention. But if that convention pretends there won't > be files mixing the two line delimitation styles, that will have other > ramifications, including possibly an adverse impact on portability. > Counting line numbers right is already pretty complicated > when you have > NFS mounts from foreign systems. Adding in Unicode will only make > things more complicated. There will be some pressure to use Unicode > LS/PS in portable code, and I'm not sure you want to spend the rest of > your life resisting that pressure. A lot of the "fixes" in Perl are > only there because we got tired of people asking the same questions > over and over. > > I think assuming that files will only be one style or the other will > put us into that sort of a situation, and it would be nice to head it > off early, for some definition of early. Just telling people by fiat > that they can't mix the two styles is not likely to work in > the absence > of universal education. Unfortunately, the education of the > illegitimi > tends to result in carborundum. > > Larry > - > Linux-UTF8: i18n of Linux on all levels > Archive: http://mail.nl.linux.org/lists/ > ------_=_NextPart_001_01BF26BA.0A66AFB0 Content-Type: text/html; charset="iso-8859-1" RE: Correct use of UTF-8 under Unix

    Hi!

            Larry is right in that there are (already, also under Unix)
    other ways of separating lines: namely form feed, but also vertical
    tab. I must admit that I have never used vertical tab, and very
    rarely form feed... Anyway C9x says: "\v (vertical tab) Moves the
    active position to the initial position of the next vertical tab
    position." And there is a similar statement about form feed. I
    assume that is not too far off from what other standards might say. 

            So, the interoperable line (or 'stronger') separators in
    "plain text" are:

            \X{2028}|\X{2029}|\r\n|\n|\r|\f|\v|\X{85}

    (I'm probably mixing Perl and C (and flex) syntax here.) Some
    of them are "stronger" in some senses than line separation,
    but for the purposes of counting logical lines, and deciding
    logical line begin and logical line end, there should be no
    difference.  A single logical line may be *dynamically* wrapped
    into several displayed lines, but that is a different matter.
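
    A minimal sketch of the counting rule described above, in Python. It
    treats every separator in the list as exactly one logical line
    transition. The function names are illustrative only and do not come
    from any tool discussed in this thread.

```python
import re

# Separators from the list above. CRLF is tried first so that it counts
# as a single line transition rather than two.
LINE_SEP = re.compile("\r\n|[\n\r\f\v\u0085\u2028\u2029]")


def split_logical_lines(text):
    """Split text into logical lines, treating every separator alike."""
    return LINE_SEP.split(text)


def line_of_offset(text, offset):
    """Return the 1-based logical line number containing character offset."""
    return len(LINE_SEP.findall(text, 0, offset)) + 1


sample = "a\r\nb\nc\rd\u2028e\u2029f\fg\vh\u0085i"
print(len(split_logical_lines(sample)))
print(line_of_offset(sample, sample.index("d")))
```

    Python's built-in str.splitlines() follows a similar convention, but it
    also splits on the information separators U+001C to U+001E, so it is not
    an exact match for the list given here.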

            Note that there are some "legacy" encodings which do not
    have any or all of \f|\v|\X{85}.

            (I still think that having two different kinds of
    "plain text" is a bad idea.  I haven't heard anyone else
    entertain it either.)

                    Kind regards
                    /Kent K


    Larry Wall wrote:
    ...
    > The only problem I see offhand with allowing both styles in the same
    > file is that different tools might count lines differently.  If Perl
    > says there's a syntax error at line 582, it might mean it has seen 581
    > instances of /\012 | \015\012 | \015 | \X{2028} | \X{2029}/x before the
    > error.  (For folks listening in, that works out to Unix newline, Windows
    > newline, Mac newline (!), Unicode line separator and Unicode paragraph
    > separator.)  If your "normal plain text" editor then counts only \012
    > (Unix newline), the programmer isn't going to be able to find the error.
    >
    > On the other hand, maybe Perl would just count newlines, and your
    > editor counts it the other way.  More likely, some editors count one
    > way, and other editors count another.  Maybe they count LS but not PS,
    > just as Perl currently counts \n but not \f as a line transition.
    > There are many possibilities.
    >
    > All I'm really arguing here is that it would be good to establish a
    > line counting convention.  But if that convention pretends there won't
    > be files mixing the two line delimitation styles, that will have other
    > ramifications, including possibly an adverse impact on portability.
    > Counting line numbers right is already pretty complicated when you have
    > NFS mounts from foreign systems.  Adding in Unicode will only make
    > things more complicated.  There will be some pressure to use Unicode
    > LS/PS in portable code, and I'm not sure you want to spend the rest of
    > your life resisting that pressure.  A lot of the "fixes" in Perl are
    > only there because we got tired of people asking the same questions
    > over and over.
    >
    > I think assuming that files will only be one style or the other will
    > put us into that sort of a situation, and it would be nice to head it
    > off early, for some definition of early.  Just telling people by fiat
    > that they can't mix the two styles is not likely to work in the absence
    > of universal education.  Unfortunately, the education of the illegitimi
    > tends to result in carborundum.
    >
    > Larry
    > -
    > Linux-UTF8:   i18n of Linux on all levels
    > Archive:      http://mail.nl.linux.org/lists/
    >

    ------_=_NextPart_001_01BF26BA.0A66AFB0-- - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 4-Nov-99 7:43:22-GMT,3063;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id CAA23594 for ; Thu, 4 Nov 1999 02:43:19 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id CAA15155 for ; Thu, 4 Nov 1999 02:43:19 -0500 (EST) Received: by humbolt.nl.linux.org id ; Thu, 4 Nov 1999 08:29:16 +0100 Received: from robin.camelot.de ([195.30.224.3]:25607 "EHLO mail.camelot.de" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Thu, 4 Nov 1999 08:27:56 +0100 Received: from robin.camelot.de (uucp@robin.camelot.de [195.30.224.3]) by mail.camelot.de (8.9.3/8.9.3) with ESMTP id IAA96660; Thu, 4 Nov 1999 08:27:49 +0100 (CET) Received: from oas.a2e.de (uucp@localhost) by robin.camelot.de (8.9.3/8.9.3) with UUCP id IAA96657; Thu, 4 Nov 1999 08:27:49 +0100 (CET) Received: from localhost by wtao97 via sendmail with esmtp id for ; Thu, 4 Nov 1999 08:25:48 +0100 (CET) (Smail-3.2 1996-Jul-4 #1 built 1999-Jul-23) Date: Thu, 4 Nov 1999 08:25:47 +0100 (CET) From: PILCH Hartmut X-Sender: phm@wtao97.oas.a2e.de To: linux-utf8@nl.linux.org cc: Markus Kuhn Subject: filetype field? In-Reply-To: <199911032207.XAA11389@moolenaar.net> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org On Wed, 3 Nov 1999, Bram Moolenaar wrote: > > - The Unix kernel #!/bin/sh mechanism will break, because the > > file will not start any more with #! > > Good point. Putting the BOM in the second line would work. But that's a bit > strange. It would be better to adjust the kernel to handle UTF-8 files, and > thus ignore the BOM in this position. 
Just one more place that needs to be > UTF-8 aware, not a big deal. If the kernel is to look for a UTF-8 BOM, it might as well look for a general encoding marker. That seems to be what you are using the BOM for. There is no byte order to be marked in UTF-8 texts, is there? If the kernel is to be changed, why not go to the roots and introduce a filetype field into the inode table, similar to the permissions field, with commands like $ chft "text/plain; charset=utf-8" file1.txt $ chft "text/plain; charset=iso-8859-1" file2.txt $ chft "image/png" file.png and a /etc/filetypes table that associates mime types to code numbers in a tending-to-become-standardized way? That could at least ensure that no BOMs are misplaced during $ cat file1.txt file2.txt > file.txt and might solve a lot of other problems. -- phm - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 10-Nov-99 19:46:33-GMT,5140;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id OAA15790 for ; Wed, 10 Nov 1999 14:46:32 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA17442 for ; Wed, 10 Nov 1999 14:46:30 -0500 (EST) Received: by humbolt.nl.linux.org id ; Wed, 10 Nov 1999 20:08:31 +0100 Received: from montreal.alis.com ([199.84.165.66]:5050 "EHLO montreal.alis.com" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Wed, 10 Nov 1999 20:08:06 +0100 Received: from fyergeau2 (intralan.alis.com [199.84.165.3]) by montreal.alis.com (8.9.3/8.9.3-pl-1) with SMTP id OAA26292; Wed, 10 Nov 1999 14:06:19 -0500 (EST) From: "François Yergeau"
To: , Subject: RE: Unicode control characters Date: Wed, 10 Nov 1999 13:59:37 -0500 Message-ID: <000501bf2bad$bc982f50$2f8011ac@fyergeau2.intra.alis.com> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook 8.5, Build 4.71.2173.0 In-Reply-To: X-MimeOLE: Produced By Microsoft MimeOLE V4.72.2106.4 Importance: Normal X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org > From: Markus Kuhn > Date: Wednesday, 10 November 1999 10:25 > > MIME text/plain body parts are clearly preformatted CR LF > separated lines of printable characters, and UTF-8 really should not > change anything here. Yes. In MIME text a line end is CRLF, period. > Maybe it would indeed be a wise idea to supplement RFC 2279 with an > additional spec that clarifies which Unicode characters are > allowed to > be used in a "text/plain ; charset=UTF-8" MIME part. It seems like a > good idea to explicitly exclude everything in the Cf category. This > includes the following entries from the Unicode database: I think this would be very unwise. > 200C;ZERO WIDTH NON-JOINER;Cf;0;BN;;;;;N;;;;; > 200D;ZERO WIDTH JOINER;Cf;0;BN;;;;;N;;;;; > 200E;LEFT-TO-RIGHT MARK;Cf;0;L;;;;;N;;;;; > 200F;RIGHT-TO-LEFT MARK;Cf;0;R;;;;;N;;;;; > 202A;LEFT-TO-RIGHT EMBEDDING;Cf;0;LRE;;;;;N;;;;; > 202B;RIGHT-TO-LEFT EMBEDDING;Cf;0;RLE;;;;;N;;;;; > 202C;POP DIRECTIONAL FORMATTING;Cf;0;PDF;;;;;N;;;;; > 202D;LEFT-TO-RIGHT OVERRIDE;Cf;0;LRO;;;;;N;;;;; > 202E;RIGHT-TO-LEFT OVERRIDE;Cf;0;RLO;;;;;N;;;;; These are required for minimal, understandable encoding of various languages (mainly bidi scripts, but the ZWJ and ZWNJ are also necessary for the Indic scripts, at least). These are NOT gadgets introduced by the high-flying Unicode folks solely for fancy GUI settings, they are there to meet plain text requirements. 
The legacy encodings for those languages that 10646/Unicode integrated (ASMO, ISCII, etc.) had similar controls, by necessity. > 206A;INHIBIT SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;; > 206B;ACTIVATE SYMMETRIC SWAPPING;Cf;0;BN;;;;;N;;;;; > 206C;INHIBIT ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;; > 206D;ACTIVATE ARABIC FORM SHAPING;Cf;0;BN;;;;;N;;;;; > 206E;NATIONAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;; > 206F;NOMINAL DIGIT SHAPES;Cf;0;BN;;;;;N;;;;; These are crap. The Unicode standard 2.0 says that their use is "strongly discouraged". Note however that at least the NATIONAL DIGIT SHAPE stuff does come, AFAIK, from old encodings used in terminal applications, not fancy GUI stuff. > FEFF;ZERO WIDTH NO-BREAK SPACE;Cf;0;BN;;;;;N;BYTE ORDER MARK;;;; Ah! the infamous BOM! It has a valid use in plain text in indicating a place where a word break should not occur. A bit fancy for email, but plain text has other uses than email. > FFF9;INTERLINEAR ANNOTATION ANCHOR;Cf;0;BN;;;;;N;;;;; > FFFA;INTERLINEAR ANNOTATION SEPARATOR;Cf;0;BN;;;;;N;;;;; > FFFB;INTERLINEAR ANNOTATION TERMINATOR;Cf;0;BN;;;;;N;;;;; These are specifically defined not to be used in plain text. They are meant to be used, for instance, in a word processor that keeps text and "markup" (formatting info) in separate memory structures. The markup can use those beasties as place holders in the text, to indicate where the markup applies. > 070F;SYRIAC ABBREVIATION MARK;Cf;0;BN;;;;;N;;;;; > 180B;MONGOLIAN FREE VARIATION SELECTOR ONE;Cf;0;BN;;;;;N;;;;; > 180C;MONGOLIAN FREE VARIATION SELECTOR TWO;Cf;0;BN;;;;;N;;;;; > 180D;MONGOLIAN FREE VARIATION SELECTOR THREE;Cf;0;BN;;;;;N;;;;; > 180E;MONGOLIAN VOWEL SEPARATOR;Cf;0;BN;;;;;N;;;;; > > which I am not sure about what they are good for (seems to be new in > 3.0). The scripts are new to 3.0. I'm quite sure those guys were not introduced lightly and have a real requirement in plain text for the relevant scripts. 
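To see which characters a blanket "exclude everything in Cf" rule would actually remove from a given text, a small sketch along the following lines can help. It uses Python's standard unicodedata module; the function name and the sample string are made up for illustration. Category assignments depend on the Unicode version bundled with the interpreter, so results for some code points in the quoted list may differ from the Unicode 3.0 data shown above.

```python
import unicodedata


def format_controls(text):
    """Return (offset, code point, name) for each Cf character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Hypothetical sample: a RIGHT-TO-LEFT MARK and a ZERO WIDTH JOINER
# embedded in otherwise ordinary text.
sample = "abc\u200Fdef\u200Dg"
for offset, code, name in format_controls(sample):
    print(offset, code, name)
```

A filter that dropped every reported character would strip the joiners and directional marks that the reply above describes as plain text requirements for bidi and Indic scripts.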
- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 20-Dec-99 2:19:12-GMT,3300;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id VAA16870 for ; Sun, 19 Dec 1999 21:19:05 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id VAA19947 for ; Sun, 19 Dec 1999 21:19:04 -0500 (EST) Received: by humbolt.nl.linux.org id ; Mon, 20 Dec 1999 03:17:46 +0100 Received: from kiev.wall.org ([205.178.11.135]:20989 "EHLO kiev.wall.org" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Mon, 20 Dec 1999 03:17:12 +0100 Received: by kiev.wall.org (8.9.3/8.9.3) id SAA28653; Sun, 19 Dec 1999 18:11:51 -0800 (PST) Date: Sun, 19 Dec 1999 18:11:51 -0800 (PST) From: Larry Wall Message-Id: <199912200211.SAA28653@kiev.wall.org> To: Markus.Kuhn@cl.cam.ac.uk (Markus Kuhn) Cc: perl-unicode@perl.org, linux-utf8@nl.linux.org Subject: Re: ASCII and Unicode Quotation Marks In-Reply-To: (from Markus Kuhn on Sun, 19 Dec 1999 18:12:11 +0000) X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org Markus Kuhn writes: : If you use any software that writes `quote', please submit to the author : a patch and point her to the above URL for background information. Thanks! Please note that if you "fix" the m4 program this way it'll break. I think the m4 style of quoting preceded any similar TeX, GNU or X Windows usage by quite a long time, and quite possibly led to those other usages. At least, m4 is the first place I ever saw that style of quoting used pervasively. Also, please don't "fix" programs like Perl or the shells, which don't use `quote' style, but rather `quote` style. So any fix like perl -pi.bak -e "s/\`/'/g;" file1 file2 ... is going to have extremely bad consequences in those programs. 
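As a rough illustration of the breakage described above, the following Python sketch applies the same blanket substitution to two made-up source lines, one shell and one m4. The sample lines and the function name are hypothetical.

```python
def naive_quote_fix(source):
    """Replace every backquote with an apostrophe, like s/`/'/g."""
    return source.replace("`", "'")


# Shell: backquotes mean command substitution, apostrophes mean a literal.
shell_line = "today=`date +%Y`"
# m4: by default a quoted string opens with ` and closes with '.
m4_line = "define(`greet', `hello')"

print(naive_quote_fix(shell_line))
print(naive_quote_fix(m4_line))
```

After the substitution the shell line assigns the literal text date +%Y instead of the command's output, and the m4 line no longer contains any opening quote that m4 recognizes by default.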
Frankly, I think you're going to run into a lot of people who feel as strongly about their quotes as you feel about newlines. That is, to paraphrase your newline article: While the POSIX world is in need of a new character encoding, it is definitely not in need of new quote semantics. The two are fully orthogonal issues, and the Unicode standard has nothing useful to offer for POSIX on the quote issue. Mind you, that's not my opinion, exactly. I'm considerably more easy going on the subject. But I think there will be others who are harder going, and I'm playing devil's advocate here. Standards aside, is there any *actual* use of grave and acute accents in a symmetrical `quote' fashion? Or is it merely notional? Surely under Unicode most real accents will be combining or composed characters. So why inflict a most unused symmetry condition on people who are using an actual symmetry? I don't think quoting standards is gonna cut it for the folks who feel strongly about that. You're likely to have a cultural war on your hands. Me, I'm neutral, but I'll be glad to trade with both sides... 
Larry - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 20-Dec-99 19:10:05-GMT,3411;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub3.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id OAA25710 for ; Mon, 20 Dec 1999 14:09:58 -0500 (EST) Received: from humbolt.nl.linux.org (humbolt.geo.uu.nl [131.211.28.48]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id OAA14415 for ; Mon, 20 Dec 1999 14:09:57 -0500 (EST) Received: by humbolt.nl.linux.org id ; Mon, 20 Dec 1999 20:08:17 +0100 Received: from sceaux.ilog.fr ([193.55.64.10]:36319 "EHLO sceaux.ilog.fr" smtp-auth: ) by humbolt.nl.linux.org with ESMTP id ; Mon, 20 Dec 1999 20:07:43 +0100 Received: from laposte.ilog.fr (laposte [172.17.1.6]) by sceaux.ilog.fr (8.9.3/8.9.3) with ESMTP id UAA29700; Mon, 20 Dec 1999 20:01:13 +0100 (MET) Received: from oberkampf.ilog.fr ([172.17.4.2]) by laposte.ilog.fr (8.9.3/8.9.3) with ESMTP id UAA10959; Mon, 20 Dec 1999 20:02:47 +0100 (MET) From: Bruno Haible Received: (from haible@localhost) by oberkampf.ilog.fr (8.9.2/8.9.2) id UAA03998; Mon, 20 Dec 1999 20:02:45 +0100 (MET) Date: Mon, 20 Dec 1999 20:02:45 +0100 (MET) Message-Id: <199912201902.UAA03998@oberkampf.ilog.fr> To: linux-utf8@nl.linux.org Cc: perl-unicode@perl.org, gnits@iro.umontreal.ca, rms@gnu.org Subject: Re: ASCII and Unicode Quotation Marks In-Reply-To: <199912191951.LAA16293@ferrule.cygnus.com> References: <199912191951.LAA16293@ferrule.cygnus.com> X-Orcpt: rfc822;linux-utf8@nl.linux.org Sender: owner-linux-utf8@nl.linux.org Precedence: bulk Reply-To: linux-utf8@nl.linux.org Tom Tromey writes: > Markus> If you use any software that writes `quote', please submit to > Markus> the author a patch and point her to the above URL for > Markus> background information. Thanks! > > This is standard practice for all GNU programs, including the output > of "makeinfo". 
I can't speak about 'makeinfo', but for the run-time messages of internationalized programs the following approach is possible:
- Use a .po file in UTF-8 format, and use U+2018/U+2019 as left and right quote delimiters.
- The GNU gettext library shall convert the messages from the .po file's encoding to the current locale's character set (e.g. ISO-8859-1 in many cases). This is already partially implemented.
- GNU gettext uses iconv. If the U+2018/U+2019 quotes cannot be represented in the current locale's character set, iconv can choose appropriate replacement characters (acute and grave accent or, as a last fallback, the vertical apostrophe). This will be easily implemented in the free iconv implementations (glibc-iconv, libiconv, freebsd-iconv). The non-free iconv implementations are of such low quality that they are unusable.
- Programmers must use vertical apostrophe or double quotes in the English messages in the C source. (The standard way to put Unicode characters into a wide char string, \unnnn, is not yet implemented by most compilers.)
- As a consequence, a message catalog for English must be introduced, in order to map "He said: 'Hello world!'" to "He said: \u2018Hello world!\u2019" Bruno - Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/lists/ 29-Dec-99 16:18:00-GMT,3021;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id LAA10990 for ; Wed, 29 Dec 1999 11:17:59 -0500 (EST) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id LAA28680 for ; Wed, 29 Dec 1999 11:17:58 -0500 (EST) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id IAA38410 ; Wed, 29 Dec 1999 08:13:51 -0800 Received: (from agent@localhost) by unicode.org (8.9.3/8.9.3) id IAA22384; Wed, 29 Dec 1999 08:06:30 -0800 (PST) Message-Id: <199912291606.IAA22384@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1 Content-Disposition: inline X-UML-Sequence: 11575 (1999-12-29 16:06:19 GMT) From: Doug Ewell To: "Unicode List" Date: Wed, 29 Dec 1999 08:06:17 -0800 (PST) Subject: Re: Latin ligatures and Unicode Content-Transfer-Encoding: 8bit X-MIME-Autoconverted: from quoted-printable to 8bit by mailhub2.cc.columbia.edu id LAA10990 Throughout this whole discussion of ligatures in Latin and the proposed (or at least bandied-about) ZERO-WIDTH LIGATOR characters, one issue keeps bothering me. I'm not sure how terrific it would be to allow (or require) a ZWL and ZWNL to achieve ligation, and what new problems this would create, but at the same time I'm uncomfortable with leaving this issue up to the font vendors, and I think Marco Cimarosti summed it up perfectly: > I would like to stress one point. If I am not totally wrong, Unicode > should be a standard to encode *plain text*. 
> AAT, OpenType, or any other font technology should not be considered > as *prerequisites* for displaying Unicode. > Or is any particular font technology now *required* by the Unicode > standard? > Or is it now "non conformant" to use bitmapped fonts? The idea that a particular font "technology" (where I use the word in its marketing sense that is closer to "vendor's product" than "set of capabilities") is necessary to render Unicode plain text properly is the first step toward having that vendor claim that its products are the only ones that support Unicode. Obviously, some advanced font capabilities *are* necessary to render all of Unicode properly. (Note, BTW, the use of plain text to indicate italics, obviating the need for a special Unicode character to indicate this markup.) For instance, you cannot render Arabic without choosing the contextually appropriate glyph. But this is a long way from saying "You need AAT" or "You need OpenType." The latter would send the message that Unicode support requires specific vendors' products, and we could be back where we started decades ago, with each vendor devising its own character encoding solutions. 
-Doug Ewell Fullerton, California 29-Dec-99 17:04:10-GMT,2736;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id MAA23776 for ; Wed, 29 Dec 1999 12:04:06 -0500 (EST) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id MAA08950 for ; Wed, 29 Dec 1999 12:04:05 -0500 (EST) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id IAA15946 ; Wed, 29 Dec 1999 08:57:42 -0800 Received: (from agent@localhost) by unicode.org (8.9.3/8.9.3) id IAA22969; Wed, 29 Dec 1999 08:50:26 -0800 (PST) Message-Id: <199912291650.IAA22969@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-UML-Sequence: 11578 (1999-12-29 16:50:15 GMT) From: John Jenkins To: "Unicode List" Date: Wed, 29 Dec 1999 08:50:14 -0800 (PST) Subject: Re: Latin ligatures and Unicode Content-Transfer-Encoding: 7bit on 12/29/99 5:11 AM, Marco.Cimarosti@icl.com at Marco.Cimarosti@icl.com wrote: > I would like to stress one point. If I am not totally wrong, Unicode should > be a standard to encode *plain text*. > AAT, OpenType, or any other font technology should not be considered as > *prerequisites* for displaying Unicode. > Or is any particular font technology now *required* by the Unicode standard? > Or is it now "non conformant" to use bitmapped fonts? > AAT, OpenType, or some equivalent technology is and always has been a prerequisite for displaying Unicode. The standard has been designed from the beginning with the assumption that an intelligent rendering engine is available which can implement the character-glyph model in some fashion and display N characters using M glyphs with rearrangement and reshaping along the way. 
Unicode has also made the assumption that out-of-band information is required to provide the full range of "proper" display required by users -- e.g., in Unihan where it's acknowledged that Japanese readers won't want to see characters written using Taiwanese glyphs. "Plain text" in Unicode means (theoretically) the minimal amount of information for legible display. In this sense, using bitmapped fonts is conformant if and only if the bitmap font technology can implement the character-glyph model and would be better off if some kind of outside markup were available to finesse the display and provide the not-plain-text information. ===== John H. Jenkins jenkins@apple.com tseng@blueneptune.com http://www.blueneptune.com/~tseng 30-Dec-99 0:27:17-GMT,5909;000000000001 Return-Path: Received: from watsun.cc.columbia.edu (watsun.cc.columbia.edu [128.59.39.2]) by mailhub1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id TAA04123 for ; Wed, 29 Dec 1999 19:27:16 -0500 (EST) Received: from public.lists.apple.com (public.lists.apple.com [17.254.0.151]) by watsun.cc.columbia.edu (8.8.5/8.8.5) with ESMTP id TAA20491 for ; Wed, 29 Dec 1999 19:27:16 -0500 (EST) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by public.lists.apple.com (8.9.1a/8.9.1) with ESMTP id QAA35958 ; Wed, 29 Dec 1999 16:25:42 -0800 Received: (from agent@localhost) by unicode.org (8.9.3/8.9.3) id QAA25814; Wed, 29 Dec 1999 16:20:30 -0800 (PST) Message-Id: <199912300020.QAA25814@unicode.org> Errors-To: root@unicode.org X-UML-Sequence: 11589 (1999-12-30 00:20:19 GMT) From: Kenneth Whistler To: "Unicode List" Cc: unicode@unicode.org, kenw@sybase.com Date: Wed, 29 Dec 1999 16:20:16 -0800 (PST) Subject: Re: Latin ligatures and Unicode John Jenkins replied to Marco Cimarosti: > on 12/29/99 5:11 AM, Marco.Cimarosti@icl.com at Marco.Cimarosti@icl.com > wrote: > > > I would like to stress one point. If I am not totally wrong, Unicode should > > be a standard to encode *plain text*. 
> > AAT, OpenType, or any other font technology should not be considered as > > *prerequisites* for displaying Unicode. > > Or is any particular font technology now *required* by the Unicode standard? > > Or is it now "non conformant" to use bitmapped fonts? > > > > AAT, OpenType, or some equivalent technology is and always has been a > prerequisite for displaying Unicode. The standard has been designed from > the beginning with the assumption that an intelligent rendering engine is > available which can implement the character-glyph model in some fashion and > display N characters using M glyphs with rearrangement and reshaping along > the way. I think John's response is a bit overblown. It is true that the designers of the Unicode Standard have always (meaning since 1988 at least) assumed the availability of an "intelligent rendering engine" as part of the text handling model for Unicode. But in so doing they were thinking, from the outset, about the issues of combining marks, ligatures and conjuncts, bidirectional text handling, and other complexities inherent to the full scope of written text. It was obvious from the start that no character-cell terminal with bitmaps was up to the general task, and that a several-layer abstraction between characters in a text backing store and dots in a display raster was going to be necessary to do justice to the general problem of rendering. BUT... conformance to the Unicode Standard does *not* mean that you have to implement a rendering engine that can handle Arabic, Khmer, *and* Mongolian to professional typesetting specifications. One could be implementing a Braille device driver that uses Unicode 3.0 Braille symbol character codes for transmission, and that does not use *any* font at all for rendering, for example. 
It is also conformant to make use of Unicode chart fonts, with fixed glyph shapes associated with fixed character codes -- as long as the process that is doing so is doing so intentionally and is not making bogus claims about correct visual layout of Arabic, for example. The production of the standard itself makes such *conformant* use of a chart font to enable the printing of the code charts. And since no Unicode implementation is forced to interpret all Unicode characters, it is perfectly possible to constrain one's interpreted repertoire to some fixed small set that *can* be implemented with a simple one-to-one character-to-glyph representation model. As long as the Unicode text content is rendered *legibly*, in accordance with the intended semantics of the characters, and is not "garbaged" by a misinterpretation of the intended values of the characters, that would have to be considered conformant. And finally, there are plenty of "backend" processing implementations of the Unicode Standard that have no rendering -- and that therefore do not have to worry about the complexities of visual display. > > Unicode has also made the assumption that out-of-band information is > required to provide the full range of "proper" display required by users -- > e.g., in Unihan where it's acknowledged that Japanese readers won't want to > see characters written using Taiwanese glyphs. > > "Plain text" in Unicode means (theoretically) the minimal amount of > information for legible display. > > In this sense, using bitmapped fonts is conformant if and only if the bitmap > font technology can implement the character-glyph model and would be better > off if some kind of outside markup were available to finesse the display and > provide the not-plain-text information. > I don't think this "if and only if" statement can hold for Unicode implementations in general. 
Bitmap fonts would be hard-pressed to deal with the minimal display requirements for many complex scripts, but it is not beyond the realm of engineering possibility to keep extending existing approaches. For complex scripts it just isn't worth the effort, basically, when better approaches using "smart" outline fonts exist. But in any case, the requirements for legible display of a given piece of well-formed Unicode text vary from script to script -- and not all require the same level of sophistication that Arabic or Mongolian do, for example. And Marco, you can put your mind at ease. You will search long and hard -- and in vain -- in the Unicode Standard, Version 3.0 for any formal conformance statement that would require an implementation to make use of a *particular* font technology -- or indeed, of any font technology at all -- in order to be conformant to the standard. --Ken 23-Mar-2000 18:48:24-GMT,2968;000000000001 Return-Path: Received: from mailrelay1.cc.columbia.edu (mailrelay1.cc.columbia.edu [128.59.35.143]) by uhaligani.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id NAA02998 for ; Thu, 23 Mar 2000 13:48:22 -0500 (EST) Received: from charybdis.zembu.com (charybdis.zembu.com [209.157.144.99]) by mailrelay1.cc.columbia.edu (8.9.3/8.9.3) with SMTP id NAA06415 for ; Thu, 23 Mar 2000 13:48:19 -0500 (EST) Received: (qmail 31489 invoked from network); 23 Mar 2000 18:48:12 -0000 Received: from scylla.zembu.com (HELO gumby.henkel-wallace.org) (209.157.144.98) by charybdis.zembu.com with SMTP; 23 Mar 2000 18:48:12 -0000 Message-Id: <4.3.1.2.20000323104006.04856270@pop.zembu.com> X-Sender: (Unverified) X-Mailer: QUALCOMM Windows Eudora Version 4.3.1 Date: Thu, 23 Mar 2000 10:47:24 -0800 To: Frank da Cruz From: "D.V. Henkel-Wallace" Subject: RE: DEC multilingual code page, ISO 8859-1, etc. 
Cc: unicode@unicode.org In-Reply-To: <200003230157.RAA19027@unicode.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; format=flowed At 17:55 22-03-00 -0800, Frank da Cruz wrote: > > On a different topic, the 125x's will or shortly will be standardized (as > > well as de facto) code pages for info interchange on the Internet. > >By "standardized" you probably mean registered as MIME charsets. That's not >quite the same as being standardized in the ISO or ANSI sense, back in the >days when care was taken to make sure that "information interchange" was not >compromised. The world of computing is not just the Web.[...] > > One can > > gripe about the situation, but it evolved along with the web and there's > > nothing one can do to perfect things at this point. > > >One can resist further ruination. (I assume that by now Unicoders not interested in this topic will have stopped following this thread, so this can be considered "on topic"). As a further salvo in the "text vs markup" skirmish on this list, I suggest you point your browser to the surprisingly open-minded article at http://www.tidbits.com/tb-issues/TidBITS-522.html#lnk3. The technical details should be familiar to most of us. But the key point in this article is that the battle of non-standard tags in the "browser wars" has not only flamed out, but is being trumped by -- get this -- _standards_. The drive to non-PC devices is causing a resurgence of interest in text, and in the separation of presentation from markup. I would have thought that computer scientists would get this intuitively, but this article, discussions on this list, and the marketplace all show that assumption to be naive. Markus' screed on the use of "ANSI" might appear to be off topic, but the word "Unicode" can end up being equally debased (and thus equally useless) if care isn't taken now. -d PS: apologies for all the bellicose metaphor in my message. 
2-Jul-2000 19:23:16-GMT,3375;000000000001 Return-Path: Received: from mailrelay1.cc.columbia.edu (mailrelay1.cc.columbia.edu [128.59.35.143]) by uhaligani.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id PAA23501 for ; Sun, 2 Jul 2000 15:23:15 -0400 (EDT) Received: from bz2.apple.com (bz2.apple.com [17.254.0.82]) by mailrelay1.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id PAA14534 for ; Sun, 2 Jul 2000 15:23:15 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by bz2.apple.com (8.9.3/8.9.3) with ESMTP id MAA09512; Sun, 2 Jul 2000 12:20:12 -0700 (PDT) Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id LAA26467; Sun, 2 Jul 2000 11:07:58 -0800 (GMT-0800) Message-Id: <200007021907.LAA26467@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" X-UML-Sequence: 14498 (2000-07-02 19:05:14 GMT) From: John Hudson To: "Unicode List" Cc: Unicode mailing list Date: Sun, 2 Jul 2000 11:05:11 -0800 (GMT-0800) Subject: Re: Should furigana be considered part of "plain text"? At 09:16 AM 7/2/00 -0800, Doug Ewell wrote: >The problem with the phrase "plain text ceases to be plain if you decide >that layout information needs to be encoded" is the word "layout." In >the broadest sense, line and paragraph separation could be considered >"layout," and nobody would suggest doing away with the plain-text >characters needed to control those functions. I think this is a fair comment, if one assumes so broad a sense of 'layout'. On the other hand, I wouldn't consider a paragraph break to be necessarily 'layout', since it is primarily a textual convention that can be represented in layout in a myriad of different ways: double spacing, indentation, pilcrows, etc.. Now, we have interpreted a paragraph break in a particular way in plain text code -- a hard break and a move to a new line, i.e. 
the behaviour of a typewriter 'return' key -- and have further muddied things by using this code to force layout by, for instance, entering two paragraph breaks to achieve this particular layout. Personally, I think a truly plain text paragraph break would have no particular layout behaviour associated with it; rather, it would indicate a textual break that would be interpreted by applications according to user defined layout preferences. In e-mail, it is handy to have paragraphs separated by a 'double return', especially when several correspondents are being quoted, but elsewhere I would prefer indented, single-spaced paragraphs. Since it is the same textual break that is being indicated, I don't think these two layout options should be differently encoded. I think equating a digital paragraph break with the return key on a manual typewriter is actually a failure to encode plain text. That said, I realise that this might be an extremist view, and I certainly don't expect anybody to change anything now. Although I have to add, as someone who has typeset books, that having to remove all the double returns in a document before I can properly control the paragraph breaks is almost as annoying as replacing multiple tabs or word spaces when these have been used to force layout in 'plain text'. Thank goodness for macros. 
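The kind of macro mentioned above can be sketched in a few lines of Python. This is only an illustration, not anything from the original posting: the function name `collapse_paragraph_breaks` is hypothetical, and the choice of U+2029 PARAGRAPH SEPARATOR as the single "textual break" is an assumption made for the example.

```python
import re

# Assumption for this sketch: one U+2029 marks each paragraph break.
PARAGRAPH_SEPARATOR = "\u2029"


def collapse_paragraph_breaks(text):
    """Turn every run of 'double returns' into a single paragraph break."""
    # Normalise CRLF and bare CR to LF so only one line end has to be handled.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # A line end followed by one or more empty (or whitespace-only) lines
    # is treated as one paragraph break; a single line end is left alone.
    return re.sub(r"\n(?:[ \t]*\n)+", PARAGRAPH_SEPARATOR, text)


if __name__ == "__main__":
    sample = "First paragraph.\r\n\r\nSecond paragraph.\n\n\nThird,\nstill third."
    print(repr(collapse_paragraph_breaks(sample)))
```

How the resulting break is laid out (blank line, indentation, pilcrow) is then left to whatever renders the text.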
John Hudson Tiro Typeworks Vancouver, BC http://www.tiro.com 4-Jul-2000 6:26:40-GMT,7120;000000000001 Return-Path: Received: from mailrelay2.cc.columbia.edu (mailrelay2.cc.columbia.edu [128.59.35.139]) by monire.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id CAA08854 for ; Tue, 4 Jul 2000 02:26:39 -0400 (EDT) Received: from bz2.apple.com (bz2.apple.com [17.254.0.82]) by mailrelay2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id CAA04705 for ; Tue, 4 Jul 2000 02:26:38 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by bz2.apple.com (8.9.3/8.9.3) with ESMTP id XAA08214; Mon, 3 Jul 2000 23:25:28 -0700 (PDT) Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id WAA01104; Mon, 3 Jul 2000 22:10:44 -0800 (GMT-0800) Message-Id: <200007040610.WAA01104@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii" ; format="flowed" X-UML-Sequence: 14512 (2000-07-04 06:07:15 GMT) From: Edward Cherlin To: "Unicode List" Cc: Unicode mailing list Date: Mon, 3 Jul 2000 22:07:13 -0800 (GMT-0800) Subject: Re: Should furigana be considered part of "plain text"? At 11:05 AM -0800 7/2/00, John Hudson wrote: >At 09:16 AM 7/2/00 -0800, Doug Ewell wrote: > > >The problem with the phrase "plain text ceases to be plain if you decide > >that layout information needs to be encoded" is the word "layout." In > >the broadest sense, line and paragraph separation could be considered > >"layout," and nobody would suggest doing away with the plain-text > >characters needed to control those functions. The problem with the phrase "plain text" is that it is a polite fiction. ASCII characters, printing and non-printing, originated as commands to printers. What we originally called plain text files are those that would give reasonable results when printed on an ASCII teleprinter used as a terminal. 
The mechanical functions of Teletypes defined the original semantics of the control characters used in text files, and since carried over to screen and laser printer output--

CR  Carriage Return  Move printing point to beginning of current line.
LF  Line Feed        Move printing point down one line.
BS  Back Space       Move printing point one space left, unless at left limit.
HT  Horizontal Tab   Move printing point right to next tab stop, unless at right limit.
FF  Form Feed        Move printing point to top of next page.

and are the reason why many of us call CR-LF either a line or paragraph break today. Explicit line breaks were, of course, essential on the original devices. Both CR and BS were routinely used for overstriking. The semantics of these and other ASCII control characters have been changing with technology. *Some* computer system designers, noticing that the demands of printing terminals were not requirements on system file internals, chose to use either CR alone or LF alone for line or paragraph ends, all without coordination. Line breaks in files became optional on systems that provided word wrap on output or display. Users were given options for setting tab stops, margins, and page lengths. Character 7F, DEL, originally meant "not a character; deleted" on punched paper tape, but began turning into destructive backspace even before tape died. ESC has undoubtedly mutated the most. The use of 1A SUB for end of file in several operating systems including PCDOS is a violation of the ASCII standard, which provides both 03 ETX (End of Text) and 04 EOT (End of Transmission), but who cared? There are now numerous incompatible formats bearing the name "plain text". Some are distinguished by the choice of line end string. In some cases, line ends are required, especially if there is a maximum line length. Lines of unlimited length may represent paragraphs or database records. Character sets other than ASCII may be used, especially 8859-1 or Windows code page 1252. 
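To make the "choice of line end string" concrete, here is a small Python sketch that tallies which conventions a piece of text uses. The function name `count_line_ends` is hypothetical, and only the three ASCII-era conventions discussed above (CRLF, bare CR, bare LF) are considered.

```python
import re
from collections import Counter

# CRLF is listed first so that a CR followed by LF counts once, as a pair.
LINE_END = re.compile(r"\r\n|\r|\n")
LABELS = {"\r\n": "CRLF", "\r": "CR", "\n": "LF"}


def count_line_ends(text):
    """Return a Counter of the line-end conventions found in text."""
    return Counter(LABELS[match.group()] for match in LINE_END.finditer(text))


if __name__ == "__main__":
    mixed = "dos line\r\nunix line\nold mac line\rlast dos line\r\n"
    for label, count in sorted(count_line_ends(mixed).items()):
        print(label, count)
```

A file in which more than one label turns up has probably passed through several systems without any conversion along the way.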
These days, people want to be able to use any coded character set and still call it plain text. In fact, people want to introduce all kinds of markup, including furigana/rubi, language tags, ligature marking, and even character set shift sequences (not just the poky SI and SO), and still call the result plain text. >I think this is a fair comment, if one assumes so broad a sense of >'layout'. On the other hand, I wouldn't consider a paragraph break to be >necessarily 'layout', since it is primarily a textual convention that can >be represented in layout in a myriad of different ways: double spacing, >indentation, pilcrows, etc.. Now, we have interpreted a paragraph break in >a particular way in plain text code -- a hard break and a move to a new >line, i.e. the behaviour of a typewriter 'return' key -- by way of the Teletype >and have further >muddied things by using this code to force layout by, for instance, >entering two paragraph breaks > >to achieve this particular layout. The use of tabs, spaces, CR, and LF to lay out "plain text" is necessary in mail and news, and a total pain in documents that will need to be converted to anything else. >Personally, I think a truly plain text paragraph break would have no >particular layout behaviour associated with it; rather, it would indicate a >textual break that would be interpreted by applications according to user >defined layout preferences. In e-mail, it is handy to have paragraphs >separated by a 'double return', especially when several correspondents are >being quoted, but elsewhere I would prefer indented, single-spaced >paragraphs. Since it is the same textual break that is being indicated, I >don't think these two layout options should be differently encoded. I think >equating a digital paragraph break with the return key on a manual >typewriter is actually a failure to encode plain text. It is too late for such simple solutions. 
If we want to have a standard for plain text, we have to provide for each of the common usages. We have tried to start such a project twice on this list, and have failed utterly both times. >That said, I realise that this might be an extremist view, and I certainly >don't expect anybody to change anything now. Although I have to add, as >someone who has typeset books, that having to remove all the double returns >in a document before I can properly control the paragraph breaks is almost >as annoying as replacing multiple tabs or word spaces when these have been >used to force layout in 'plain text'. Thank goodness for macros. Hear, hear. I have wasted a remarkable amount of time over the years on reformatting Word documents into FrameMaker. The "pain text" [sic] markup habits of engineers are responsible for most of the work in those conversions. Thank goodness for book-wide search and replace in FM 6. >John Hudson > >Tiro Typeworks >Vancouver, BC >http://www.tiro.com Edward Cherlin, Spamfighter "It isn't what you don't know that hurts you, it's what you know that ain't so."--Mark Twain, or else some other prominent 19th century humorist and wit 6-Jul-2000 17:38:05-GMT,2882;000000000001 Return-Path: Received: from mailrelay2.cc.columbia.edu (mailrelay2.cc.columbia.edu [128.59.35.139]) by fozimane.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id NAA14701 for ; Thu, 6 Jul 2000 13:38:04 -0400 (EDT) Received: from relay7.apple.com (bz1.apple.com [17.254.0.81]) by mailrelay2.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id NAA25607 for ; Thu, 6 Jul 2000 13:38:03 -0400 (EDT) Received: from unicode.org (unicode2.apple.com [17.254.3.212]) by relay7.apple.com (8.9.3/8.9.3) with ESMTP id KAA01592; Thu, 6 Jul 2000 10:37:21 -0700 (PDT) Received: (from daemon@localhost) by unicode.org (8.9.3/8.9.3) id JAA13311; Thu, 6 Jul 2000 09:25:24 -0800 (GMT-0800) Message-Id: <200007061725.JAA13311@unicode.org> Errors-To: root@unicode.org Mime-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII 
X-UML-Sequence: 14562 (2000-07-06 17:21:42 GMT) From: John Cowan To: "Unicode List" Cc: Unicode List Date: Thu, 6 Jul 2000 09:21:38 -0800 (GMT-0800) Subject: Re: Control characters On Wed, 5 Jul 2000, john wrote: > > IIRC, the Model 37 Teletype interpreted 0A as a newline function, > > Also models 33 and 38, which also interpreted x0D as carriage return. Definitely not true of the model 33; it interpreted 0A as a line-feed, and if it received one not preceded by 0D it would do this. (Hopefully, you are all reading this email with a fixed-width font as God intended.) > > so ASCII allowed 0A to be interpreted as either LF or NL. > > That's non sequitur, but folks are like that. How so? The LF behavior is different from the NL behavior. > > DEC OSes notoriously distorted or misused the control characters, thus > > ^U = NAK was used to kill an input line instead of ^X = cancel. > > Since some of these editing commands were actually > merely echoed back from the main processor to the comm control > unit through which the terminal was connected, Definitely not true of any DEC OS; control characters were echoed as ^A, ^B, etc. > there was some > fogging over of the concepts of source and destination. The comm > controller would buffer up what was typed until it got a CR (0x0D) > and so these editing controls were actually commands to that comm > controller to clear its buffer. Again, not true of any DEC OS; characters were interpreted one by one and selectively echoed by the CPU only. There were no buffering serial-line controllers for the PDP-8, and they weren't introduced for the PDP-11 until later -- and even then, the typical mode was to stop buffering on *any* control character. 
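The difference between the LF and NL behaviours can be simulated in Python. This is a rough sketch under stated assumptions, not a model of any particular Teletype: the name `teleprinter` and the 72-column default are made up for the example, and overstriking is simplified to replacing the earlier character.

```python
def teleprinter(data, width=72):
    """Render data on an imaginary printer where CR and LF are separate motions."""
    page = [[]]          # one list of characters per printed line
    row, col = 0, 0      # current position of the printing point
    for ch in data:
        if ch == "\r":                    # CR: back to the left margin, same line
            col = 0
        elif ch == "\n":                  # LF: down one line, same column
            row += 1
            if row == len(page):
                page.append([])
        elif ch == "\b":                  # BS: one space left, unless at left limit
            col = max(col - 1, 0)
        else:
            line = page[row]
            while len(line) <= col:       # pad with blanks up to the printing point
                line.append(" ")
            line[col] = ch                # simplification: replace, don't overprint
            col = min(col + 1, width - 1)
    return ["".join(line).rstrip() for line in page]


if __name__ == "__main__":
    print("\n".join(teleprinter("LF only\nmakes a\nstaircase")))
    print("\n".join(teleprinter("CR LF\r\nbehaves\r\nas expected")))
```

With LF alone, each line starts where the previous one stopped, which is presumably the staircase effect the message refers to.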
-- John Cowan cowan@ccil.org "You need a change: try Canada" "You need a change: try China" --fortune cookies opened by a couple that I know 26-Nov-2001 20:41:25-GMT,3221;000000000005 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by kachifo.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id BAA15597 for ; Mon, 26 Nov 2001 01:24:21 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id AAA21504; Mon, 26 Nov 2001 00:37:19 -0500 Received: with LISTAR (v1.0.0; list unicode); Mon, 26 Nov 2001 00:37:19 -0500 (EST) Received: from mailserv2.iuinc.com (mailserv2.iuinc.com [206.245.164.55]) by unicode.org (8.9.3/8.9.3) with SMTP id AAA21498 for ; Mon, 26 Nov 2001 00:37:18 -0500 Received: (qmail 10978 invoked from network); 26 Nov 2001 05:43:03 -0000 Received: from unknown (HELO suse) (210.81.148.125) by mailserv2.iuinc.com with SMTP; 26 Nov 2001 05:43:03 -0000 Date: Mon, 26 Nov 2001 14:46:12 +0900 (JST) From: Gaspar Sinai X-X-Sender: gsinai@suse.blue-edge-tech.com To: Philipp Reichmuth cc: Arjun Aggarwal , Subject: Re: The real solution In-Reply-To: <725176373.20011125222014@web.de> Message-ID: MIME-Version: 1.0 Content-Type: TEXT/PLAIN; charset=US-ASCII X-archive-position: 337 X-listar-version: Listar v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: gsinai@yudit.org Precedence: bulk List-help: List-unsubscribe: List-software: Listar version 1.0.0 X-List-ID: X-list: unicode Sorry, I could not resist: On Sun, 25 Nov 2001, Philipp Reichmuth wrote: [..] > > Oh, the difference is probably that from this category of pages, you > can cut&paste into Word without garbling up your data because it uses > a *standard* encoding as opposed to the complete chaos of Hindi web > pages using their own fonts. Does that count as justification for > Unicode? [..] Please do not refer to MS Word. 
If the Unicode Consortium had not listened to the industrial push of Microsoft to support their existing and broken standard, we would have a much better and cleaner character standard now. I don't think a character standard should go into the arena of typesetting to the extent Unicode does. I think it should provide a clean and easy character standard with presentation forms that can be unambiguously put on a character based text terminal with no fancy typesetting features. Then if you want to typeset, go to the next level. As for cut & paste, it might work among Microsoft Apps but if one wants to interface an app with a disclosed clipboard format he will realize that he cannot paste unicode text that contains '\u0000' characters. Impossible. And how about UCS-4 ? Forget it. As a text format it is not even existent. I think it would be much better to look for another benchmark engine. If I were the Unicode Consortium I would build one. Just to prove that the standard works. Wait... maybe it does not? Thanks for your attention, I am really bad I know :) cheers gaspar 4-Dec-2001 6:12:31-GMT,4124;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by kachifo.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id BAA15341 for ; Tue, 4 Dec 2001 01:12:31 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id XAA13805; Mon, 3 Dec 2001 23:15:05 -0500 Received: with LISTAR (v1.0.0; list unicode); Mon, 03 Dec 2001 23:15:05 -0500 (EST) Received: from imo-r02.mx.aol.com (imo-r02.mx.aol.com [152.163.225.98]) by unicode.org (8.9.3/8.9.3) with ESMTP id XAA13793 for ; Mon, 3 Dec 2001 23:15:04 -0500 From: DougEwell2@cs.com Received: from DougEwell2@cs.com by imo-r02.mx.aol.com (mail_out_v31_r1.9.) 
id 4.ac.1edd2ddd (3313) for ; Tue, 4 Dec 2001 00:30:54 -0500 (EST) Message-ID: Date: Tue, 4 Dec 2001 00:30:54 EST Subject: Unicode 1.0 names for control characters To: unicode@unicode.org MIME-Version: 1.0 Content-Type: text/plain; charset="US-ASCII" Content-Transfer-Encoding: 7bit X-Mailer: CompuServe 2000 32-bit sub 113 X-archive-position: 457 X-listar-version: Listar v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: DougEwell2@cs.com Precedence: bulk List-help: List-unsubscribe: List-software: Listar version 1.0.0 X-List-ID: X-list: unicode I am surprised and puzzled by the "Unicode 1.0 Name" changes for some of the ASCII and Latin-1 control characters that were introduced in the latest beta version of the Unicode 3.2 data file (UnicodeData-3.2.0d5.txt):

U+0009 HORIZONTAL TABULATION ==> CHARACTER TABULATION
U+000B VERTICAL TABULATION ==> LINE TABULATION
U+001C FILE SEPARATOR ==> INFORMATION SEPARATOR FOUR
U+001D GROUP SEPARATOR ==> INFORMATION SEPARATOR THREE
U+001E RECORD SEPARATOR ==> INFORMATION SEPARATOR TWO
U+001F UNIT SEPARATOR ==> INFORMATION SEPARATOR ONE
U+008B PARTIAL LINE DOWN ==> PARTIAL LINE FORWARD
U+008C PARTIAL LINE UP ==> PARTIAL LINE BACKWARD

Were these "new" names (e.g. CHARACTER TABULATION) really the original Unicode 1.0 names? I don't have my 1.0 book close at hand, but I know that they were *not* the names used in 1.1, according to the file "namesall.lst" from that version. (Aha, didn't think anyone still had that dusty old thing lying around?) IMHO, the new names CHARACTER TABULATION and LINE TABULATION are much less intuitive than HORIZONTAL TABULATION and VERTICAL TABULATION. Sometimes you even see the abbreviations HT and VT for these two characters. The new names appear to have been invented by someone who imagined a lack of clarity in the old names. I have seen the names IS4, IS3, IS2, and IS1 before, but they do not convey the same information as FS, GS, RS, and US. 
The latter names are more specific. The "old" names for these six control characters were used as far back as the original 1963 version of ASCII, according to Mackenzie (pp. 245-247). I don't know about the history of U+008B and U+008C, but again it seems strange that the "Unicode 1.0 name" for these characters is being changed at this late date. I know this 1.0 name field is not subject to the same rule of "no changes, ever" that applies to the regular Character Name field, but why should these names be changed at all? On this same topic, parenthesized abbreviations have been added to the 1.0 names for U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), and U+0085 NEXT LINE (NEL). Does the addition of these abbreviations mean that they are now part of the official 1.0 name, and if so, why? Other characters typically don't have abbreviations as part of their names, even if they are as meaningful and as commonly used as these, and again it is a change from the 1.0 name we have seen for a decade. Perhaps I've been checking the beta files a bit TOO carefully. 
-Doug Ewell Fullerton, California 4-Dec-2001 11:10:39-GMT,5908;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by kachifo.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id GAA20109 for ; Tue, 4 Dec 2001 06:10:39 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id EAA02011; Tue, 4 Dec 2001 04:16:04 -0500 Received: with LISTAR (v1.0.0; list unicode); Tue, 04 Dec 2001 04:16:04 -0500 (EST) Received: from pheidippides.md.chalmers.se (pheidippides.md.chalmers.se [129.16.237.91]) by unicode.org (8.9.3/8.9.3) with ESMTP id EAA02005 for ; Tue, 4 Dec 2001 04:16:03 -0500 Received: from chalmers95a69n (dhcp226-180.cs.chalmers.se [129.16.226.180]) by pheidippides.md.chalmers.se (8.10.1/8.10.1) with SMTP id fB4AVpH09059; Tue, 4 Dec 2001 11:31:52 +0100 (MET) From: "Kent Karlsson" To: , Subject: RE: Unicode 1.0 names for control characters Date: Tue, 4 Dec 2001 11:31:05 +0100 Message-ID: <000701c17cae$c817c800$b4e21081@chalmers95a69n> MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 (Normal) X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook CWS, Build 9.0.2416 (9.0.2910.0) In-Reply-To: X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700 Importance: Normal X-archive-position: 460 X-listar-version: Listar v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: kentk@md.chalmers.se Precedence: bulk List-help: List-unsubscribe: List-software: Listar version 1.0.0 X-List-ID: X-list: unicode None of the C0 and in particular C1 names are really from Unicode 1.0. They are from ISO/IEC 6429. Now from ISO/IEC 6429:1992 (Third edition), rather than the second edition. Technically the same standard is available as Fifth edition of ECMA-48, 1991; ftp://ftp.ecma.ch/ecma-st/Ecma-048.pdf. The earlier editions are as far as I can see no longer available. 
Most of the name changes (in 1991!) were apparently done in an attempt to internationalise that (those) standard(s). "HORIZONTAL TABULATION" would be vertical if the lines are vertical, etc. So these name changes DO add clarity. The old abbreviations were retained, however. The mismatch between ISO/IEC 6429:1992 and Unicode 2.x "names" for these has caused a bit of a stir when someone was translating the names. For the ISn-s it appears that a generalisation was desired (I don't know if this was new in the 1991/1992 editions), with US, RS, GS, FS being suitable only if such a hierarchy was used. My reading is that the ISn-s are not necessarily hierarchical. Kind regards /kent k > -----Original Message----- > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On > Behalf Of DougEwell2@cs.com > Sent: den 4 december 2001 06:31 > To: unicode@unicode.org > Subject: Unicode 1.0 names for control characters > > > I am surprised and puzzled by the "Unicode 1.0 Name" changes > for some of the > ASCII and Latin-1 control characters that were introduced in > the latest beta > version of the Unicode 3.2 data file (UnicodeData-3.2.0d5.txt): > > U+0009 HORIZONTAL TABULATION ==> CHARACTER TABULATION > U+000B VERTICAL TABULATION ==> LINE TABULATION > U+001C FILE SEPARATOR ==> INFORMATION SEPARATOR FOUR > U+001D GROUP SEPARATOR ==> INFORMATION SEPARATOR THREE > U+001E RECORD SEPARATOR ==> INFORMATION SEPARATOR TWO > U+001F UNIT SEPARATOR ==> INFORMATION SEPARATOR ONE > U+008B PARTIAL LINE DOWN ==> PARTIAL LINE FORWARD > U+008C PARTIAL LINE UP ==> PARTIAL LINE BACKWARD > > Were these "new" names (e.g. CHARACTER TABULATION) really the > original > Unicode 1.0 names? I don't have my 1.0 book close at hand, > but I know that > they were *not* the names used in 1.1, according to the file > "namesall.lst" > from that version. (Aha, didn't think anyone still had that > dusty old thing > lying around?) 
> > IMHO, the new names CHARACTER TABULATION and LINE TABULATION > are much less > intuitive than HORIZONTAL TABULATION and VERTICAL TABULATION. > Sometimes you > even see the abbrevations HT and VT for these two characters. > The new names > appear to have been invented by someone who imagined a lack > of clarity in the > old names. > > I have seen the names IS4, IS3, IS2, and IS1 before, but they > do not convey > the same information as FS, GS, RS, and US. The latter names > are more > specific. > > The "old" names for these six control characters were used as > far back as the > original 1963 version of ASCII, according to Mackenzie (pp. 245-247). > > I don't know about the history of U+008B and U+008C, but > again it seems > strange that the "Unicode 1.0 name" for these characters is > being changed at > this late date. > > I know this 1.0 name field is not subject to the same rule of > "no changes, > ever" that applies to the regular Character Name field, but > why should these > names be changed at all? > > On this same topic, parenthesized abbreviations have been > added to the 1.0 > names for U+000A LIFE FEED (LF), U+000C FORM FEED (FF), > U+000D CARRIAGE > RETURN (CR), and U+0085 NEXT LINE (NEL). Does the addition of these > abbreviations mean that they are now part of the official 1.0 > name, and if > so, why? Other characters typically don't have abbreviations > as part of > their names, even if they are as meaningful and as commonly > used as these, > and again it is a change from the 1.0 name we have seen for a decade. > > Perhaps I've been checking the beta files a bit TOO carefully. 
> > -Doug Ewell > Fullerton, California > 4-Dec-2001 22:16:57-GMT,6684;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by kachifo.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id RAA10161 for ; Tue, 4 Dec 2001 17:16:56 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id PAA05409; Tue, 4 Dec 2001 15:17:38 -0500 Received: with LISTAR (v1.0.0; list unicode); Tue, 04 Dec 2001 15:17:38 -0500 (EST) Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by unicode.org (8.9.3/8.9.3) with ESMTP id PAA05403 for ; Tue, 4 Dec 2001 15:17:38 -0500 Received: from smtp2.sybase.com (sybgate2.sybase.com [130.214.69.6]) by inergen.sybase.com with ESMTP id NAA15282; Tue, 4 Dec 2001 13:32:50 -0800 (PST) Received: from olympus.sybase.com (localhost [127.0.0.1]) by smtp2.sybase.com with ESMTP id NAA11169; Tue, 4 Dec 2001 13:33:03 -0800 (PST) Received: from birdie.sybase.com by olympus.sybase.com (8.8.8+Sun/SMI-SVR4/SybEC3.5) id NAA03115; Tue, 4 Dec 2001 13:33:01 -0800 (PST) Received: (from kenw@localhost) by birdie.sybase.com (8.8.8+Sun/8.8.8) id NAA26690; Tue, 4 Dec 2001 13:33:01 -0800 (PST) Date: Tue, 4 Dec 2001 13:33:01 -0800 (PST) From: Kenneth Whistler Message-Id: <200112042133.NAA26690@birdie.sybase.com> To: DougEwell2@cs.com Subject: Re: Unicode 1.0 names for control characters Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII X-archive-position: 466 X-listar-version: Listar v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: kenw@sybase.com Precedence: bulk List-help: List-unsubscribe: List-software: Listar version 1.0.0 X-List-ID: X-list: unicode Doug wrote: > I am surprised and puzzled by the "Unicode 1.0 Name" changes for some of the > ASCII and Latin-1 control characters that were introduced in the latest beta > version of the Unicode 3.2 data file (UnicodeData-3.2.0d5.txt): > > U+0009 
HORIZONTAL TABULATION ==> CHARACTER TABULATION > U+000B VERTICAL TABULATION ==> LINE TABULATION > U+001C FILE SEPARATOR ==> INFORMATION SEPARATOR FOUR > U+001D GROUP SEPARATOR ==> INFORMATION SEPARATOR THREE > U+001E RECORD SEPARATOR ==> INFORMATION SEPARATOR TWO > U+001F UNIT SEPARATOR ==> INFORMATION SEPARATOR ONE > U+008B PARTIAL LINE DOWN ==> PARTIAL LINE FORWARD > U+008C PARTIAL LINE UP ==> PARTIAL LINE BACKWARD Well, *someone* is clearly paying close attention! And the editors haven't even officially announced the Unicode 3.2 beta period yet. > > Were these "new" names (e.g. CHARACTER TABULATION) really the original > Unicode 1.0 names? No, they were not. The older names were the Unicode 1.0 names for U+0009, U+000B, U+001C..U+001F. Unicode 1.0 didn't have *any* names for C1 control codes. The official UTC doctrine now is that C0/C1 control characters do not have Unicode names, formally. But what we do is print ISO 6429 control function names as aliases in the names list for the charts. (This was an official decision by the UTC for Unicode 3.0, so is not just an editorial whim.) The mechanism that the names list generation tool currently uses for that is to print "<control>" in the name area and to grab the Unicode 1.0 name field for the alias (if one exists). This is special-cased code just for the control characters. The simplest fix for updating the ISO 6429 names to match the actual, current 6429 standard, was simply to update the Unicode 1.0 name field for the 8 instances you cite above. Incidentally, in case you are worried about historic accuracy here, the "Unicode 1.0 name" field was already fully suborned for the Unicode 3.0 publication, since the ISO 6429 C1 function names were inserted into that field, even though Unicode 1.0 had *no* names at all for C1 control characters. > IMHO, the new names CHARACTER TABULATION and LINE TABULATION are much less > intuitive than HORIZONTAL TABULATION and VERTICAL TABULATION. 
Sometimes you > even see the abbrevations HT and VT for these two characters. The new names > appear to have been invented by someone who imagined a lack of clarity in the > old names. Kent explained the standards rationale for updating these. It is a matter of actually using the names from the published version of the standard we are nominally referring to. Incidentally, take a look also at NamesList-3.2.0d3.txt in the same BETA directory. It shows that all the older C0 names have been retained as further aliases, since they are actually more familiar to most people, as you are pointing out. > > The "old" names for these six control characters were used as far back as the > original 1963 version of ASCII, according to Mackenzie (pp. 245-247). Yep. Venerable names. Honored names. Useful names. > > I know this 1.0 name field is not subject to the same rule of "no changes, > ever" that applies to the regular Character Name field, but why should these > names be changed at all? Aliases, actually, from the Unicode point of view, not formal names. And Kent explained why update the aliases. > > On this same topic, parenthesized abbreviations have been added to the 1.0 > names for U+000A LIFE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE > RETURN (CR), and U+0085 NEXT LINE (NEL). Does the addition of these > abbreviations mean that they are now part of the official 1.0 name, Nope. > and if > so, why? Other characters typically don't have abbreviations as part of > their names, even if they are as meaningful and as commonly used as these, > and again it is a change from the 1.0 name we have seen for a decade. Off and on, I work at a project to backrev from UnicodeData-1.1.5.txt to produce a Unicode 1.0 version of UnicodeData.txt, as it would have been defined if such a data file had been defined at the time. (It wasn't.) If I get around to posting that, then people can use the Unicode name field itself as the documentation of what the Unicode 1.0 name was! 
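For readers who want to look at the field in question, the sketch below pulls the name and the Unicode 1.0 name out of one UnicodeData.txt record. The record format is the standard semicolon-separated one with fifteen fields; the helper name `parse_unicode_data_line` is hypothetical, and the sample line is only an illustration of what a control-character record looks like once its alias has been updated, not a line copied from the beta file under discussion.

```python
# Illustrative record for U+0009; field 1 is the "<control>" placeholder and
# field 10 is the Unicode 1.0 name field used to carry the ISO 6429 alias.
SAMPLE_RECORD = "0009;<control>;Cc;0;S;;;;;N;CHARACTER TABULATION;;;;"


def parse_unicode_data_line(line):
    """Split one UnicodeData.txt record and return a few of its fields."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 semicolon-separated fields")
    return {
        "code_point": int(fields[0], 16),
        "name": fields[1],
        "general_category": fields[2],
        "unicode_1_name": fields[10],
    }


if __name__ == "__main__":
    record = parse_unicode_data_line(SAMPLE_RECORD)
    print("U+%04X" % record["code_point"], record["name"], record["unicode_1_name"])
```

Running the same function over an older and a newer data file and comparing the `unicode_1_name` values is one way to find which aliases changed between versions.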
In the meantime, if you want the old time religion for the Unicode 1.0 names, you can extract them from UnicodeData-2.0.14.txt (the version officially released with Unicode 2.0), before the field was repurposed for the Unicode 3.0 publication. > > Perhaps I've been checking the beta files a bit TOO carefully. I suppose we should add a note to UnicodeData.html, clarifying the special status of the Unicode 1.0 name field for the control characters. --Ken > > -Doug Ewell > Fullerton, California > > 20-May-2002 11:44:37-GMT,2846;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by dewberry.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id MAA05325 for ; Mon, 20 May 2002 12:44:37 -0400 (EDT) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id MAA09865; Mon, 20 May 2002 12:04:28 -0400 Received: with ECARTIS (v1.0.0; list unicode); Mon, 20 May 2002 12:04:28 -0400 (EDT) Received: from mg01.austin.ibm.com (mg01.austin.ibm.com [192.35.232.18]) by unicode.org (8.9.3/8.9.3) with ESMTP id MAA09859 for ; Mon, 20 May 2002 12:04:27 -0400 Received: from austin.ibm.com (netmail.austin.ibm.com [9.3.7.137]) by mg01.austin.ibm.com (AIX4.3/8.9.3/8.9.3) with ESMTP id LAA16450 for ; Mon, 20 May 2002 11:05:41 -0500 Received: from popmail.austin.ibm.com (popmail.austin.ibm.com [9.53.247.178]) by austin.ibm.com (AIX4.3/8.9.3/8.9.3) with ESMTP id LAA46008 for ; Mon, 20 May 2002 11:06:14 -0500 Received: from jtcsv.com (markus2000.sanjose.ibm.com [9.43.222.33]) by popmail.austin.ibm.com (AIX4.3/8.9.3/8.7-client1.01) with ESMTP id LAA23698 for ; Mon, 20 May 2002 11:06:12 -0500 Message-ID: <3CE91F79.4000503@jtcsv.com> Date: Mon, 20 May 2002 09:08:25 -0700 From: Markus Scherer Organization: IBM User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:0.9.4) Gecko/20011019 Netscape6/6.2 X-Accept-Language: en,de,eo MIME-Version: 1.0 To: unicode Subject: Re: Encoding of symbols, and a "lock"/"unlock" 
pre-proposal References: <000e01c1fec5$367f2840$fa424244@anhmca.adelphia.net> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-archive-position: 3240 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: markus.scherer@jtcsv.com Precedence: bulk List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode Personally, I find it counter-productive to add a hodge-podge of dingbats and miscellaneous symbols to Unicode, or any coded character set. They had practical uses when user interfaces and display systems could not handle icons and arbitrary images, but those times are long over. Witness the demise of the DOS codepages with block graphics when graphical UIs became available. In my personal opinion, I find that the inclusion of such symbols diminishes the credibility of Unicode as a standard and of the UTC as following reasonable principles and guidelines. markus 2-Jul-2002 9:15:30-GMT,4825;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by marionberry.cc.columbia.edu (8.9.3/8.9.3) with ESMTP id KAA17490 for ; Tue, 2 Jul 2002 10:15:30 -0400 (EDT) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id HAA20142; Tue, 2 Jul 2002 07:13:41 -0400 Received: with ECARTIS (v1.0.0; list unicode); Tue, 02 Jul 2002 07:13:40 -0400 (EDT) Received: from BOBCAT.borware.com (bobcat.borware.com [213.88.207.165]) by unicode.org (8.9.3/8.9.3) with ESMTP id HAA20132 for ; Tue, 2 Jul 2002 07:13:31 -0400 Received: by BOBCAT.borware.com with Internet Mail Service (5.5.2655.55) id ; Tue, 2 Jul 2002 15:39:02 +0200 Message-ID: From: Michael Jansson To: "'unicode@unicode.org'" Subject: Can browsers show text? I don't think so! 
Date: Tue, 2 Jul 2002 15:39:01 +0200 MIME-Version: 1.0 X-Mailer: Internet Mail Service (5.5.2655.55) Content-Type: text/plain; charset="iso-8859-1" X-archive-position: 638 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: mjan@em2-solutions.com Precedence: bulk List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode Postings on this list have recently touched on the topic of using various languages in web pages. Comments have been made on the use of embedded fonts (eot and pfr), as well as the lack of support for these font formats in popular browsers. This is a topic which I am very enthusiastic about, so I cannot help but add a few comments myself. Let me start by posing a question: "Can modern browsers show text?" Specifically, can they show text of any language and formatting on all platforms? I have to say: no, they cannot (possibly with the exception of the browser Nophus). The problem with browsers today is that although they may support Unicode encoding schemes (e.g. UTF-8), they typically rely on the platform/OS they run on to show text. A platform without complete Unicode 3.x support will thus not be able to show text correctly. For example, IE6 (or any other modern browser) supports UTF-8 but Win98 does not support Unicode 3.x. IE6 is thus not able to show Unicode text on Win98. You may of course be able to show some Unicode text on some platforms. This is far from claiming that a browser supports Unicode, though. At most, you may claim that a browser on a particular platform supports some part of Unicode. Furthermore, even if a browser knew how to render text (e.g. knew about the nitty-gritty details of glyph ordering, positioning and shaping that are language specific), you need something called a font to show text. Fonts can be provided as web resources through CSS 2, through a construct known as @font-face rules. 
However, no browser today fully supports CSS 2, and @font-face rules in particular. There are browsers that support @font-face on some platforms (e.g. for eot files on Windows). Again, this is far from claiming that a browser supports fonts on the web. Modern browsers know how to show the characters 'A'-'Z' and a few other characters as long as you don't expect to format the text with a specific font. You will get into trouble as soon as you want to use a font or characters from other languages. You may find a solution for some languages and some fonts on some platforms. Yet again, this is far from claiming that modern browsers can show text. (I do not consider it a viable solution to have to download a 10MB+ language package to see a page in a foreign language.) So what we have today are applications called "web browsers" that are very good at showing images and animations. They are not very good at showing text, other than unformatted English text. Fortunately, there are third-party solutions that work around some of the problems I mention above: Bitstream's "FontPlayer" (for pfr fonts in IE 5.x and Nav 4.x on Windows), MS Typography's WEFT tools (for eot fonts in IE 5.x on Windows), and our own FAIRY server solution (for eot fonts and language support in IE 5.x, Nav 4.x, Nav 6.x and Opera 5.x on Mac and Win). I do admire the work that people have done in creating quite outstanding web browsers through the years, sometimes with no other reward than people's appreciation. I only wish that time were spent on supporting text, and not just flashy content. 
Regards, em2 Solutions Michael Jansson 14-Aug-2002 14:40:14-GMT,4787;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by marionberry.cc.columbia.edu (8.12.3/8.12.3) with ESMTP id g7EJeE6t011805 for ; Wed, 14 Aug 2002 15:40:14 -0400 (EDT) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.9.3/8.9.3) with ESMTP id MAA05556; Wed, 14 Aug 2002 12:51:59 -0400 Received: with ECARTIS (v1.0.0; list unicode); Wed, 14 Aug 2002 12:51:59 -0400 (EDT) Received: from inergen.sybase.com (inergen.sybase.com [192.138.151.43]) by unicode.org (8.9.3/8.9.3) with ESMTP id MAA05541 for ; Wed, 14 Aug 2002 12:51:59 -0400 Received: from smtp1.sybase.com (sybgate [10.22.97.84]) by inergen.sybase.com with ESMTP id MAA07829 for ; Wed, 14 Aug 2002 12:20:51 -0700 (PDT) Received: from smtp1.sybase.com (localhost [127.0.0.1]) by smtp1.sybase.com (Pro-8.9.3/Pro-8.9.3/sendmail 8.9.3 smtp1 2000/11/20) with ESMTP id MAA10128 for ; Wed, 14 Aug 2002 12:20:54 -0700 (PDT) Received: from olympus-dublin.sybase.com (olympus-dum.sybase.com [10.22.97.110]) by smtp1.sybase.com (Pro-8.9.3/Pro-8.9.3/sendmail 8.9.3 smtp1 2000-11-20) with ESMTP id MAA10107; Wed, 14 Aug 2002 12:20:53 -0700 (PDT) Received: from birdie.sybase.com (birdie.sybase.com [10.22.85.43]) by olympus-dublin.sybase.com (8.10.2+Sun/8.10.2) with ESMTP id g7EJKN501665; Wed, 14 Aug 2002 12:20:23 -0700 (PDT) Received: (from kenw@localhost) by birdie.sybase.com (8.8.8+Sun/8.8.8) id MAA10594; Wed, 14 Aug 2002 12:20:23 -0700 (PDT) Date: Wed, 14 Aug 2002 12:20:23 -0700 (PDT) From: Kenneth Whistler Message-Id: <200208141920.MAA10594@birdie.sybase.com> To: dewell@adelphia.net Subject: Re: Furigana Cc: unicode@unicode.org, kenw@sybase.com X-Sun-Charset: US-ASCII X-archive-position: 1867 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: kenw@sybase.com Precedence: bulk List-help: List-unsubscribe: List-software: 
Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode Doug (and Michael also): > What if I *want* to design an annotation-aware rendering mechanism? > Suppose I read Section 13.6 and decide that, instead of just throwing > the annotation characters away, I should attempt to display them > directly above (and smaller than) the "normal" text, the way furigana > are displayed above kanji. > > This would work not only for typical Japanese ruby, but also for > Michael's English-or-Swedish-over-Bliss scenario. It might even be > useful in assisting beleaguered Azerbaijanis, for example, by annotating > Latin-script text with its Cyrillic equivalent. (Just a thought.) > > Would this be conformant? Well, technically conformant, but not wise. If commonly available display and rendering mechanisms are not rendering them as interlinear annotations, then you aren't really providing much assistance here by using a mechanism designed for internal anchors and trying to turn it into something it isn't really up to snuff for. Frankly, you would be much better off making use of the Ruby annotation schemes available in markup languages, which will give you better scoping and attribute mechanisms. Stop worrying a moment about "Why are these characters standardized, and why the hedoublehockeysticks can't I use them?!" and think about the problem that furigana or any other interlinear annotation rendering system has to address: a. How are the annotations adjusted? Left-adjusted, centered, something else? And what point(s) are they synched on? b. If the annotated text or the annotation itself consist of multiple units, are there subalignments? E.g. note note note note text text textextextext text or note note note note text text textextextext text c. Can an annotation itself be stacked into a multiline form? note note note nononononote text d. Can the text of the annotation itself in turn be annotated? e. Can the text have two or more coequal annotations? 
And if so, how are they aligned? f. If the annotation is in a distinct style from the text it annotates, how is that indicated and controlled? g. How is line breaking controlled on a line which also has an annotation? And so on. This is all the kind of stuff that clearly smacks to me of document formatting concerns and rich text. Why anyone would consider such things to be plain text rather escapes me. --Ken 5-Nov-2002 1:05:12-GMT,3336;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by marionberry.cc.columbia.edu (8.12.3/8.12.3) with ESMTP id gA565B5x008898 for ; Tue, 5 Nov 2002 01:05:12 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.11.6/8.11.6) with ESMTP id gA55Vqn04893; Tue, 5 Nov 2002 00:31:53 -0500 Received: with ECARTIS (v1.0.0; list unicode); Tue, 05 Nov 2002 00:31:52 -0500 (EST) Received: from smtprelay2.dc3.adelphia.net (smtprelay2.dc3.adelphia.net [24.50.78.5]) by unicode.org (8.11.6/8.11.6) with ESMTP id gA55Vqn04887 for ; Tue, 5 Nov 2002 00:31:52 -0500 Received: from DouglasEwell.anhmca.adelphia.net ([68.66.66.149]) by smtprelay2.dc3.adelphia.net (Netscape Messaging Server 4.15) with SMTP id H538O708.10B; Tue, 5 Nov 2002 00:31:19 -0500 Message-ID: <00a501c2848c$88f2da20$95424244@anhmca.adelphia.net> From: "Doug Ewell" To: "Unicode Mailing List" Cc: "Joseph Boyle" , "'Edward H Trager'" References: <00CF737ADB57134A833B4A5C0146761507CBEA@SDCEXMB02.corp.siebel.com> Subject: Re: PRODUCING and DESCRIBING UTF-8 with and without BOM Date: Mon, 4 Nov 2002 21:30:57 -0800 MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: 7bit X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 5.50.4807.1700 X-MimeOLE: Produced By Microsoft MimeOLE V5.50.4807.1700 X-archive-position: 3066 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: 
dewell@adelphia.net Precedence: bulk List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode Joseph Boyle wrote: > Newline problems are a good analogy. They still require bookkeeping of > different formats and attention in any new coding and cause new bugs, > even though the problem has been around for decades. Nobody is holding > their breath for any of the platforms to change their newline > convention to match the others or even update all their tools to deal > with the differences - bare LF still doesn't work in Notepad. Of the hundreds of little utility programs I've written over the past 10 years or so, one of the ones I still use most often is FIXCRLF, which (as you might expect) converts files between different CR/LF conventions. I have to; most text files downloaded from the Internet are LF, but most DOS/Windows tools demand CRLF. It's a shame, but hardly a surprise, that the industry could never standardize on one or the other. The invention of U+2028 LINE SEPARATOR was supposed to relieve us of all this misery -- but ironically, the success of UTF-8 has probably killed LS for good. Not only do people now expect Unicode text files to be backward-compatible with ASCII, which favors CR and/or LF instead of LS, but the single character LS requires more bytes in UTF-8 than the two characters CR and LF. 
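The byte counts behind that last remark are easy to check. What follows is a minimal Python sketch, not part of the original message; the choice of Python and the helper name to_lf are assumptions made for illustration, and to_lf is only in the spirit of a tool like FIXCRLF, not a reconstruction of it.

```python
# Compare the UTF-8 size of U+2028 LINE SEPARATOR with the CR LF pair.
LS = "\u2028"
CRLF = "\r\n"

ls_bytes = LS.encode("utf-8")      # one character, three bytes (E2 80 A8)
crlf_bytes = CRLF.encode("utf-8")  # two characters, two bytes (0D 0A)

print(len(ls_bytes), len(crlf_bytes))


def to_lf(text):
    """Hypothetical helper: fold CRLF, bare CR and LS into bare LF."""
    # CRLF must be handled first so it does not become two line breaks.
    return text.replace("\r\n", "\n").replace("\r", "\n").replace(LS, "\n")


print(to_lf("one\r\ntwo\rthree\u2028four\nfive").split("\n"))
```

Since every ASCII character stays a single byte in UTF-8, CR and LF cost one byte each, while U+2028 falls in the three-byte range.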
-Doug Ewell Fullerton, California 18-Nov-2002 8:23:02-GMT,3207;000000000011 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by dewberry.cc.columbia.edu (8.12.3/8.12.3) with ESMTP id gAIDMxgv003224 for ; Mon, 18 Nov 2002 08:23:00 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.11.6/8.11.6) with ESMTP id gAICnbb18831; Mon, 18 Nov 2002 07:49:38 -0500 Received: with ECARTIS (v1.0.0; list unicode); Mon, 18 Nov 2002 07:49:37 -0500 (EST) Received: from barry.mail.mindspring.net (barry.mail.mindspring.net [207.69.200.25]) by unicode.org (8.11.6/8.11.6) with ESMTP id gAICnbb18825 for ; Mon, 18 Nov 2002 07:49:37 -0500 Received: from 1cust138.tnt13.krk1.da.uu.net ([67.250.82.138] helo=asmusf7500) by barry.mail.mindspring.net with esmtp (Exim 3.33 #1) id 18DlLG-0005w3-00; Mon, 18 Nov 2002 07:49:31 -0500 Message-Id: <4.2.0.58.20021118044648.00af9710@popd.ix.netcom.com> X-Sender: asmusf@popd.ix.netcom.com X-Mailer: QUALCOMM Windows Eudora Pro Version 4.2.0.58 Date: Mon, 18 Nov 2002 04:55:32 -0800 To: "Dominikus Scherkl" , From: Asmus Freytag Subject: RE: The result of the plane 14 tag characters review. In-Reply-To: <2F89C141B5B67645BB56C03853757882481761@guk1d002.glueckkanj a.org> Mime-Version: 1.0 Content-Type: text/plain; charset="us-ascii"; format=flowed X-archive-position: 3360 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: asmusf@ix.netcom.com Precedence: bulk List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode At 11:50 AM 11/18/02 +0100, Dominikus Scherkl wrote: > > I agree that in this example, higher-level markup would do > > all that is necessary. >But I'd like to read a "README.TXT" with a plain-text editor. 
>These files are very common - and if they're not deprecated >using plane-14-tags would be very nice to have in an multi-language >readme (where higher-level tagging is not available). You might find it very un-nice in practice, since plain text editors/viewers are notorious for not supporting tagging of any kind. In fact, all you are likely to get is a series of uninterpreted black boxes for the tags. As a result of being monofont, plain text viewers/editors are also notorious for not supporting much beyond a limited repertoire of characters [a few noble exceptions to this rule notwithstanding]. Unless a widely used plain-text protocol requires or supports these characters, they remain a solution in search of a problem. I still haven't seen evidence of such a protocol. Remember, just because something's in the standard doesn't mean that it's magically supported everywhere. If it can't be added to existing systems by simple font extension (or similar updates) it may not be supported for a long time. Until it's widely supported, it's de facto not available to end-users. 
A./ 18-Nov-2002 12:13:04-GMT,2765;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by marionberry.cc.columbia.edu (8.12.3/8.12.3) with ESMTP id gAIHCx1n016735 for ; Mon, 18 Nov 2002 12:13:00 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.11.6/8.11.6) with ESMTP id gAIFaYb13489; Mon, 18 Nov 2002 10:36:35 -0500 Received: with ECARTIS (v1.0.0; list unicode); Mon, 18 Nov 2002 10:36:34 -0500 (EST) Received: from watsol.cc.columbia.edu (IDENT:cu41754@watsol.cc.columbia.edu [128.59.39.139]) by unicode.org (8.11.6/8.11.6) with ESMTP id gAIFaYb13483 for ; Mon, 18 Nov 2002 10:36:34 -0500 Received: from watsol.cc.columbia.edu (localhost [127.0.0.1]) by watsol.cc.columbia.edu (8.12.3/8.12.3) with ESMTP id gAIFaGAr006469; Mon, 18 Nov 2002 10:36:17 -0500 (EST) Received: (from fdc@localhost) by watsol.cc.columbia.edu (8.12.3/8.12.3/Submit) id gAIFaFOI006468; Mon, 18 Nov 2002 10:36:15 -0500 (EST) Date: Mon, 18 Nov 2002 10:36:14 EST From: Frank da Cruz To: Asmus Freytag Cc: "Dominikus Scherkl" , Subject: RE: The result of the plane 14 tag characters review. In-Reply-To: Your message of Mon, 18 Nov 2002 04:55:32 -0800 Message-ID: X-archive-position: 3365 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: fdc@columbia.edu Precedence: bulk List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode > As a result of being monofont plain text viewers/editors are also notorious > for not supporting much beyond a limited repertoire of characters [a few > noble exceptions to this rule notwithstanding]. > > Unless a widely used plain-text protocol requires or supports these > characters, they remain a solution in search of a problem. I still haven't > seen evidence of such a protocol. > We're doing our best. 
Kermit software supports Unicode in plain text terminal sessions: http://www.columbia.edu/kermit/glass.html and Linux is moving in that direction too (e.g. the Linux console in the latest Red Hat release, as well as many Linux / GNU plain-text utilities, now including EMACS 21.1): http://www.cl.cam.ac.uk/~mgk25/unicode.html Plain-text interactive shell and application access is still a widely used protocol, and is becoming more internationalized every day thanks to Unicode. - Frank 5-Feb-2003 20:08:56-GMT,4090;000000000001 Return-Path: Received: from unicode.org (unicode.org [209.235.17.55]) by dewberry.cc.columbia.edu (8.12.3/8.12.3) with ESMTP id h1618tPw018955 for ; Wed, 5 Feb 2003 20:08:56 -0500 (EST) Received: from sarasvati.unicode.org (localhost.localdomain [127.0.0.1]) by unicode.org (8.11.6/8.11.6) with ESMTP id h160XVc12352; Wed, 5 Feb 2003 19:33:31 -0500 Received: with ECARTIS (v1.0.0; list unicode); Wed, 05 Feb 2003 19:33:31 -0500 (EST) Received: from mtiwmhc13.worldnet.att.net (mtiwmhc13.worldnet.att.net [204.127.131.117]) by unicode.org (8.11.6/8.11.6) with ESMTP id h160XUc12346 for ; Wed, 5 Feb 2003 19:33:30 -0500 Message-Id: <200302060033.h160XUc12346@unicode.org> Received: from mtiwebc20 (mtiwebc20.worldnet.att.net[204.127.135.59]) by mtiwmhc13.worldnet.att.net (mtiwmhc13) with SMTP id <20030206003322113005j7nde>; Thu, 6 Feb 2003 00:33:22 +0000 Received: from [216.126.184.133] by mtiwebc20; Thu, 06 Feb 2003 00:33:21 +0000 From: jameskass@att.net To: unicode@unicode.org Subject: Re: VS vs. P14 (was Re: Indic Devanagari Query) Date: Thu, 06 Feb 2003 00:33:21 +0000 X-Mailer: AT&T Message Center Version 1 (Nov 25 2002) X-Authenticated-Sender: amFtZXNrYXNzQGF0dC5uZXQ= X-archive-position: 4033 X-ecartis-version: Ecartis v1.0.0 Sender: unicode-bounce@unicode.org Errors-to: unicode-bounce@unicode.org X-original-sender: jameskass@att.net Precedence: bulk List-help: List-unsubscribe: List-software: Ecartis version 1.0.0 List-ID: X-List-ID: X-list: unicode . 
Peter Constable wrote, > Sure, but why do we want to place so much demand on plain text when the > vast majority of content we interchange is in some form of marked-up or > rich text? Let's let plain text be that -- plain -- and look to the markup > conventions that we've invested so much in and that are working for us to > provide the kinds of thing that we designed markup for in the first place. > Besides, a "plain-text" file that begins and ends with p14 tags is a > marked-up file, whether someone calls it "plain text" or not. We have > little or no infrastructure for handling that form of markup, and a large > and increasing amount of infrastructure for handling the more typical forms > of markup. We place so much demand on plain text because we use plain text. We continue to advance from the days when “plain text” meant ASCII only rendered in bitmapped monospaced monochrome. We don’t rely on mark-up or higher protocols to distinguish between different European styles of quotation marks. We no longer need proprietary rich-text formats and font switching abilities to be able to display Greek and Latin text from the same file. > I repeat, plain text remains legible without anything indicating which eng > (or whatever) may be preferred by the author, and (since the requirement > for plain text is legibility) therefore this is not really an argument for > using p14 language tags. IMO. Is legibility the only requirement of plain text? Might additional requirements include appropriate, correct encoding and correct display? To illustrate a legible plain text run which displays as intended (all things being equal) yet is not appropriately encoded (this e-mail is being sent as plain text UTF-8): 𝑰𝒇 𝒚𝒐𝒖 𝒄𝒂𝒏 𝒓𝒆𝒂𝒅 𝒕𝒉𝒊𝒔 𝒎𝒆𝒔𝒔𝒂𝒈𝒆... 
𝒚𝒐𝒖 𝒎𝒂𝒚 𝒘𝒊𝒔𝒉 𝒕𝒐 𝒋𝒐𝒊𝒏 𝑴𝑨𝑨𝑨* 𝒂𝒕 𝓫𝓵𝓪𝓱𝓫𝓵𝓪𝓱𝓫𝓵𝓪𝓱𝓭𝓸𝓽𝓬𝓸𝓶 (*𝗠𝖺𝗍𝗁 𝗔𝗅𝗉𝗁𝖺𝖻𝖾𝗍𝗌 𝗔𝖻𝗎𝗌𝖾𝗋𝗌 𝗔𝗇𝗈𝗇𝗒𝗆𝗈𝗎𝗌) Clearly, correct and appropriate encoding (as well as legibility) should be a requirement of plain text. Is correct display also a valid requirement for plain text? It is for some... Respectfully, James Kass .
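The styled run in the message above is written with Mathematical Alphanumeric Symbols rather than ordinary letters, which is why it can be legible and still not appropriately encoded. The following Python sketch, added here as an illustration and not part of the original message, inspects the first two characters of that run and folds them with NFKC compatibility normalization. The helper name uses_math_alphanumerics is hypothetical, and the range check assumes the U+1D400..U+1D7FF block.

```python
import unicodedata

# U+1D470 and U+1D487: the bold italic "I" and "f" that open the styled run.
styled = chr(0x1D470) + chr(0x1D487)

# Each character has its own name; neither is a LATIN letter.
for ch in styled:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Compatibility normalization maps the styled forms back to plain letters.
folded = unicodedata.normalize("NFKC", styled)
print(folded)


def uses_math_alphanumerics(text):
    """Hypothetical check: does any character fall in U+1D400..U+1D7FF?"""
    return any(0x1D400 <= ord(ch) <= 0x1D7FF for ch in text)


print(uses_math_alphanumerics(styled), uses_math_alphanumerics(folded))
```

A search for the word "If" would miss the styled text unless the searcher applied a folding of this kind first, which is one concrete cost of the encoding choice the message is poking fun at.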