From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i686)
Description of problem:
"man whatever | col -b" produces truncated man pages. Sleuthing suggests that whenever "col -b" hits a soft hyphen, it gives up. One could debate whether to classify this as a col bug or a troff bug. I picked col, but I'm not committed to that. As far as I know, this is the *main* use for "col -b", so one of them ought to be fixed.
How reproducible:
Always
Steps to Reproduce:
1. echo -e 'foo\255bar' | col -b
Actual Results:
You see only "foo"
Expected Results:
You should see foo-bar
Additional info:
Oops. Looks like "col -b" also chokes on other high-bit chars found in man pages, like the copyright symbol. Bottom line: As the man page says, "col" is a tool for filtering "nroff" and "tbl" output. It shouldn't choke on it.
1. The 'man man' page says: "To get a plain text version of a man page, without backspaces and underscores, try # man foo | col -b > foo.mantxt". So I assume that -b is supposed to filter out more than just backspaces.
2. The -p switch to col may be related to printing out funky chars.
3. 'man ls | col -b' prints out a complete man page here, so I can't reproduce the problem.
Behavior depends on the value of $LANG/$LC_ALL/$LC_CTYPE. The problem is exhibited if these are set to 'C' or unset (the POSIX- and ANSI-required defaults), though not if they are set to "en_US".
Well, it could be argued that if you don't have your locale envvars set correctly, you can't expect 'col' to know what extended characters are actually printable. I've never used 'col' and thus don't have a clue what the actual results are supposed to be (especially considering language interaction).
(1) Setting $LC_ALL/$LANG/$LC_COLLATE to C (or POSIX, or unsetting them) isn't setting them incorrectly. It's setting them to their POSIX- and ANSI-specified defaults. These three values (C, POSIX, unset) are the ones that are supposed to guarantee traditional UNIX behaviors. Indeed, POSIX and ANSI are careful to say that only these settings provide guaranteed behaviors. In other words, what a vendor does with these environment variables set in other ways is the vendor's business, but things had better work if you set them to C or unset them or set them to POSIX [ref: IEEE 1003.1b-1994, section 8.1.2.2]. (In passing, this is how it would be wise for Red Hat to set these variables for most of its routine testing, since it's how most professional programmers and certification labs should be expected to set those variables.)
(2) If you're assigned the bug and still haven't even used col, why did I go to all the trouble to make you such clean test cases? :-) Here's another one that may help: contrast
    LANG=C man ls | LANG=C col -b
with
    LANG=C man ls | LANG=en_US col -b
From this, I conclude the bug's in col, not in troff.
1) I was suggesting that because you might merely be expecting col to behave in a way that it should only behave when a particular language is set... From what I can gather, it seems like col probably decides whether a character is printable, and that would most definitely require $LANG to be set. If LANG=C, it could reasonably be deciding that non-ASCII chars are not printable...
2) I ran col while testing for this bug. I don't actually use col for anything, have no clue what it's supposed to do or why someone would use it, etc. etc.
(1) If LANG is unset, then the command (and isprint()) must work exactly as it does with LANG=C or LANG=POSIX. The standards require that non-ASCII characters be considered unprintable in this case. Ditto for the other two environment variables.
(2) col -b is the traditional tool for filtering nroff output. (I'm repeating myself only because you said you "have no clue what it's supposed to do or why someone would use it, etc. etc." :-) It's used in pipelines and shell scripts to filter nroff output -- like man pages -- so you can read it easily in text editors or pass it through other filters like grep(1). (When nroff is producing "F^HFO^HOO^HO", grepping for "FOO" doesn't find it until you go through col(1).)
Sounds like there may actually be two bugs:
(a) col(1) shouldn't just bail when confronted with non-printable characters. Understand, it isn't just not printing things, it's giving up as soon as it hits high-bit characters. (Please actually try it.)
(b) troff(1) is putting unprintable ( == isprint() returns false) characters in man pages in the C/POSIX locale.
To me, both need to be fixed, but (b) seems more serious.
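For readers who have not used col, the overstrike point in the comment above can be illustrated with a short Python sketch. This is only a rough model of the visible effect of "col -b" on backspace sequences, not col's actual implementation; the function name strip_overstrikes is made up for this illustration.

```python
def strip_overstrikes(text: str) -> str:
    """Rough model of `col -b` for backspaces only (hypothetical helper):
    each backspace discards the character written just before it, so only
    the last character written to a column survives."""
    out = []
    for ch in text:
        if ch == "\b":
            # Back up one column, but never across a line boundary.
            if out and out[-1] != "\n":
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


# nroff-style bold: every letter is struck twice with a backspace between.
bold_foo = "F\bFO\bOO\bO"

print("FOO" in bold_foo)            # False: the backspaces break up the match
print(strip_overstrikes(bold_foo))  # FOO
```

The same loop handles nroff-style underlining ("_^Hb_^Ha_^Hr" becomes "bar"), which is why a plain substring search only works after the filter has run.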
Hi, I just tried 'export LANG=C; man ls | col -b' with util-linux-2.11n, and didn't see any truncation. Could you try out the util-linux packages from http://people.redhat.com/2.11n-1/ and see if the problem has gone away, or if it's just me not doing the reproducing correctly?
When I try to go to http://people.redhat.com/2.11n-1/ the server disavows all knowledge of the page. Typo?
Sorry, the correct URL is http://people.redhat.com/sopwith/2.11n-1/
This bug still seems to be present in Red Hat 9 and Fedora 1.0. What is the status? Paul
Sooo, picking this up after way too long a delay (thanks for bringing this to my attention!) The current version of 'col' uses getwchar() to read input. getwchar() expects valid multibyte sequences as input; when it gets an invalid multibyte character sequence it returns WEOF (with errno set to EILSEQ), and col treats that like an EOF. So basically the deal is that you can't throw random high-bit characters at 'col' and expect it to work - the input has to be valid in the current locale. That makes the 'foo\255bar' example definitely invalid. I need an example man page that is broken with 'export LANG=C; man <X> | col -b', and then I can try to narrow things down for you.
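The truncation described above can be modelled in a few lines of Python. This is a sketch under stated assumptions, not col's code (which is C and calls getwchar()): it assumes the C locale behaves like strict ASCII and that en_US on the reporter's system means ISO-8859-1, and the helper name read_until_invalid is invented for the illustration.

```python
def read_until_invalid(data: bytes, encoding: str) -> str:
    """Hypothetical model of a read loop that stops at the first byte
    sequence that is invalid in `encoding`, the way a getwchar() loop
    that treats a conversion error like EOF would."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the first undecodable byte;
        # everything after it is silently dropped, as in the bug report.
        return data[:err.start].decode(encoding)


# The test case from the report: octal 255 is byte 0xAD,
# which is the soft hyphen in ISO-8859-1.
sample = b"foo\255bar\n"

# Assumption: LANG=C is modelled as strict ASCII.
print(repr(read_until_invalid(sample, "ascii")))  # 'foo'

# Assumption: LANG=en_US is modelled as ISO-8859-1, where 0xAD is valid.
print(len(read_until_invalid(sample, "latin-1")))  # 8
```

Under these assumptions the model reproduces both observations in the thread: output stops after "foo" in the C locale, and the whole line comes through under en_US.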