Red Hat Bugzilla – Bug 40246
col -b chokes on soft hyphens
Last modified: 2007-04-18 12:33:11 EDT
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i686)
Description of problem:
"man whatever | col -b" produces truncated man pages.
Sleuthing suggests that whenever "col -b" hits a soft hyphen,
it gives up.
One could debate whether to classify this as a col bug or a troff bug.
I picked col, but I'm not committed to that.
As far as I know, this is the *main* use for "col -b",
so one of them ought to be fixed.
Steps to Reproduce:
1. echo -e 'foo\255bar' | col -b
Actual Results: You see only "foo"
Expected Results: You should see foo-bar
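For anyone who wants to see exactly what byte the test case feeds to col, here is a small Python sketch (Python is just a convenient inspection tool here, it has nothing to do with col itself). It shows that octal 255 is byte 0xAD, which is the soft hyphen in ISO 8859-1 but is not valid ASCII and is not a valid UTF-8 sequence on its own. The helper name `decodes` is made up for illustration.

```python
import unicodedata

# Same bytes as: echo -e 'foo\255bar' (octal 255 == 0xAD)
raw = b"foo\255bar"
byte = raw[3]

# Interpreted as ISO 8859-1 (Latin-1), 0xAD is U+00AD.
as_latin1 = raw.decode("latin-1")


def decodes(data: bytes, encoding: str) -> bool:
    """Hypothetical helper: True if data is valid in the given encoding."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(hex(byte))                          # 0xad
print(unicodedata.name(as_latin1[3]))     # SOFT HYPHEN
print(decodes(raw, "ascii"))              # False
print(decodes(raw, "utf-8"))              # False
```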
Oops. Looks like "col -b" chokes on other high-bit chars also
found in man pages, like the copyright symbol.
Bottom line: As the man page says, "col" is a tool for filtering
"nroff" and "tbl" output. It shouldn't choke on it.
1. The 'man man' page says:
To get a plain text version of a man page, without backspaces and
# man foo | col -b > foo.mantxt
So I assume that -b is supposed to filter out more than just backspaces.
2. The -p switch to col appears to possibly be related to printing out funky chars.
3. 'man ls | col -b' prints out a complete man page here, so I can't reproduce
Behavior depends on the value of $LANG/$LC_ALL/$LC_CTYPE.
The problem is exhibited if these are set to 'C' or unset (the POSIX- and
though not if they are set to "en_US".
Well, it could be argued that if you don't have your locale envvars set
correctly, you can't expect 'col' to know what extended characters are actually
I've never used 'col' and thus don't have a clue what the actual results are
supposed to be (especially considering language interaction).
(1) Setting $LC_ALL/$LANG/$LC_COLLATE to C (or POSIX, or unsetting them) isn't
setting them incorrectly. It's setting them to their POSIX- and ANSI-specified defaults.
These three values (C, POSIX, unset), are the ones that are supposed to
traditional UNIX behaviors.
Indeed, POSIX and ANSI are careful to say that only these settings provide
In other words, what a vendor does with these environment variables set in other
ways is the vendor's business, but things had better work if you set them to C
or unset them or set them to POSIX [ref: IEEE 1003.1b-1994, section 184.108.40.206].
(In passing, this is how it would be wise for Red Hat to set these variables for
most of its routine testing, since it's how most professional programmers and
certification labs should be expected to set those variables.)
(2) If you're assigned the bug and still haven't even used col, why did I go to
the trouble to make you such clean test cases? :-)
Here's another one that may help: contrast
LANG=C man ls | LANG=C col -b
LANG=C man ls | LANG=en_US col -b
From this, I conclude the bug's in col, not in troff.
1) I was suggesting that because you might merely be expecting col to behave in
a way that it should only behave when a particular language is set...
From what I can gather, it seems like col probably decides whether a character
is printable, and that would most definitely require $LANG to be set. If LANG=C,
it could reasonably be deciding that non-ascii chars are not printable...
2) I ran col while testing for this bug. I don't actually use col for anything,
have no clue what it's supposed to do or why someone would use it, etc. etc.
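The printability guess in 1) can be sketched as follows. This is only a model of what C's isprint() reports in the "C" locale, where the printing characters are the ASCII range 0x20 through 0x7E; it is not taken from the col source, and the function name `isprint_c_locale` is made up for illustration.

```python
def isprint_c_locale(byte: int) -> bool:
    """Model of isprint() in the "C" locale: printable ASCII only."""
    return 0x20 <= byte <= 0x7E


# An ordinary hyphen is printable; the Latin-1 soft hyphen (0xAD) and
# copyright sign (0xA9) are not, because they lie outside ASCII.
for b in (ord("-"), 0xAD, 0xA9):
    print(hex(b), isprint_c_locale(b))
```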
(1) If LANG is unset, then the command (and isprint()) must work exactly as it does
with LANG=C or LANG=POSIX. The standards require that non-ASCII characters
be considered unprintable in this case. Ditto for the other two environment
(2) col -b is the traditional tool for filtering nroff output. (I'm repeating
myself only because you said you "have no clue what it's supposed to do or why
someone would use it, etc. etc." :-)
It's used in pipelines and shell scripts to filter nroff output -- like man
so you can read it easily in text editors or pass it through other filters like
(When nroff is producing "F^HFO^HOO^HO", grepping for "FOO" doesn't find it
until you go
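To make the overstrike point concrete, here is a rough Python sketch of the part of "col -b" that matters for grepping: keep only the last character written to each column and emit no backspaces. It is a simplified model, not the col implementation (real col also handles things like reverse line feeds), and the name `strip_overstrikes` is made up for illustration.

```python
def strip_overstrikes(line: str) -> str:
    """Keep only the last character written to each column position."""
    cols = []   # one entry per column
    pos = 0     # current column
    for ch in line:
        if ch == "\b":
            pos = max(pos - 1, 0)   # backspace moves left, never past column 0
        elif pos < len(cols):
            cols[pos] = ch          # overstrike: the later character wins
            pos += 1
        else:
            cols.append(ch)
            pos += 1
    return "".join(cols)


bold = "F\bFO\bOO\bO"               # nroff-style bold: each letter struck twice
print("FOO" in bold)                # False
print(strip_overstrikes(bold))      # FOO
```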
Sounds like there may actually be two bugs:
(a) col(1) shouldn't just bail when confronted with non-printable characters.
Understand, it isn't just not printing things, it's giving up as soon as it hits
(Please actually try it.)
(b) troff(1) is putting unprintable ( == isprint() returns false) characters in
man pages in the C/POSIX locale.
To me, both need to be fixed, but (b) seems more serious.
I just tried 'export LANG=C; man ls | col -b' with util-linux-2.11n, and didn't
see any truncation. Could you try out the util-linux packages from
http://people.redhat.com/2.11n-1/ and see if the problem has gone away, or if
it's just me not doing the reproducing correctly?
When I try to go to http://people.redhat.com/2.11n-1/
the server disavows all knowledge of the page.
Sorry, the correct URL is http://people.redhat.com/sopwith/2.11n-1/
This bug seems to still be present in Red Hat 9 and Fedora 1.0. What
is the status?
Sooo, picking this up after way too long a delay (thanks for bringing
this to my attention!)
The current version of 'col' uses getwchar() to read input. getwchar
expects valid multibyte sequences as input, and treats it as an EOF if
it gets an invalid multibyte character sequence. So basically the deal
is that you can't throw random high-bit characters at 'col' and expect
it to work - it has to be valid in the current locale. That makes the
'foo\255bar' example definitely invalid.
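The truncation described above can be imitated in a few lines of Python. This is only a simulation of "stop reading at the first invalid sequence, as if it were EOF"; it does not call getwchar(), and using Python's "ascii" codec as a stand-in for the C locale is an assumption (what a real C library accepts in that locale can differ). The name `read_until_invalid` is made up for illustration.

```python
import codecs


def read_until_invalid(data: bytes, encoding: str) -> str:
    """Decode one byte at a time; treat the first invalid sequence as EOF."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    out = []
    for i in range(len(data)):
        try:
            out.append(decoder.decode(data[i:i + 1]))
        except UnicodeDecodeError:
            break   # give up here, everything after this point is lost
    return "".join(out)


sample = b"foo\255bar"
print(read_until_invalid(sample, "ascii"))            # foo
print(read_until_invalid(sample, "utf-8"))            # foo
print(repr(read_until_invalid(sample, "latin-1")))    # 'foo\xadbar'
```

With an encoding in which 0xAD is a valid character (Latin-1 here), the whole string comes through, which matches the observation earlier in this report that the output depends on the locale settings.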
I need an example man page that is broken with 'export LANG=C; man <X>
| col -b', and then I can try to narrow things down for you.