Bug 40246 - col -b chokes on soft hyphens
Summary: col -b chokes on soft hyphens
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Red Hat Linux
Classification: Retired
Component: util-linux
Version: 7.1
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Elliot Lee
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2001-05-11 14:39 UTC by Jeff Haemer
Modified: 2007-04-18 16:33 UTC (History)
CC List: 0 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2004-12-02 19:50:05 UTC
Embargoed:


Description Jeff Haemer 2001-05-11 14:39:34 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.76 [en] (X11; U; Linux 2.4.2-2 i686)

Description of problem:
"man whatever | col -b" produces truncated man pages.
Sleuthing suggests that whenever "col -b" hits a soft hyphen,
it gives up.

One could debate whether to classify this as a col bug or a troff bug.
I picked col, but I'm not committed to that.

As far as I know, this is the *main* use for "col -b",
so one of them ought to be fixed.

How reproducible:
Always

Steps to Reproduce:
1. echo -e 'foo\255bar' | col -b

Actual Results:  You see only "foo"

Expected Results:  You should see foo-bar
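
For reference, a minimal Python sketch of what the test input in step 1
contains. It assumes the byte is meant to be read as ISO-8859-1 (Latin-1),
which the report does not state explicitly; the variable names are
illustrative only.

```python
import unicodedata

# `echo -e 'foo\255bar'` emits the octal escape \255 as one raw byte.
raw = b"foo" + bytes([0o255]) + b"bar"

# Octal 255 is 0xAD (decimal 173), a byte with the high bit set.
high_bit_set = raw[3] >= 0x80

# Assumption: read as ISO-8859-1, where 0xAD is U+00AD SOFT HYPHEN.
soft_hyphen_name = unicodedata.name(raw.decode("latin-1")[3])

# The same byte is not valid 7-bit ASCII.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(hex(raw[3]), soft_hyphen_name, ascii_ok)
```

So the test case feeds col a single non-ASCII byte between two ASCII words.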

Additional info:

Comment 1 Jeff Haemer 2001-05-12 20:23:35 UTC
Oops.  It looks like "col -b" also chokes on other high-bit characters
found in man pages, like the copyright symbol.

Bottom line: As the man page says, "col" is a tool for filtering
"nroff" and "tbl" output.  It shouldn't choke on it.


Comment 2 Elliot Lee 2001-07-18 18:26:54 UTC
1. The 'man man' page says:
       To get a plain text version of a man page, without backspaces and
       underscores, try

         # man foo | col -b > foo.mantxt

So I assume that -b is supposed to filter out more than just backspaces.

2. The -p switch to col may be related to printing out funky characters.

3. 'man ls | col -b' prints out a complete man page here, so I can't reproduce
the problem.


Comment 3 Jeff Haemer 2001-07-18 19:44:12 UTC
Behavior depends on the value of $LANG/$LC_ALL/$LC_CTYPE.
The problem appears if these are set to 'C' or unset (the POSIX- and
ANSI-required defaults), though not if they are set to "en_US".

Comment 4 Elliot Lee 2001-07-18 20:44:52 UTC
Well, it could be argued that if you don't have your locale envvars set
correctly, you can't expect 'col' to know what extended characters are actually
printable.

I've never used 'col' and thus don't have a clue what the actual results are
supposed to be (especially considering language interaction).

Comment 5 Jeff Haemer 2001-07-18 21:48:51 UTC
(1) Setting $LC_ALL/$LANG/$LC_CTYPE to C (or POSIX, or unsetting them) isn't
setting them incorrectly.  It's setting them to their POSIX- and ANSI-specified
defaults.  These three values (C, POSIX, unset) are the ones that are supposed
to guarantee traditional UNIX behaviors.

Indeed, POSIX and ANSI are careful to say that only these settings provide
guaranteed behaviors.  In other words, what a vendor does with these
environment variables set in other ways is the vendor's business, but things
had better work if you set them to C, unset them, or set them to POSIX
[ref: IEEE 1003.1b-1994, section 8.1.2.2].

(In passing, this is how it would be wise for Red Hat to set these variables for
most of its routine testing, since it's how most professional programmers and
certification labs should be expected to set those variables.)

(2) If you're assigned the bug and still haven't even used col, why did I go to
all the trouble to make you such clean test cases? :-)

Here's another one that may help: contrast

	LANG=C man ls | LANG=C col -b

with

	LANG=C man ls | LANG=en_US col -b

From this, I conclude the bug's in col, not in troff.

Comment 6 Elliot Lee 2001-07-18 22:53:41 UTC
1) I was suggesting that because you might merely be expecting col to behave in
a way that it should only behave when a particular language is set...

From what I can gather, it seems like col probably decides whether a character
is printable, and that would most definitely require $LANG to be set. If LANG=C,
it could reasonably be deciding that non-ASCII chars are not printable...

2) I ran col while testing for this bug. I don't actually use col for anything,
have no clue what it's supposed to do or why someone would use it, etc. etc.

Comment 7 Jeff Haemer 2001-07-18 23:27:31 UTC
(1) If LANG is unset, then the command (and isprint()) must work exactly as it
does with LANG=C or LANG=POSIX.  The standards require that non-ASCII
characters be considered unprintable in this case.  Ditto for the other two
environment variables.

(2) col -b is the traditional tool for filtering nroff output.  (I'm repeating
myself only because you said you "have no clue what it's supposed to do or why
someone would use it, etc. etc." :-)

It's used in pipelines and shell scripts to filter nroff output -- like man
pages -- so you can read it easily in text editors or pass it through other
filters like grep(1).  (When nroff is producing "F^HFO^HOO^HO", grepping for
"FOO" doesn't find it until you go through col(1).)
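
To illustrate the overstrike point, here is a rough Python sketch of the
backspace handling only. It is a simplified model, not col's actual
implementation: the function name strip_overstrikes is hypothetical, and it
ignores reverse line feeds, tabs, and the other control sequences col handles.

```python
def strip_overstrikes(text: str) -> str:
    """Apply each backspace by erasing the character written before it."""
    out = []
    for ch in text:
        if ch == "\b":
            # Never back up past the start of the current line.
            if out and out[-1] != "\n":
                out.pop()
        else:
            out.append(ch)
    return "".join(out)


# nroff-style bold: each letter is printed, backspaced over, and printed again.
nroff_bold = "F\bFO\bOO\bO"
found_before = "FOO" in nroff_bold
found_after = "FOO" in strip_overstrikes(nroff_bold)
print(found_before, found_after)
```

The same approach covers nroff-style underlining ("_^Hl_^Hs"), because only
the last character written to each position is kept.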

Sounds like there may actually be two bugs:

(a) col(1) shouldn't just bail when confronted with non-printable characters.
Understand, it isn't just declining to print them; it's giving up as soon as it
hits a high-bit character.  (Please actually try it.)

(b) troff(1) is putting unprintable ( == isprint() returns false) characters in
man pages in the C/POSIX locale.

To me, both need to be fixed, but (b) seems more serious.

Comment 8 Elliot Lee 2001-12-28 18:44:31 UTC
Hi,

I just tried 'export LANG=C; man ls | col -b' with util-linux-2.11n, and didn't
see any truncation. Could you try out the util-linux packages from
http://people.redhat.com/2.11n-1/ and see if the problem has gone away, or if
it's just me not doing the reproducing correctly?

Comment 9 Jeff Haemer 2002-01-07 19:39:12 UTC
When I try to go to http://people.redhat.com/2.11n-1/
the server disavows all knowledge of the page.
Typo?

Comment 10 Elliot Lee 2002-01-07 19:43:59 UTC
Sorry, the correct URL is http://people.redhat.com/sopwith/2.11n-1/

Comment 11 Paul Tibbitts 2004-02-24 18:20:48 UTC
This bug still seems to be present in Red Hat 9 and Fedora 1.0.  What is the
status?

Paul

Comment 12 Elliot Lee 2004-07-07 15:24:22 UTC
Sooo, picking this up after way too long a delay (thanks for bringing
this to my attention!)

The current version of 'col' uses getwchar() to read input. getwchar()
expects valid multibyte sequences as input, and when it gets an invalid
multibyte sequence it returns WEOF, the same value it returns at end of
file, so 'col' treats it as an EOF. So basically the deal is that you can't
throw random high-bit characters at 'col' and expect it to work - they have
to be valid in the current locale. That makes the 'foo\255bar' example
definitely invalid.
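
The effect described above can be modeled with a short Python sketch. This is
an analogy, not col's code: the read_chars helper is hypothetical, the "ascii"
codec stands in for the C locale, and "latin-1" stands in for an en_US locale
(an assumption about the encoding in use at the time).

```python
from typing import Iterator


def read_chars(data: bytes, encoding: str) -> Iterator[str]:
    """Yield decoded characters, stopping silently at the first invalid byte.

    Models a read loop that cannot tell "invalid input" apart from a real
    end of file. Handles single-byte encodings only.
    """
    for byte in data:
        try:
            yield bytes([byte]).decode(encoding)
        except UnicodeDecodeError:
            return  # treated like EOF; the rest of the input is dropped


data = b"foo\xadbar\n"
as_c_locale = "".join(read_chars(data, "ascii"))
as_latin1 = "".join(read_chars(data, "latin-1"))
print(repr(as_c_locale))
print(repr(as_latin1))
```

In this model the "ascii" run stops after "foo", matching the truncation in
the original report, while the "latin-1" run passes the whole line through.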

I need an example man page that is broken with 'export LANG=C; man <X>
| col -b', and then I can try to narrow things down for you.

