88126 – (IT_38206) nroff fails with LANG=C on many man pages.

Summary: nroff fails with LANG=C on many man pages.

Keywords:
Status:	CLOSED ERRATA
Alias:	IT_38206
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	groff
Sub Component:
Version:	3.0
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Miroslav Lichvar
QA Contact:	Mike McLean
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	187539

Reported:	2003-04-06 12:19 UTC by Stephen John Smoogen
Modified:	2007-11-30 22:06 UTC
CC List:	5 users
Fixed In Version:	RHBA-2006-0378
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-04-26 21:14:45 UTC
Target Upstream Version:
Embargoed:

Attachments
A possible way to let man run directory-dependent iconv on manpage (3.88 KB, patch) 2003-08-06 05:43 UTC, Philip Spencer

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2006:0378	0	normal	SHIPPED_LIVE	groff bug fix update	2006-07-19 19:12:00 UTC

Description Stephen John Smoogen 2003-04-06 12:19:35 UTC

Description of problem:

If the LANG is set to C for various reasons, then man pages will not be properly
formatted. This can be seen with the following:

LANG=C man toupper
iconv: illegal input sequence at position 1311

It dies because there is a UTF-8 character in the page. Tracking down the
problem led to the fact that man searches for nroff before groff and the nroff
command has the following line in it:

    iconv -f UTF-8 -t $CONV ${1+"$@"} | PATH=$GROFF_BIN_PATH:$PATH groff -mtty-char $T $opts

iconv does not suppress warnings and dies when it finds a sequence it can't understand.

LANG=C translates that line to:

iconv -f UTF-8 -t ANSI_X3.4-1968 

which dies. Adding -c, as in

iconv -c -f ...

allows the man page to be formatted correctly.
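
The failure can probably be reproduced without man at all. The following is only a sketch, assuming glibc's iconv and using a made-up sample line (not the real toupper page) that contains the single byte 0xA9:

```shell
# Minimal reproduction sketch (assumes glibc iconv; other implementations may differ).
# The sample text is invented for illustration; byte 0xA9 (octal 251) is not
# a valid UTF-8 sequence on its own.
sample() { printf 'Copyright \251 2001\n'; }

# Without -c, iconv stops at the offending byte and returns a non-zero status.
if sample | iconv -f UTF-8 -t ANSI_X3.4-1968 >/dev/null 2>&1; then
    strict_result=converted
else
    strict_result=failed
fi

# With -c, the offending byte is dropped and the rest of the text gets through.
# The exit status is deliberately ignored here, since it may still be non-zero.
lenient_output=$(sample | iconv -c -f UTF-8 -t ANSI_X3.4-1968 2>/dev/null)

echo "strict:  $strict_result"
echo "lenient: $lenient_output"
```

Note that -c only hides the error: the byte that iconv could not handle is missing from the output.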

Version-Release number of selected component (if applicable):

[smooge@glasya smooge]$ rpm -q groff man glibc-common
groff-1.18.1-20
man-1.5k-6
glibc-common-2.3.2-24.9


How reproducible:
100%

Comment 1 Philip Spencer 2003-08-02 21:05:20 UTC

We see the same problem with LANG=en_US (which is our system's default setting; we cannot use en_US.UTF-8 as the default because many people log in to the system from terminals that are not unicode compatible.) Two comments to add:

(1) It's not that the man page contains a UTF-8 character -- rather, it's that the manpage contains a non-ASCII character which is ISO-8859-1 and NOT UTF-8.

(2) Adding -c to iconv does not completely solve the problem -- it causes the offending character to be omitted altogether, instead of being displayed properly.

That line you mentioned in /usr/bin/gnroff is a patch added by RedHat to the standard GNU nroff, and this patch is the cause of the problems. It was presumably intended to deal with the fact that the localized manpages
(in /usr/share/man/<locale>/man1 etc.) are in UTF-8 format. It attempts to
convert the manpage from UTF-8 to the current encoding before processing it.

The problem is that none of the ordinary manpages are in UTF-8. Most are
plain ASCII but many (76 on our system) include non-ASCII ISO-8859-1 characters.
The nroff script fails on all such manpages because it is treating them as if
they were UTF-8.

The solution as I see it would be as follows:

   (1) nroff should not have a hard-coded assumption that its input file is
       in UTF-8. It should assume its input file is in the current locale
       unless specified otherwise by some new flag or environment variable.

   (2) The man program should, if it finds a page in one of the localized
       directories, either run iconv itself or pass this new flag to nroff.

       Otherwise, if the man program finds the page in a standard
       non-localized directory, it should pass it as is to nroff and nroff
       should not attempt to convert it. (OR ... maybe it should pass nroff
       a flag telling it the input is ISO-8859-1; that way, if the user's
       locale is UTF-8, it will get correctly converted to UTF-8).
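
To make point (1) a bit more concrete, here is a rough sketch of how the input encoding could be chosen. NROFF_INPUT_ENCODING is a hypothetical variable invented for this example (it is not an existing groff or nroff setting), and the function names are illustrative only:

```shell
# NROFF_INPUT_ENCODING is a hypothetical variable used only for this sketch.
# Print the encoding nroff should assume for its input: the explicit setting
# if there is one, otherwise the charmap of the current locale.
input_encoding() {
    if [ -n "${NROFF_INPUT_ENCODING-}" ]; then
        printf '%s\n' "$NROFF_INPUT_ENCODING"
    else
        locale charmap 2>/dev/null || true
    fi
}

# Print the filter to run before groff: "cat" when input and output encodings
# already agree (or are unknown), otherwise an iconv invocation.
conversion_command() {
    from=$(input_encoding)
    to=$(locale charmap 2>/dev/null || true)
    if [ -z "$from" ] || [ "$from" = "$to" ]; then
        echo cat
    else
        echo "iconv -f $from -t $to"
    fi
}

conversion_command
```

With the variable unset, the page is passed through untouched, which is the behaviour wanted for the non-localized directories.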

A quick-and-dirty fix to the problem is to add "C", "en_*", and other similar language settings to the set of languages for which no conversion should be done:

--- /usr/bin/nroff.orig 2003-08-02 16:53:04.000000000 -0400
+++ /usr/bin/nroff      2003-08-02 14:35:15.000000000 -0400
@@ -97,7 +97,7 @@
 case "${LC_ALL-${LC_CTYPE-${LANG}}}" in
   ru_*.UTF-8 )
     T=-Tnippon ;;
-  ja_JP.ujis | ja_JP.eucJP | ko_KR.eucKR | *.UTF-8)
+  ja_JP.ujis | ja_JP.eucJP | ko_KR.eucKR | *.UTF-8 | en_* | C)
     ;;
   *)
     CONV="`locale charmap 2>/dev/null`" ;;

Obviously, however, this is an incomplete fix.

Comment 2 Milan Kerslager 2003-08-05 22:17:34 UTC

The proper solution is to have all manpages in UTF-8. Please file a proper bug
report if RH has any.

(1) is wrong because most languages use locales with various default
encodings (e.g. cs_CZ has ISO-8859-2 and not UTF-8) and no one knows which encoding
is used inside a manpage.

Comment 3 Philip Spencer 2003-08-05 23:23:06 UTC

>The proper solution is to have all manpages in UTF-8. 

Absolutely not. At least - it may be a good idea for all RedHat-provided
manpages to be in UTF-8 for the sake of consistency, but to rely on this
would result in an operating system that is incompatible with the man pages of
any other non-RH software that may get installed. (Very few non-RH man pages are
in UTF-8.) If "man" and "nroff" are incapable of handling input files that are
not in UTF-8, they are broken.

>Please file a proper bug report if RH has any.

That's exactly what this is.

>languages use locales with various default encodings (e.g. cs_CZ has ISO-8859-2
>and not UTF-8) and no one knows which encoding is used inside a manpage.

Having a hard-coded assumption inside nroff that its input file is always UTF-8
is definitely wrong. 

The iconv patch should probably be backed out of nroff altogether and instead
the man program should be provided with a way to know the encoding of the man
pages in each directory (perhaps with a config file, or perhaps by the presence
of a .encoding file in each man directory which identifies the encoding of the
files in that directory) and should perform any desired conversion on them
first before passing them to nroff. Whatever "default" assumption man/nroff make
about unknown manpages should be configurable by the system administrator.
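
As a rough illustration of the .encoding idea, something like the following could run inside man before the page is handed to nroff. Everything here is hypothetical: the .encoding file convention, the MAN_DEFAULT_ENCODING variable, and the function name are assumptions made for this sketch, and the marker file is looked up in the directory that holds the page:

```shell
# Hypothetical convention: a man directory may hold a one-line ".encoding" file.
# MAN_DEFAULT_ENCODING is likewise an illustrative, admin-configurable fallback.
MAN_DEFAULT_ENCODING=${MAN_DEFAULT_ENCODING:-ISO-8859-1}

# Print the encoding to assume for the man page given as $1.
page_encoding() {
    dir=$(dirname "$1")
    if [ -r "$dir/.encoding" ]; then
        # Use the first line of the marker file, with whitespace stripped.
        head -n 1 "$dir/.encoding" | tr -d '[:space:]'
        echo
    else
        echo "$MAN_DEFAULT_ENCODING"
    fi
}
```

man would then run something like iconv -f "$(page_encoding "$page")" on the page itself and pass the result to an nroff that no longer does any conversion.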

Comment 4 Milan Kerslager 2003-08-05 23:55:29 UTC

A .encoding file could be cute, but it would be RH-specific.
It will not solve the problem, as the admin must still create the file (or recode
the manpages). But recoding is better, as what would you like to do when only a few
manpages are in a different encoding (and you do not want to touch the current
directory structure)?
The admin should recode all man pages he installs on the system.
Every other system has trouble when man pages are in various encodings, i.e. UTF-8
uniformity is the best solution for the future.
A switch for the input encoding inside nroff could solve your problem.

Comment 5 Philip Spencer 2003-08-06 04:42:32 UTC

> The admin should recode all man pages he installs on the system

You're joking, right? You seriously think a sysadmin should have to manually recode every manpage whenever installing a package from a tarball? That could be hundreds of manpages! And you seriously think a sysadmin should have to edit the SPEC file of every packaged rpm file, inserting scripts to recode the manpages before packaging?

> But recoding is better, as what would you like to do when only a few manpages are in a different encoding

It's not "a few" -- in the non-locale-specific directories, it's every single one! NONE of the non-localized manpages provided in RedHat 9 is in UTF-8. There are 76 which are not in plain text, and the vast majority of those are in ISO-8859-1. There are only a few that are in other encodings (a couple of perl manpages and the other ISO-8859-x manpages, primarily).

This presumably is also a fair reflection of the current state of other manpages out there. They fall into roughly three categories:

  (1) Legacy software -- almost all of these manpages are in ISO-8859-1.
  (2) The occasional piece of software that provides a manpage in another
      encoding but doesn't store it in a locale-specific directory.
  (3) Properly localized manpages--these are installed in different directories
      depending on the locale, and it's reasonable to expect these to all be in
      UTF-8. 

Therefore, it is reasonable behaviour for a "man" program to:

   - Assume manpages in /usr/share/man/man* are in ISO-8859-1
   - Assume manpages in /usr/share/man/<locale>/man* are in UTF-8
   - Allow a sysadmin to override these assumptions for exceptional
     files/directories.
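
The first two assumptions amount to a simple path test. A minimal sketch follows; the function name is made up, the /usr/share/man prefix is hard-coded for brevity, and the sysadmin overrides are left out:

```shell
# Guess the encoding of a man page from the directory it lives in.
# Only the two default assumptions are modelled; overrides are not handled.
guess_encoding() {
    case "$1" in
        /usr/share/man/man*/*)   echo ISO-8859-1 ;;  # non-localized pages
        /usr/share/man/*/man*/*) echo UTF-8 ;;       # /usr/share/man/<locale>/man*
        *)                       echo unknown ;;
    esac
}

guess_encoding /usr/share/man/man3/toupper.3
guess_encoding /usr/share/man/fr/man1/ls.1
```

The order of the patterns matters: the non-localized pattern has to be tried first, because the second pattern would otherwise be the only one consulted for paths it also happens to cover.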

Those overrides are few and far between, as compared to the work of converting everything in /usr/share/man/man* from ISO-8859-1 to UTF-8.

[ An installation with a lot of locally developed software in a non-ISO-8859-1 environment may have a few more of these exceptions to deal with, but I think it's a fair assumption that people writing software in non-English languages are much more familiar with localization issues and will much more quickly start putting their manpages into locale-specific directories than will developers of English language software.]

> uniformity is the best solution for the future.

The above strategy DOES permit a move to future uniformity: in the future, all manpages will be properly localized and stored in /usr/share/man/<locale>/man*.
Nothing will be left in /usr/share/man/man* except pure encoding-independent ASCII; everything else will be in /usr/share/man/<locale>/man*.

This strategy also preserves backward compatibility:

  - It works now, since now most manpages in /usr/share/man/man* are ISO-8859-1.
  - It will work in the future, since then those manpages will be pure ASCII
    and it doesn't matter what encoding you read them in.

Your suggested strategy, of having man assume everything everywhere is in UTF-8
and re-encoding all manpages which are not, will work in the future but will
not work now without a lot of extra effort on the part of system administrators everywhere.

Given the choice between a strategy which works both now and in the future,
or a strategy which will work in the future but requires a lot of extra effort to make it work now, I think the choice is clear.

Comment 6 Philip Spencer 2003-08-06 05:43:51 UTC

Created attachment 93421
A possible way to let man run directory-dependent iconv on manpage

Comment 7 Milan Kerslager 2003-08-06 07:46:22 UTC

Don't get me wrong, but with bash, sed, awk or perl this is a 4-line script (or 1
line).
As there is a 30-year default assumption that all manpages are ASCII, there
is no reason not to change this standard to UTF-8 (ASCII compatible) without
making another RH-ism that (probably) no other Unix has.
If you really need *every* man page marked with another encoding, you are free
to modify the man scripts to fit your needs.

But marking every manpage or making encoding-dependent directories is more
complicated than simply using iconv:

Before make install do:

for i in $(find man* -type f); do
  # WHATEVER is a placeholder for the actual source encoding
  iconv -f WHATEVER -t UTF-8 "$i" > "$i.bak" &&
    mv -f "$i.bak" "$i"
done

Comment 8 Chris Ricker 2003-08-06 12:03:36 UTC

For one possibly related issue (not sure -- I don't understand encoding well
enough), see Bug 99311 (not all will have access to this bug)

On RHL 9, the X man pages display
On Cambridge alpha / beta, many don't
The ones which don't display won't display b/c they contain lines like:

.\"
.\" Copyright <a9> 2001 Keith Packard, member of The XFree86 Project, Inc.

and the <a9> encoding of the copyright symbol kills iconv....

BTW, Milan, when I try your script, I get:

[kaboom@skuld kaboom]$ iconv -f WHATEVER -t UTF-8 xrandr.1x > xrandr.1x.foo
iconv: conversion from `WHATEVER' is not supported
[kaboom@skuld kaboom]$ 

It looks like you have to actually know the original encoding ahead of time?

Comment 9 Philip Spencer 2003-08-06 13:44:08 UTC

> As there is a 30-year default assumption that all manpages are ASCII

The problem is, THAT ASSUMPTION IS NOT VALID. Rather, the unwritten default
has been that manpages are ISO-8859-1. (The 76 non-ASCII manpages in the RH9
distribution testify to that). It shouldn't have been, but the fact is, it is.

Trying to treat them all as UTF-8 breaks all those existing manpages.
Treating these old man pages as ISO-8859-1, and treating new ones in the
proper locale-dependent directories as UTF-8, does not break anything.

So why not go that route? Yes, it's a simple script, but if you manage
a site with a lot of machines it's still extremely cumbersome (especially 
keeping track of which pages it's been run on and which it hasn't).

> and the <a9> encoding of the copyright symbol kills iconv

That's exactly the same issue. Hex a9 is the ISO-8859-1 encoding for
copyright. Another example of a manpage in ISO-8859-1.

> iconv -f WHATEVER -t UTF-8 xrandr.1x

You need "ISO-8859-1" instead of "WHATEVER".

> It looks like you have to actually know the original encoding ahead of time?

Yes, but this is almost certain to be ISO-8859-1, since that's been the
de facto standard for non-localized manpages for many years.
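
If you don't want to hard-code the guess, a crude heuristic is to check whether the page decodes cleanly as UTF-8 and fall back to ISO-8859-1 otherwise. This is only a sketch with a made-up function name. The fallback is an assumption, not a detection, and a Latin-1 file can occasionally happen to be valid UTF-8:

```shell
# Heuristic sketch: report UTF-8 if the file decodes cleanly as UTF-8
# (plain ASCII counts as UTF-8 too), otherwise assume ISO-8859-1.
sniff_encoding() {
    if iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1; then
        echo UTF-8
    else
        echo ISO-8859-1
    fi
}
```

For example, `iconv -f "$(sniff_encoding xrandr.1x)" -t UTF-8 xrandr.1x` would pick ISO-8859-1 for a page containing a bare a9 byte.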

Comment 10 Milan Kerslager 2003-08-06 14:35:00 UTC

WHATEVER means the encoding of the manpages in your tarball that you need to convert to UTF-8.
English is ASCII. ISO-8859-1 is German, French or so (Western Europe) and these
manpages must belong to the appropriate locale.
Copyright sign should be replaced by (C) to be backward compatible (for core
English manpages) with ASCII terminals and 7-bit transport.

Comment 11 Philip Spencer 2003-08-06 16:10:54 UTC

> ISO-8859-1 is German, French or so (Western Europe)

It's also what's currently used for core English language manpages. (It
shouldn't be, but it is).

> Copyright sign should be replaced by (C) to be backward compatible (for core
> English manpages) with ASCII terminals and 7-bit transport.

Actually it should be replaced by the nroff escape sequence \(co (which will
generate an appropriate copyright symbol in the current locale when the manpage
is typeset).
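
For a page that is known to be ISO-8859-1, the replacement could be done with a one-line filter. This is a sketch with a made-up function name; it must not be run on UTF-8 pages, where a9 is only the second byte of the copyright sign:

```shell
# Replace the raw ISO-8859-1 copyright byte (0xA9, octal 251) with the
# groff escape \(co. LC_ALL=C makes sed treat its input as plain bytes.
fix_copyright() {
    copyright_byte=$(printf '\251')
    LC_ALL=C sed "s/$copyright_byte/\\\\(co/g"
}

# Example use on one page:
#   fix_copyright < xrandr.1x > xrandr.1x.new
```

The result is pure ASCII, so it no longer matters which encoding man or nroff assume when reading it.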

But that aside, that's not what "backwards compatible" means. We're talking
about the iconv patch RedHat introduced into /usr/bin/nroff which has broken the
display of core English manpages. "Backwards compatible" means that this new
man/nroff software should still be able to handle manpages where the copyright
sign has not yet been replaced by \(co, instead of generating error messages
when you try to display those manpages.

Yes, over the coming years package authors SHOULD start writing manpages where
the core English manpage is in pure ASCII and the localized versions are in the
appropriate directories. They SHOULD replace the copyright symbol with \(co.
They SHOULD use appropriate nroff escapes for any non-ASCII symbols needed in the
core English manpage.

But the reality is: they HAVEN'T YET. So the question is, what should the
software do NOW, given the current reality?

The choices are: (a) have the software work both for these existing manpages
and for the ones to come in the future, or (b) have it fail with error messages
on these existing manpages and only work for the ones to come in the future.

My vote is (a). That's what backwards compatibility is all about. See the
attached patch for one possible way to accomplish this.

Comment 12 Stephen John Smoogen 2004-02-19 22:35:28 UTC

Still broke in Enterprise. Moving to that chain.

Comment 18 Red Hat Bugzilla 2006-04-26 21:14:45 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2006-0378.html
