Description of problem:
If LANG is set to C for any reason, man pages are not formatted properly. This can be seen with the following:

LANG=C man toupper
iconv: illegal input sequence at position 1311

It dies because there is a UTF-8 character in the page. Tracking down the problem showed that man searches for nroff before groff, and the nroff command has the following line in it:

iconv -f UTF-8 -t $CONV ${1+"$@"} | PATH=$GROFF_BIN_PATH:$PATH groff -mtty-char $T $opts

iconv does not suppress warnings and dies when it finds a character it can't understand. LANG=C translates that line to:

iconv -f UTF-8 -t ANSI_X3.4-1968

which dies. Adding -c (iconv -c -f ...) allows the man page to be formatted correctly.

Version-Release number of selected component (if applicable):
[smooge@glasya smooge]$ rpm -q groff man glibc-common
groff-1.18.1-20
man-1.5k-6
glibc-common-2.3.2-24.9

How reproducible:
100%
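The failure can be reproduced without man at all. The following is only a sketch: the sample line is made up for illustration (it is not taken from the toupper page), and the function names are hypothetical. It feeds a single non-ASCII byte to the same iconv conversion that nroff runs under LANG=C.

```shell
#!/bin/sh
# Stand-in for a manpage line containing one non-ASCII byte (0xA9, octal 251).
# The text is illustrative only.
sample_line() {
    printf 'Copyright \251 2001\n'
}

# The conversion nroff ends up running under LANG=C.
# iconv stops at the byte because it is not valid UTF-8.
strict_convert() {
    sample_line | iconv -f UTF-8 -t ANSI_X3.4-1968
}

# The same conversion with -c, which asks iconv to drop what it cannot convert.
lenient_convert() {
    sample_line | iconv -c -f UTF-8 -t ANSI_X3.4-1968
}

if strict_convert >/dev/null 2>&1; then
    echo "strict conversion succeeded"
else
    echo "strict conversion failed"
fi
```

Running lenient_convert instead should let the line through with the offending byte removed, which matches the behaviour described above for iconv -c.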
We see the same problem with LANG=en_US (which is our system's default setting; we cannot use en_US.UTF-8 as the default because many people log in to the system from terminals that are not Unicode compatible).

Two comments to add:

(1) It's not that the man page contains a UTF-8 character -- rather, it's that the manpage contains a non-ASCII character which is ISO-8859-1 and NOT UTF-8.

(2) Adding -c to iconv does not completely solve the problem -- it causes the offending character to be omitted altogether, instead of being displayed properly.

That line you mentioned in /usr/bin/gnroff is a patch added by Red Hat to the standard GNU nroff, and this patch is the cause of the problems. It was presumably intended to deal with the fact that the localized manpages (in /usr/share/man/<locale>/man1 etc.) are in UTF-8 format. It attempts to convert the manpage from UTF-8 to the current encoding before processing it.

The problem is that none of the ordinary manpages are in UTF-8. Most are plain ASCII, but many (76 on our system) include non-ASCII ISO-8859-1 characters. The nroff script fails on all such manpages because it treats them as if they were UTF-8.

The solution as I see it would be as follows:

(1) nroff should not have a hard-coded assumption that its input file is in UTF-8. It should assume its input file is in the encoding of the current locale unless told otherwise by some new flag or environment variable.

(2) The man program should, if it finds a page in one of the localized directories, either run iconv itself or pass this new flag to nroff. Otherwise, if the man program finds the page in a standard non-localized directory, it should pass it as is to nroff, and nroff should not attempt to convert it. (OR ... maybe it should pass nroff a flag telling it the input is ISO-8859-1; that way, if the user's locale is UTF-8, it will get correctly converted to UTF-8.)
A quick-and-dirty fix to the problem is to add "C", "en_*", and other similar language settings to the set of languages for which no conversion should be done:

--- /usr/bin/nroff.orig 2003-08-02 16:53:04.000000000 -0400
+++ /usr/bin/nroff 2003-08-02 14:35:15.000000000 -0400
@@ -97,7 +97,7 @@
 case "${LC_ALL-${LC_CTYPE-${LANG}}}" in
   ru_*.UTF-8 )
     T=-Tnippon ;;
-  ja_JP.ujis | ja_JP.eucJP | ko_KR.eucKR | *.UTF-8)
+  ja_JP.ujis | ja_JP.eucJP | ko_KR.eucKR | *.UTF-8 | en_* | C)
     ;;
   *)
     CONV="`locale charmap 2>/dev/null`" ;;

Obviously, however, this is an incomplete fix.
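To see which locales the quick fix actually covers, the patched pattern list can be pulled out into a small test function. This is only a sketch: needs_conversion is a hypothetical name, and it assumes (as the description of the patch suggests) that nroff runs iconv exactly when the locale falls through to the final `*)` branch.

```shell
#!/bin/sh
# Return 0 if the patched nroff would still run iconv for this locale,
# 1 if the locale is on the "no conversion" list.
# ru_*.UTF-8 is not listed separately because it already matches *.UTF-8 here.
needs_conversion() {
    case "$1" in
        ja_JP.ujis | ja_JP.eucJP | ko_KR.eucKR | *.UTF-8 | en_* | C)
            return 1 ;;
        *)
            return 0 ;;
    esac
}

# Report the outcome for a few sample locales.
for loc in C en_US en_US.UTF-8 cs_CZ de_DE; do
    if needs_conversion "$loc"; then
        echo "$loc: iconv -f UTF-8 still runs"
    else
        echo "$loc: no conversion"
    fi
done
```

Locales such as cs_CZ or de_DE still go through the UTF-8 conversion, so an ISO-8859-1 page would presumably still fail there, which is one reason the fix is incomplete.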
The proper solution is to have all manpages in UTF-8. Please file a proper bug report if RH has any. Point (1) is wrong because most languages use locales with various default encodings (e.g. cs_CZ has ISO-8859-2 and not UTF-8), and no one knows which encoding is used inside a manpage.
> The proper solution is to have all manpages in UTF-8.

Absolutely not. At least -- it may be a good idea for all Red Hat-provided manpages to be in UTF-8 for the sake of consistency, but to rely on this would result in an operating system that is incompatible with the man pages of any other non-RH software that may get installed. (Very few non-RH man pages are in UTF-8.) If "man" and "nroff" are incapable of handling input files that are not in UTF-8, they are broken.

> Please file a proper bug report if RH has any.

That's exactly what this is.

> languages use locales with various default encodings (e.g. cs_CZ has ISO-8859-2
> and not UTF-8), and no one knows which encoding is used inside a manpage.

Having a hard-coded assumption inside nroff that its input file is always UTF-8 is definitely wrong. The iconv patch should probably be backed out of nroff altogether. Instead, the man program should be given a way to know the encoding of the man pages in each directory (perhaps with a config file, or perhaps by the presence of a .encoding file in each man directory which identifies the encoding of the files in that directory) and should perform any desired conversion on them before passing them to nroff. Whatever "default" assumption man/nroff make about unknown manpages should be configurable by the system administrator.
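A minimal sketch of the .encoding idea follows. Everything in it is an assumption made for illustration: the helper name page_encoding, the one-line file format (just the encoding name), and the ISO-8859-1 fallback. No existing man implementation is implied.

```shell
#!/bin/sh
# Hypothetical helper: print the encoding declared for the directory that
# holds a manpage. Looks for a ".encoding" file next to the page and falls
# back to the second argument (default ISO-8859-1) if there is none.
page_encoding() {
    page=$1
    fallback=${2:-ISO-8859-1}
    dir=$(dirname "$page")
    if [ -r "$dir/.encoding" ]; then
        # read trims surrounding whitespace; "|| true" tolerates a file
        # without a trailing newline
        read -r enc < "$dir/.encoding" || true
        printf '%s\n' "$enc"
    else
        printf '%s\n' "$fallback"
    fi
}
```

man could then run something along the lines of `iconv -f "$(page_encoding "$page")" -t "$(locale charmap)" "$page"` before handing the result to nroff, with the fallback taken from a site-wide setting so that the administrator controls the default.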
A .encoding file could be cute, but it is RH-specific. You will not solve the problem, as the admin must still create a file (or recode the manpages). But recoding is better than what you would like to do when only a few manpages are in a different encoding (and you do not want to touch the current directory structure). The admin should recode all man pages he installs on the system. Every other system has trouble when man pages are in various encodings, i.e. UTF-8 uniformity is the best solution for the future. A switch for the input encoding inside nroff could solve your problem.
> The admin should recode all man pages he installs on the system

You're joking, right? You seriously think a sysadmin should have to manually recode every manpage whenever installing a package from a tarball? That could be hundreds of manpages! And you seriously think a sysadmin should have to edit the SPEC file of every packaged rpm file, inserting scripts to recode the manpages before packaging?

> But recoding is better than what you would like to do when only a few manpages are in a different encoding

It's not "a few" -- in the non-locale-specific directories, it's every single one! NONE of the non-localized manpages provided in Red Hat 9 is in UTF-8. There are 76 which are not plain ASCII, and the vast majority of those are in ISO-8859-1. There are only a few that are in other encodings (a couple of perl manpages and the other ISO-8859-x manpages, primarily).

This presumably is also a fair reflection of the current state of other manpages out there. They fall into roughly three categories:

(1) Legacy software -- almost all of these manpages are in ISO-8859-1.
(2) The occasional piece of software that provides a manpage in another encoding but doesn't store it in a locale-specific directory.
(3) Properly localized manpages -- these are installed in different directories depending on the locale, and it's reasonable to expect these to all be in UTF-8.

Therefore, it is reasonable behaviour for a "man" program to:

- Assume manpages in /usr/share/man/man* are in ISO-8859-1
- Assume manpages in /usr/share/man/<locale>/man* are in UTF-8
- Allow a sysadmin to override these assumptions for exceptional files/directories.

Those overrides are few and far between, as compared to the work of converting everything in /usr/share/man/man* from ISO-8859-1 to UTF-8.
[ An installation with a lot of locally developed software in a non-ISO-8859-1 environment may have a few more of these exceptions to deal with, but I think it's a fair assumption that people writing software in non-English languages are much more familiar with localization issues and will much more quickly start putting their manpages into locale-specific directories than will developers of English language software. ]

> uniformity is the best solution for the future.

The above strategy DOES permit a move to future uniformity: in the future, all manpages will be properly localized and stored in /usr/share/man/<locale>/man*. Nothing will be left in /usr/share/man/man* except pure encoding-independent ASCII; everything else will be in /usr/share/man/<locale>/man*.

This strategy also preserves backward compatibility:

- It works now, since now most manpages in /usr/share/man/man* are ISO-8859-1.
- It will work in the future, since then those manpages will be pure ASCII and it doesn't matter what encoding you read them in.

Your suggested strategy, of having man assume everything everywhere is in UTF-8 and re-encoding all manpages which are not, will work in the future but will not work now without a lot of extra effort on the part of system administrators everywhere.

Given the choice between a strategy which works both now and in the future, or a strategy which will work in the future but requires a lot of extra effort to make it work now, I think the choice is clear.
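The directory-based defaults proposed in this comment can be sketched as a small path classifier. The function name guess_encoding and the MAN_ENCODING_OVERRIDE variable are both invented for this illustration; the variable merely stands in for whatever override mechanism an administrator would be given.

```shell
#!/bin/sh
# Hypothetical helper: guess a manpage's encoding from its install path.
# MAN_ENCODING_OVERRIDE is an invented variable representing the
# administrator's override; it is not a setting of any real man program.
guess_encoding() {
    if [ -n "${MAN_ENCODING_OVERRIDE:-}" ]; then
        printf '%s\n' "$MAN_ENCODING_OVERRIDE"
        return 0
    fi
    case "$1" in
        */man/man*/*)   echo ISO-8859-1 ;;  # non-localized: legacy default
        */man/*/man*/*) echo UTF-8 ;;       # localized: <locale>/manN
        *)              echo ISO-8859-1 ;;  # unknown layout: treat as legacy
    esac
}

guess_encoding /usr/share/man/man1/ls.1
guess_encoding /usr/share/man/cs/man1/ls.1
```

The order of the patterns matters: the non-localized pattern is tested first, and a localized path never contains "/man/man", so it falls through to the second branch.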
Created attachment 93421
A possible way to let man run directory-dependent iconv on manpage
Don't get me wrong, but with bash, sed, awk or perl this is a 4-line script (or 1 line). Since for 30 years the default assumption has been that all manpages are ASCII, there is no reason not to change this standard to UTF-8 (which is ASCII compatible) instead of creating another RH-ism that (probably) no other Unix has.

If you really need *every* man page marked with another encoding, you are free to modify the man scripts to fit your needs. But marking every manpage or making encoding-dependent directories is more complicated than simple use of iconv. Before make install do:

# WHATEVER is a placeholder for the actual source encoding of the pages
for i in $(find man* -type f); do
    iconv -f WHATEVER -t UTF-8 "$i" > "$i.bak" && mv -f "$i.bak" "$i"
done
For one possibly related issue (not sure -- I don't understand encoding well enough), see Bug 99311 (not all will have access to this bug).

On RHL 9, the X man pages display.
On Cambridge alpha / beta, many don't.

The ones which don't display won't display b/c they contain lines like:

.\"
.\" Copyright <a9> 2001 Keith Packard, member of The XFree86 Project, Inc.

and the <a9> encoding of the copyright symbol kills iconv....

BTW, Milan, when I try your script, I get:

[kaboom@skuld kaboom]$ iconv -f WHATEVER -t UTF-8 xrandr.1x > xrandr.1x.foo
iconv: conversion from `WHATEVER' is not supported
[kaboom@skuld kaboom]$

It looks like you have to actually know the original encoding ahead of time?
> As there is 30-years default with assumption that all manpages are ASCII

The problem is, THAT ASSUMPTION IS NOT VALID. Rather, the unwritten default has been that manpages are ISO-8859-1. (The 76 non-ASCII manpages in the RH9 distribution testify to that.) It shouldn't have been, but the fact is, it is. Trying to treat them all as UTF-8 breaks all those existing manpages. Treating these old man pages as ISO-8859-1, and treating new ones in the proper locale-dependent directories as UTF-8, does not break anything. So why not go that route?

Yes, it's a simple script, but if you manage a site with a lot of machines it's still extremely cumbersome (especially keeping track of which pages it's been run on and which it hasn't).

> and the <a9> encoding of the copyright symbol kills iconv

That's exactly the same issue. Hex a9 is the ISO-8859-1 encoding for copyright. Another example of a manpage in ISO-8859-1.

> iconv -f WHATEVER -t UTF-8 xrandr.1x

You need "ISO-8859-1" instead of "WHATEVER".

> It looks like you have to actually know the original encoding ahead of time?

Yes, but this is almost certain to be ISO-8859-1, since that's been the de facto standard for non-localized manpages for many years.
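For anyone who wants to check their own system, here is a rough sketch of how the non-ASCII pages can be listed. The function names are hypothetical, and the sketch assumes the pages are stored uncompressed; gzipped pages would all be reported, since compressed data is full of high-bit bytes.

```shell
#!/bin/sh
# Succeed if the file contains at least one byte outside 7-bit ASCII.
# tr deletes every byte in the range 0-127; anything left over is non-ASCII.
has_non_ascii() {
    leftover=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c)
    [ "$((leftover))" -gt 0 ]
}

# Print every regular file under the given directory that has non-ASCII bytes.
list_non_ascii_pages() {
    find "$1" -type f | while IFS= read -r page; do
        if has_non_ascii "$page"; then
            printf '%s\n' "$page"
        fi
    done
}
```

Something like `list_non_ascii_pages /usr/share/man/man1` gives the candidates; each one then has to be looked at to decide whether it is ISO-8859-1 or something else.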
WHATEVER means the encoding of the manpages in your tarball that you need to convert to UTF-8. English is ASCII. ISO-8859-1 is German, French and so on (Western Europe), and these manpages must belong to the appropriate locale. The copyright sign should be replaced by (C) to be backward compatible (for core English manpages) with ASCII terminals and 7-bit transport.
> ISO-8859-1 is German, France or so (Western Europe)

It's also what's currently used for core English language manpages. (It shouldn't be, but it is.)

> Copyright sign should be replaced by (C) to be backward compatible (for core
> English manpages) with ASCII terminals and 7-bit transport.

Actually it should be replaced by the nroff macro \(co (which will generate an appropriate copyright symbol in the current locale when the manpage is typeset).

But that aside, that's not what "backwards compatible" means. We're talking about the iconv patch Red Hat introduced into /usr/bin/nroff which has broken the display of core English manpages. "Backwards compatible" means that this new man/nroff software should still be able to handle manpages where the copyright sign has not yet been replaced by \(co, instead of generating error messages when you try to display those manpages.

Yes, over the coming years package authors SHOULD start writing manpages where the core English manpage is in pure ASCII and the localized versions are in the appropriate directories. They SHOULD replace the copyright symbol with \(co. They SHOULD use appropriate nroff macros for any non-ASCII symbols needed in the core English manpage.

But the reality is: they HAVEN'T YET.

So the question is, what should the software do NOW, given the current reality? The choices are:

(a) have the software work both for these existing manpages and for the ones to come in the future, or
(b) have it fail with error messages on these existing manpages and only work for the ones to come in the future.

My vote is (a). That's what backwards compatibility is all about. See the attached patch for one possible way to accomplish this.
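For package authors who do want to make that change, the copyright replacement can be sketched as a one-line filter. The function name is hypothetical, and the filter is only safe on pages already known to be ISO-8859-1: in a UTF-8 page the byte 0xA9 also occurs inside multi-byte sequences, and this would corrupt them.

```shell
#!/bin/sh
# Replace the ISO-8859-1 copyright byte (0xA9, octal 251) with the
# escape \(co. Reads a page on stdin and writes the result to stdout.
# LC_ALL=C makes sed work on bytes rather than decoded characters.
latin1_copyright_to_co() {
    copyright_byte=$(printf '\251')
    LC_ALL=C sed "s/$copyright_byte/\\\\(co/g"
}
```

Usage would be along the lines of `latin1_copyright_to_co < xrandr.1x > xrandr.1x.new`, leaving a page that is pure ASCII as far as the copyright line is concerned.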
Still broke in Enterprise. Moving to that chain.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2006-0378.html