From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
Description of problem: The 'iconv' program (part of glibc-common) is croaking on the accented character in the name 'Nicolas Lichtmaier' in the comments of the man page for rand.3 and srand.3. I can read the man page on RH8.0 but not RH9.
Version-Release number of selected component (if applicable): glibc-common-2.3.2-27.9
How reproducible: Always
Steps to Reproduce: 1. man rand
Actual Results: iconv: illegal input sequence at position 1789 and the man page is not displayed.
Expected Results: man page should have been displayed.
Additional info:
  setenv LANG C
  man rand - fail
  setenv LANG en_US
  man rand - fail
(I haven't tried any other language besides these two)
That's a bug in the man page. It should be UTF-8 encoded, not ISO-8859-1 encoded. Then it will work in any locale whose charset contains the letter small a with acute (like en_US, de_DE, cs_CZ, en_US.UTF-8), but not in others (like C, ru_RU, etc.). Given that it is in a comment, I'd say the acute should go.
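A minimal sketch of why the conversion breaks, assuming the page stores the accented letter as the single ISO-8859-1 byte 0xE1 while the reader expects UTF-8 (or plain ASCII). The sample comment line and the helper name first_bad_offset are made up for illustration; this is not the real rand.3 source and not how iconv is implemented.

```python
# Hypothetical man page comment line; not the real rand.3 source.
text = '.\\" Modified by Nicol\u00e1s Lichtmaier'

latin1_bytes = text.encode("iso-8859-1")  # a-acute is the single byte 0xE1
utf8_bytes = text.encode("utf-8")         # a-acute is the two bytes 0xC3 0xA1


def first_bad_offset(data: bytes, charset: str):
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode(charset)
        return None
    except UnicodeDecodeError as exc:
        return exc.start


# The ISO-8859-1 bytes are rejected when read as UTF-8 or as ASCII;
# the UTF-8 bytes decode cleanly as UTF-8.
print(first_bad_offset(latin1_bytes, "utf-8"))
print(first_bad_offset(latin1_bytes, "ascii"))
print(first_bad_offset(utf8_bytes, "utf-8"))
```

The reported offset plays the same role as the "position" in the iconv message: it points at the first byte the converter cannot interpret in the charset it was told to expect.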
Here are further man(3) pages that generate the same error message, cutting off the man page at the error:
  DBD::mysql        iconv: illegal input sequence at position 42954
  DBI::Shell        iconv: illegal input sequence at position 9095
  QDial             iconv: illegal input sequence at position 12877
  Unicode::Collate  iconv: illegal input sequence at position 19426
  encrypt           iconv: illegal input sequence at position 115
  fclose            iconv: illegal input sequence at position 2275
  fcloseall         iconv: illegal input sequence at position 2279
  fflush            iconv: illegal input sequence at position 2275
  fontconfig        iconv: illegal input sequence at position 203
  fprintf           iconv: illegal input sequence at position 13854
  lockf             iconv: illegal input sequence at position 115
  logwtmp           iconv: illegal input sequence at position 115
  qdial             iconv: illegal input sequence at position 12877
  rand              iconv: illegal input sequence at position 1787
  setkey            iconv: illegal input sequence at position 115
  snprintf          iconv: illegal input sequence at position 13854
  sprintf           iconv: illegal input sequence at position 13854
  srand             iconv: illegal input sequence at position 1787
  strtok            iconv: illegal input sequence at position 1281
  strtok_r          iconv: illegal input sequence at position 1281
  tolower           iconv: illegal input sequence at position 1309
  toupper           iconv: illegal input sequence at position 1309
  updwtmp           iconv: illegal input sequence at position 115
  vfprintf          iconv: illegal input sequence at position 13854
  vprintf           iconv: illegal input sequence at position 13854
  vsnprintf         iconv: illegal input sequence at position 13854
  vsprintf          iconv: illegal input sequence at position 13854
There were other kinds of errors, but these were all the iconv errors. To reproduce, type 'man 3 <keyword>'.
There are lots of man page problems like this in at least sections 2 and 3 in RH9. Horking up the man pages for programmers has become a chronic annoyance for me with Red Hat releases. All the pages I've gone back and retested work okay if LANG=en_US.UTF-8. Sections 2 and 3 are for C programmers, so they must work if LANG=C (or LANG=POSIX, or LANG=XOPEN, or with LANG unset). Suggestions:
(1) Recheck all the man pages as part of standard testing. Just write a script that unsets LANG and runs a "man SECT" on each file in /usr/share/man/manSECT/, looking for "iconv" in stderr. Having "man" return a non-zero exit status if it fails would make this testing even easier. :-)
(2) Educate the benighted managers in marketing and product test. LANG=POSIX (or LANG=C or ...) is supposed to be the default, at least for testing. These (LANG=POSIX, etc.) are the *only* settings that have a well-defined and portable behavior. They're what programmers count on. Even if nothing else works, these need to, to let programmers work. The other stuff's fine, but it isn't required by the basic standards.
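The test script suggested in (1) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, page names are derived from file names by stripping a compression suffix and the section suffix, and PAGER=cat is set on the assumption that the installed man honors PAGER so that it never waits on an interactive pager.

```python
import os
import subprocess


def page_name(filename: str) -> str:
    """Turn 'rand.3.gz' into 'rand' (strip compression and section suffixes)."""
    for ext in (".gz", ".bz2"):
        if filename.endswith(ext):
            filename = filename[: -len(ext)]
    return filename.rsplit(".", 1)[0]


def iconv_failed(stderr_text: str) -> bool:
    """True if stderr carries an iconv complaint like the ones in this report."""
    return "iconv:" in stderr_text


def check_section(section: str, man_root: str = "/usr/share/man"):
    """Run 'man SECTION NAME' with LANG and LC_* unset for each page file.

    Returns the names of the pages whose stderr mentions iconv.
    """
    env = {k: v for k, v in os.environ.items()
           if k != "LANG" and not k.startswith("LC_")}
    env["PAGER"] = "cat"  # assumption: man honors PAGER
    failures = []
    for filename in sorted(os.listdir(os.path.join(man_root, "man" + section))):
        name = page_name(filename)
        result = subprocess.run(["man", section, name], env=env,
                                stdout=subprocess.DEVNULL,
                                stderr=subprocess.PIPE)
        if iconv_failed(result.stderr.decode("ascii", errors="replace")):
            failures.append(name)
    return failures
```

Calling check_section("3") would then list the section 3 pages that trip iconv when no locale is set. Scanning stderr is needed because, as noted above, the exit status of man cannot be relied on here.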
comment 3:
> LANG=POSIX (or LANG=C or ...) is supposed to be the default, at least for
> testing. These (LANG=POSIX, etc.) are the *only* settings that have a
> well-defined and portable behavior. They're what programmers count on.
No it's not. Even if we play devil's advocate and assume that the whole world speaks English and is happy to get English pages as the default, setting the locale to POSIX/C will still fail for some English man pages that have accented and other characters from ISO-8859-1 that aren't in ASCII (usually the authors' names, but not always). glibc interprets the POSIX/C locale as being a seven bit (ASCII only) locale, and certain functions/operations will reject strings that have the eighth bit set. Previous versions of glibc would let chars>127 slide in C/POSIX, but newer versions are more strict. Thus if we made C/POSIX the default, some applications (I'm talking about complicated apps, not shell utilities that don't differentiate between text and binary data) would lose functionality (display/accept accented characters, non-traditional punctuation, etc) even for users of English. The man pages will be filtered so that non-ASCII is either transliterated or converted to ASCII. Given that the behavior of man is to fall back to the "POSIX" locale version of a man page if a man page for the desired language is not available, the POSIX manpages should probably be encoded using a charset that is a subset of all charset/encodings in use (ASCII); while encoding in UTF-8 is the right thing to do as Jakub mentions, users of non-English terminals that can't display the entire Unicode charset will still want to see POSIX pages without having to change the charset/enc/font of their terminal.
comment 4:
> the POSIX manpages should probably be encoded using a charset
> that is a subset of all charset/encodings in use (ASCII)
By "encoded using ... ASCII", I mean that any manpage that is currently in the "POSIX" locale that uses non-ASCII should:
1) be copied and converted from ISO-8859-1 to UTF-8 into an en_US locale-specific man page area, much like how all the other non-English man pages are currently done, and
2) be "transliterated" for the POSIX locale. That is, convert the "acute a"s and the umlauts into plain ASCII "a" and "u" respectively.
Yes, I know that an umlaut and a "u" are entirely different things and this is going to upset some people who will get their name mangled, but...
a) Users prefer man pages in their native language first. But if they can't have that they'd rather have an English man page than no man page at all. A realistic example would be a Japanese user: if the man page isn't available in Japanese, they'd rather have the ASCII/POSIX version as the fallback than no page whatsoever. Currently, because the "fallback" pages are not in ASCII, they sometimes fail with an iconv error because the "a acutes" and the "umlauts" aren't representable in EUC-JP. (Later versions of RHL will use Unicode for CJK just like the Western European languages.)
b) Users can still see the "pristine source" man pages by setting their LANG to en_US, which is a Unicode locale in RHL 8 and beyond.
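The two steps above can be sketched as follows, assuming the source page really is ISO-8859-1. The function names are hypothetical, and the ASCII fallback here uses Unicode decomposition plus dropping of combining marks, which is a stand-in technique and not the glibc transliteration tables; characters without a decomposition come out as "?".

```python
import unicodedata


def to_utf8(latin1_page: bytes) -> bytes:
    """Step 1: re-encode an ISO-8859-1 page as UTF-8 for the en_US area."""
    return latin1_page.decode("iso-8859-1").encode("utf-8")


def to_ascii_fallback(latin1_page: bytes) -> bytes:
    """Step 2: produce an ASCII-only version for the POSIX locale.

    Accented letters are decomposed (NFKD) and their combining marks are
    dropped, so a-acute becomes 'a' and u-umlaut becomes 'u'.
    """
    text = latin1_page.decode("iso-8859-1")
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return stripped.encode("ascii", errors="replace")


sample = b"Nicol\xe1s M\xfcller"  # made-up sample bytes in ISO-8859-1
print(to_utf8(sample))
print(to_ascii_fallback(sample))
```

The fallback loses information by design, which is the name mangling conceded above; the UTF-8 copy keeps the original spelling for locales that can display it.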
Parts of the last two comments are right on the money. Parts confuse "language" with "locale". Here, I think, is the scoop, or at least a sketch thereof: LANG=POSIX does not "assume the whole world speaks English" any more than the marking "Andante" on a page of music assumes the whole world speaks Italian. Similarly, in C, "if" and "else" are keywords, not English -- though their syntax and semantics were probably influenced by Dennis Ritchie's native language. :-) The POSIX (and C, and XOPEN) standards specify the way one particular locale behaves. Just one, actually. Functions and utilities defined in those standards are required to behave in a specific way when LANG (and LC_*) are set to the values C, POSIX, XOPEN, or unset. These are all, approximately, synonyms -- and with good reason. Unless things have changed recently, all other locales remain undefined and implementation-dependent. When $LANG is set to en_US.UTF-8, Red Hat may, legitimately, have a different collation sequence from, say, Slackware, Microsoft, or IBM. I happen to think this isn't good, but it's still true. (Used to be, at least, that even the string describing the locale was implementation-defined; however, language_COUNTRY.CHARSET seems to be a de-facto standard.) Standardizing at least one behavior lets programmers write code whose behavior they can predict. (Indeed, if need be, the programmer can force that behavior, setting one of those locales within the program.) The choice of how to behave in this locale was made based on historical precedent: traditional Unix/C implementations. Yes, this assumes we're not using EUC-J. Yes, it assumes ISO-646IRV. No, it's not "everyone speaks English" chauvinism: it also assumes we're not running in EBCDIC or CDC-ASCII or using APL terminals. It wasn't a value judgement, it was a way to maximize portability of existing Unix code and programmers.
Anyway, once LANG is unset, "ls" is guaranteed to list files in ASCII collating sequence, and "ln -s [a-z]* subdir" will link files that start with lowercase letters, but not "Makefile", into the subdirectory. If a "complex app" is in one of these standards, it must behave conventionally and successfully with LANG unset. If a programmer works in one of the standard locales, his programs should behave in standard ways. If you create, say, a man page with unusual data, no matter what the contents are -- accented characters, binary, Swahili -- then folks who don't understand those data will just have to uiopreqwjopdfr 389y493tqh, but the utilities that process those pages should still run, and predictably, when LANG is unset. And, because even English words have diacriticals, though rarely, there should legitimately be provision for man pages in "English," which are picked up when the locale is en_*, just like man pages in French, or German, or Arabic. The ones that have only ASCII may be links to the default pages. What man, or col, (or nroff, or ...) do on these is, I think, undefined, though I don't have my dot-2 standard at hand. But "man", when LANG is unset, should work, no matter what, and must be tested in that locale. Even if no other locale works. Ditto for all the other standard utilities. That a C programmer should get only an error message from "man fclose", when LANG=C, seems like, oh, a bug. :-)
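As a small illustration of the ASCII collating behavior described above: Python's default string sort compares code points, and fnmatch.fnmatchcase treats [a-z] as a plain ASCII range, so for ASCII file names both happen to mirror what "ls" and the shell do in the C/POSIX locale. The file names are made up, and this only imitates the behavior; it does not call the C library's locale machinery.

```python
import fnmatch

# Made-up directory listing.
names = ["b.c", "Makefile", "a.c", "README", "zlib.h"]

# Code-point order: every uppercase ASCII letter sorts before any lowercase one.
c_locale_order = sorted(names)

# Names a C-locale shell would expand from the pattern [a-z]*
lowercase_only = [n for n in c_locale_order
                  if fnmatch.fnmatchcase(n, "[a-z]*")]

print(c_locale_order)
print(lowercase_only)
```

In a locale with dictionary-style collation, the same range pattern may also pick up uppercase names, which is the kind of unpredictability the standard locale avoids.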
> LANG=POSIX does not "assume the whole world speaks English"
True, but it does mean "everything except for ASCII is undefined" with respect to text data. Thus it is reasonable for non-trivial text handlers (such as groff, used by man) to balk and reject text data when they can't tell whether it's non-spacing, part of an escape sequence, one part of a multi-byte character, etc.
> Unless things have changed recently, all other locales remain undefined and
> implementation-dependent. When $LANG is set to en_US.UTF-8, RedHat may,
> legitimately, have a different collation sequence from, say, Slackware,
Things HAVE changed recently. LSB-i (formerly the LI18NUX spec) does exercise distros through conformance tests to make sure the collation behavior is the same for certain locales. Those distros that have a collation sequence that is different from what the LSB-i states for a given locale will flunk those I18N standards conformance tests. Current versions of Red Hat Linux have passed level 1 LSB-i/LI18NUX conformance.
> If you create, say, a man page with unusual data, [snip], but the utilities
> that process those pages should still run, and predictably, when LANG is unset.
> But "man", when LANG is unset should work, no matter what, and must be tested
> in that locale. Even if no other locale works.
> Ditto for all the other standard utilities.
You're confusing the data (man-pages) with the program(s) (man and groff). There's no standard that specifies that groff/man must work with any arbitrary data. Man/groff will reject data it can't understand within a certain context, especially because man pages are non-trivial marked up pages. I do not agree with the statement "Ditto for all the other standard utilities". How utilities deal with unknown/invalid data is either specified in specs or undefined.
There is no blanket requirement that when LANG is unset "standard utilities" should accept any arbitrary 8-bit data (which is what I interpret you mean with the phrase "should work, no matter what"). When in the POSIX locale ("LANG is unset"), I18N sensitive glibc functions, such as those in <ctype.h>, will fail/report-as-undefined for non-ASCII data; for example, iscntrl() will fail to identify the C1 control characters from 0x80 to 0x9F, and functions to determine the width of a character, crucial for groff word-wrapping, will not work.
> That a C programmer should get only an error message from "man fclose",
> when LANG=C, seems like, oh, a bug. :-)
I'm not claiming that the current state is correct, which is why this bug is assigned and not closed. As mentioned above, the man-page package will be corrected so that pages with non-ASCII will be ASCII-ized for the sake of graceful fallback, and will also be UTF-ed and moved into en for all pages that contain characters > 0x7E.
Believe it or not, I now think we're in complete agreement. At this point, I'm convinced that everywhere I (or you) thought we disagreed, one of us was misunderstanding what the other person was saying. The problem's in the data, and it's a bug.
*** Bug 103122 has been marked as a duplicate of this bug. ***
*** Bug 104542 has been marked as a duplicate of this bug. ***
*** Bug 104570 has been marked as a duplicate of this bug. ***
Even if all Red Hat's man pages are UTF-8, what happens when I install a third-party man page? Man should be able to recover from an illegal UTF-8 sequence, not treat it as a fatal error.
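A minimal sketch of the kind of recovery asked for here, under the assumption that a third-party page is either valid UTF-8 or legacy ISO-8859-1. The function names are hypothetical and this is not what man or iconv actually does; it only shows two tolerant alternatives to aborting.

```python
def decode_with_fallback(raw: bytes) -> str:
    """Try UTF-8 first; if the bytes are not valid UTF-8, read them as
    ISO-8859-1, which accepts every byte value."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, substituting U+FFFD for each invalid sequence
    instead of failing."""
    return raw.decode("utf-8", errors="replace")


legacy = b"Nicol\xe1s"                   # ISO-8859-1 bytes, invalid as UTF-8
modern = "Nicol\u00e1s".encode("utf-8")  # the same name in UTF-8

print(decode_with_fallback(legacy))
print(decode_with_fallback(modern))
print(decode_lenient(legacy))
```

The fallback variant guesses the legacy charset and can guess wrong for pages in other 8-bit encodings; the lenient variant never guesses but shows a replacement character where the bad byte was. Either way the rest of the page stays readable.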
ASCII transliteration will be done according to the glibc localedata with respect to LANG=en.
*** Bug 104991 has been marked as a duplicate of this bug. ***
Just a note that this is still a problem in Fedora Core 1, Test 3. RPM man-pages-1.60-4 (I'm guessing this is the same as Rawhide).
Sorry, forgot to mention that LANG is set to en_AU. With LANG=C, it goes away...
> Sorry, forgot to mention that LANG is set to en_AU. with LANG=C, it goes away...
This is not a man-page problem... for some reason the iconv logic is getting confused with respect to the locale "en_AU". Are you running in the ISO-8859-1 or the UTF-8 character set? (If you're running RHL 8.0 and beyond, en_AU should be UTF-8 by default.)