Red Hat Bugzilla – Bug 103214
'man rand' gives 'iconv: illegal input sequence at position 1789'
Last modified: 2007-04-18 12:57:11 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2.1) Gecko/20030225
Description of problem:
The 'iconv' program (part of glibc-common) is croaking on the accented character
in the name 'Nicolas Lichtmaier' in the comments of the man page for rand.3 and
I can read the man page on RH8.0 but not RH9.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
Actual Results: iconv: illegal input sequence at position 1789
and the man page is not displayed.
Expected Results: man page should have been displayed.
setenv LANG C
man rand - fail
setenv LANG en_US
man rand - fail
(I haven't tried any other language besides these two)
That's a bug in the man page. It should be UTF-8 encoded, not ISO-8859-1 encoded.
Then it will work in any locale whose charset contains the letter small a with acute
(like en_US, de_DE, cs_CZ, en_US.UTF-8), but not in others (like C, ru_RU, etc.).
Given that it is in a comment, I'd say the acute should go.
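To illustrate the encoding point above, here is a minimal Python sketch of why an ISO-8859-1 byte trips a UTF-8 decoder. The sample line is hypothetical: which letter of the name carries the acute is an assumption, and the helper name first_invalid_utf8_offset is invented for this example. It is not how iconv itself is implemented, only the same kind of check.

```python
# Hypothetical sample: a name with "small a with acute", stored as ISO-8859-1.
# In that encoding the accented letter is the single byte 0xE1.
latin1_line = "Nicol\u00e1s Lichtmaier".encode("iso-8859-1")


def first_invalid_utf8_offset(data: bytes):
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start
    return None


# 0xE1 announces a three-byte UTF-8 sequence, but the next byte is a plain "s",
# so decoding stops at the accented letter.
print(first_invalid_utf8_offset(latin1_line))

# The same text encoded as UTF-8 decodes cleanly.
print(first_invalid_utf8_offset("Nicol\u00e1s Lichtmaier".encode("utf-8")))
```

Run over the raw bytes of a page, a helper like this reports an offset comparable in spirit to the "position N" in the iconv message, though the exact number depends on what man feeds to iconv.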
Here are further man(3) pages that generate the same error message, cutting off
the man page at the error:
DBD::mysql iconv: illegal input sequence at position 42954
DBI::Shell iconv: illegal input sequence at position 9095
QDial iconv: illegal input sequence at position 12877
Unicode::Collate iconv: illegal input sequence at position 19426
encrypt iconv: illegal input sequence at position 115
fclose iconv: illegal input sequence at position 2275
fcloseall iconv: illegal input sequence at position 2279
fflush iconv: illegal input sequence at position 2275
fontconfig iconv: illegal input sequence at position 203
fprintf iconv: illegal input sequence at position 13854
lockf iconv: illegal input sequence at position 115
logwtmp iconv: illegal input sequence at position 115
qdial iconv: illegal input sequence at position 12877
rand iconv: illegal input sequence at position 1787
setkey iconv: illegal input sequence at position 115
snprintf iconv: illegal input sequence at position 13854
sprintf iconv: illegal input sequence at position 13854
srand iconv: illegal input sequence at position 1787
strtok iconv: illegal input sequence at position 1281
strtok_r iconv: illegal input sequence at position 1281
tolower iconv: illegal input sequence at position 1309
toupper iconv: illegal input sequence at position 1309
updwtmp iconv: illegal input sequence at position 115
vfprintf iconv: illegal input sequence at position 13854
vprintf iconv: illegal input sequence at position 13854
vsnprintf iconv: illegal input sequence at position 13854
vsprintf iconv: illegal input sequence at position 13854
There were other kinds of errors, but these were all the iconv errors. To
reproduce, type 'man 3 <keyword>'.
There are lots of man page problems like this in at least sections 2 and 3
in RH9. Horking up the man pages for programmers has become a chronic
annoyance for me with Red Hat releases.
All the pages I've gone back and retested work okay if
LANG=en_US.UTF-8. Sections 2 and 3 are for C programmers, so they
must work if LANG=C (or LANG=POSIX, or LANG=XOPEN, or with LANG unset).
(1) recheck all the man pages as part of standard testing
Just write a script that unsets LANG and runs a "man SECT" on each
file in /usr/share/man/manSECT/, looking for "iconv" in stderr.
Having "man" return a non-zero exit status if it fails would make
this testing even easier. :-)
(2) Educate the benighted managers in marketing and product test.
LANG=POSIX (or LANG=C or ...) is supposed to be the default, at least for
testing. These (LANG=POSIX, etc.) are the *only* settings that have a
well-defined and portable behavior. They're what programmers count on.
Even if nothing else works, these need to work
to let programmers work. The other stuff's fine, but it isn't required
by the basic standards.
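Suggestion (1) above could be sketched roughly as follows. This is only an illustration under assumptions: the function names (page_name, run_man, failing_pages) are made up here, the suffix handling covers only .gz and .bz2, and since man may not return a useful exit status the check relies on the word "iconv" showing up in stderr, as the comment proposes.

```python
import os
import subprocess
from pathlib import Path


def page_name(path: Path) -> str:
    """Strip compression and section suffixes: 'rand.3.gz' -> 'rand'."""
    name = path.name
    for suffix in (".gz", ".bz2"):
        if name.endswith(suffix):
            name = name[: -len(suffix)]
    return name.rsplit(".", 1)[0]


def run_man(section: str, name: str) -> str:
    """Run 'man SECT NAME' with LANG and LC_* unset and return its stderr."""
    env = {
        key: value
        for key, value in os.environ.items()
        if key not in ("LANG", "LANGUAGE") and not key.startswith("LC_")
    }
    result = subprocess.run(
        ["man", section, name],
        env=env,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        errors="replace",
    )
    return result.stderr


def failing_pages(section: str, man_root: str = "/usr/share/man", runner=run_man):
    """Return the names of pages whose stderr mentions iconv."""
    directory = Path(man_root) / f"man{section}"
    failures = []
    for path in sorted(directory.iterdir()):
        name = page_name(path)
        if "iconv" in runner(section, name):
            failures.append(name)
    return failures
```

Calling failing_pages("3") would then walk /usr/share/man/man3; the runner parameter exists only so the loop can be exercised without invoking man.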
> LANG=POSIX (or LANG=C or ...) is supposed to be the default, at least for
> testing. These (LANG=POSIX, etc.) are the *only* settings that have a
> well-defined and portable behavior. They're what programmers count on.
No it's not. Even if we play devil's advocate and assume that the whole world
speaks English and is happy to get English pages as the default, setting the
locale to POSIX/C will still fail for some English man pages that have accent
and other characters from ISO-8859-1 that aren't in ASCII (usually the author's
names, but not always). glibc interprets the POSIX/C locale as being a seven bit
(ASCII only) locale, and certain functions/operations will reject strings that
have the eighth bit set. Previous versions of glibc would let chars>127 slide in
C/POSIX, but newer versions are more strict.
Thus if we made C/POSIX the default, some applications (I'm talking about
complicated apps, not shell utilities that don't differentiate between text and
binary data) would lose functionality (display/accept accented characters,
non-traditional punctuation, etc) even for users of English.
The man pages will be filtered so that non-ASCII is either transliterated or
converted to ASCII. Given that the behavior of man is to fall back to the
"POSIX" locale version of a man page if a man page for the desired language is
not available, the POSIX manpages should probably be encoded using a charset
that is a subset of all charset/encodings in use (ASCII); while encoding in
UTF-8 is the right thing to do as Jakub mentions, users of non-English terminals
that can't display the entire Unicode charset will still want to see POSIX pages
without having to change the charset/enc/font of their terminal.
> the POSIX manpages should probably be encoded using a charset
> that is a subset of all charset/encodings in use (ASCII)
By "encoded using ... ASCII", I mean that any manpage that is currently in the
"POSIX" locale that uses non-ASCII should:
1) be copied and converted from ISO-8859-1 to UTF-8 into an en_US locale-specific
man page area, much like how all the other non-English man pages are currently done
2) be "transliterated" for the POSIX locale. That is, convert the "acute a"s and
the umlauts into plain ASCII "a" and "u" respectively. Yes, I know that an
umlaut and a "u" are entirely different things and this is going to upset some
people who will get their name mangled, but...
a) Users prefer man pages in their native language first. But if they can't have
that they'd rather have an English man page than no man page at all. A realistic
example would be a Japanese user: if the man page isn't available in Japanese,
they'd rather have the ASCII/POSIX version as the fallback than no page
whatsoever. Currently, because the "fallback" pages are not in ASCII, they
sometimes fail with an iconv error because the "a acutes" and the "umlauts"
aren't representable in EUC-JP. (Later versions of RHL will use Unicode for CJK
just like the Western European languages)
b) Users can still see the "pristine source" man pages by setting their LANG to
en_US, which is a Unicode locale in RHL 8 and beyond.
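The transliteration in step 2) can be approximated in a few lines of Python. This is a sketch only: it uses Unicode decomposition from the standard library rather than the glibc tables, so the results may differ from what glibc produces, and the helper name to_ascii is invented for the example.

```python
import unicodedata


def to_ascii(text: str) -> str:
    """Drop diacritics so that accented letters fall back to their base letter."""
    # NFKD splits "a with acute" into "a" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Remove the combining marks, keeping the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Anything still outside ASCII (no decomposition available) becomes "?".
    return stripped.encode("ascii", "replace").decode("ascii")


print(to_ascii("Nicol\u00e1s"))   # a with acute -> a
print(to_ascii("M\u00fcller"))    # u with umlaut -> u
```

As noted above, this mangles names, which is the accepted cost of a fallback page that every terminal can display.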
Parts of the last two comments are right on the money. Parts
confuse "language" with "locale".
Here, I think, is the scoop, or at least a sketch thereof:
LANG=POSIX does not "assume the whole world speaks English" any more than the
marking "Andante" on a page of music assumes the whole world speaks Italian.
Similarly, in C, "if" and "else" are keywords, not English -- though
their syntax and semantics were probably influenced by Dennis Ritchie's native language.
The POSIX (and C, and XOPEN) standards specify the way one particular locale behaves.
Just one, actually.
Functions and utilities defined in those standards are required to behave in a
specific way when LANG (and LC_*) are set to the values C, POSIX, XOPEN, or
unset. These are all, approximately, synonyms -- and with good reason.
Unless things have changed recently, all other locales remain undefined and
implementation-dependent. When $LANG is set to en_US.UTF-8, RedHat may,
legitimately, have a different collation sequence from, say, Slackware,
Microsoft, or IBM. I happen to think this isn't good, but it's still true.
(Used to be, at least, that even the string describing the locale was
implementation-defined; however, language_COUNTRY.CHARSET seems to be a
Standardizing at least one behavior lets programmers write code whose behavior
they can predict. (Indeed, if need be, the programmer can force that behavior,
setting one of those locales within the program.)
The choice of how to behave in this locale was made based on historical
precedent: traditional Unix/C implementations.
Yes, this assumes we're not using EUC-J. Yes it assumes ISO-646IRV. No, it's
not "everyone speaks English" chauvinism: it also assumes we're not running in
EBCDIC or CDC-ASCII or using APL terminals. It wasn't a value judgement, it
was a way to maximize portability of existing Unix code and programmers.
Anyway, once LANG is unset, "ls" is guaranteed to list files in ASCII collating
sequence, and "ln -s [a-z]* subdir" will link files that start with lowercase
letters, but not "Makefile", into the subdirectory.
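For readers who want to see the ASCII-collation behavior concretely, here is a small Python sketch. It does not call ls or the shell; Python's default string ordering and fnmatchcase compare by code point, which for ASCII names gives the same result as the behavior described above. The file names are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical directory listing.
names = ["main.c", "Makefile", "README", "a.out"]

# Code point order: every uppercase ASCII letter sorts before any lowercase one.
print(sorted(names))

# "[a-z]*" as a code point range matches only names starting with a lowercase letter.
lowercase_first = [name for name in names if fnmatchcase(name, "[a-z]*")]
print(lowercase_first)
```

In a locale with dictionary-style collation, the shell range [a-z] may also match uppercase letters, which is exactly the unpredictability being described.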
If a "complex app" is in one of these standards, it must behave conventionally
and successfully with LANG unset. If a programmer works in one of the standard
locales, his programs should behave in standard ways.
If you create, say, a man page with unusual data, no matter what the contents
are -- accented characters, binary, Swahili -- then folks who don't understand
those data will just have to uiopreqwjopdfr 389y493tqh, but the utilities that
process those pages should still run, and predictably, when LANG is unset.
And, because even English words have diacriticals, though rarely, there should
legitimately be provision for man pages in "English," which are picked up when
the locale is en_*, just like man pages in French, or German, or Arabic. The
ones that have only ASCII may be links to the default pages. What man, or col,
(or nroff, or ...) do on these is, I think, undefined, though I don't have my
dot-2 standard at hand.
But "man", when LANG is unset should work, no matter what, and must be tested
in that locale. Even if no other locale works.
Ditto for all the other standard utilities.
That a C programmer should get only an error message from "man fclose",
when LANG=C, seems like, oh, a bug. :-)
> LANG=POSIX does not "assume the whole world speaks English"
True, but it does mean "everything except for ASCII is undefined" with respect
to text data. Thus it is reasonable for non-trivial text handlers (such as
groff, used by man) to balk at and reject text data when they can't tell whether it is
non-spacing, part of an escape sequence, one part of a multi-byte character, etc.
> Unless things have changed recently, all other locales remain undefined and
> implementation-dependent. When $LANG is set to en_US.UTF-8, RedHat may,
> legitimately, have a different collation sequence from, say, Slackware,
Things HAVE changed recently. LSB-i (formerly the LI18NUX spec) does exercise
distros through conformance tests to make sure the collation behavior is the
same for certain locales. Those distros that have a collation sequence that is
different from what the LSB-i states for a given locale will flunk those I18N
standards conformance tests.
Current versions of Red Hat Linux have passed level 1 LSB-i/LI18NUX conformance.
> If you create, say, a man page with unusual data, [snip], but the utilities
> that process those pages should still run, and predictably, when LANG is unset.
> But "man", when LANG is unset should work, no matter what, and must be tested
> in that locale. Even if no other locale works.
> Ditto for all the other standard utilities.
You're confusing the data (man-pages) with the program(s) (man and groff).
There's no standard that specifies that groff/man must work with any arbitrary
data. Man/groff will reject data it can't understand within a certain context,
especially because man pages are non-trivial marked-up pages.
I do not agree with the statement "Ditto for all the other standard utilities".
How utilities deal with unknown/invalid data is either specified in specs or
undefined. There is no blanket requirement that when LANG is unset "standard
utilities" should accept any arbitrary 8-bit data (which is what I interpret you
mean with the phrase "should work, no matter what").
When in the POSIX locale ("LANG is unset"), I18N sensitive glibc functions, such
as those in <ctype.h>, will fail/report-as-undefined for non-ASCII data. For
example, iscntrl() will fail to identify the C1 control characters from 0x80 to
0x9F, and the functions that determine the width of a character, which are crucial
for groff word-wrapping, will not work.
> That a C programmer should get only an error message from "man fclose",
> when LANG=C, seems like, oh, a bug. :-)
I'm not claiming that the current state is correct, which is why this bug is
assigned and not closed. As mentioned above, the man-page package will be
corrected so that pages with non-ASCII will be ASCII-ized for the sake of
graceful fallback, and will also be UTF-ed and moved into en for all pages that
contain characters > 0x7E.
Believe it or not, I now think we're in complete agreement.
At this point, I'm convinced that
everywhere that I (or you) thought we disagreed, one of us was
misunderstanding what the other person was saying.
The problem's in the data, and it's a bug.
*** Bug 103122 has been marked as a duplicate of this bug. ***
*** Bug 104542 has been marked as a duplicate of this bug. ***
*** Bug 104570 has been marked as a duplicate of this bug. ***
Even if all Red Hat's man pages are UTF-8, what happens when I install a
third-party man page? Man should be able to recover from an illegal UTF-8
sequence, not treat it as a fatal error.
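As a rough picture of what "recover" could mean here, the following Python sketch decodes a page leniently, substituting a replacement character for each bad byte instead of aborting. It is not how man or iconv works internally; the function name decode_page and the sample bytes are assumptions for illustration.

```python
def decode_page(raw: bytes) -> str:
    """Decode as UTF-8, replacing undecodable bytes with U+FFFD instead of failing."""
    return raw.decode("utf-8", errors="replace")


# Hypothetical third-party page fragment containing one ISO-8859-1 byte (0xE1).
fragment = b"Nicol\xe1s Lichtmaier"

text = decode_page(fragment)
print(text.count("\ufffd"))   # number of bytes that had to be replaced
print(len(text))              # the rest of the line is still readable
```

The reader loses one character of an author's name rather than the whole page.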
ascii transliteration will be done according to glibc localedata with respect to
*** Bug 104991 has been marked as a duplicate of this bug. ***
Just a note that this is still a problem in Fedora Core 1, Test 3. RPM
man-pages-1.60-4 (I'm guessing this is the same as Rawhide).
Sorry, forgot to mention that LANG is set to en_AU. with LANG=C, it goes away...
> Sorry, forgot to mention that LANG is set to en_AU. with LANG=C, it goes away...
this is not a man-page problem... for some reason the iconv logic is getting
confused with respect to the locale "en_AU".
Are you running in ISO-8859-1 or the UTF-8 character set? (if you're running RHL
8.0 and beyond, en_AU should be UTF-8 by default)