It seems that the recommended way for an application to find out whether it
is OK to output UTF-8 encoded chars is to use the nl_langinfo(CODESET)
call. If a user wants to communicate that her setup interprets UTF-8 data
correctly (for example, xterm -u8 is used) the logical thing to do would be
to set LANG to, for example, sv_SE.UTF-8. In theory, that is. It turns out
that glibc seems to discard the codeset information if the given locale is
not encoded in that codeset. By default
a redhat-7 system returns ISO-8859-1 when nl_langinfo(CODESET) is called
even if LANG is set to sv_SE.UTF-8.
A workaround is to create a private locale with the right codeset
'localedef -v -c -i sv_SE -f UTF-8 $HOME/locales/sv_SE.UTF-8'
and then set LOCPATH for example like this:
[noa@nestor locales]$ LOCPATH=$HOME/locales LANG=sv_SE.UTF-8 ~/c/cs_test
Conclusion: the codeset part of LANG or LC_CTYPE/LC_ALL should be reflected in
what nl_langinfo(CODESET) returns, even if the locale in question is not
encoded in that codeset.
Actually it is not.
nl_langinfo(CODESET) tells you in which charset all other nl_langinfo
values are encoded. Ulrich Drepper said he'll actually change glibc today so that
it does not accept a locale setting like sv_SE.UTF-8 unless the sv_SE locale is
encoded in the UTF-8 character set (the other way around this would be translating
the locales on the fly, but that's expensive).
What you should basically do is first check the $OUTPUT_CHARSET variable
and if it is not given, try nl_langinfo(CODESET).
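As a rough sketch of that lookup order (the function name output_charset is
made up for this example and is not part of glibc; it assumes the program has
already called setlocale()), something like this might do:

```c
#include <langinfo.h>
#include <locale.h>
#include <stdlib.h>

/* Returns the charset name to use for output.
 * $OUTPUT_CHARSET wins when it is set and non-empty; otherwise the codeset
 * of the current locale is returned. Hypothetical helper, assumes that
 * setlocale() has already been called by the program. */
const char *output_charset(void)
{
    const char *env = getenv("OUTPUT_CHARSET");

    if (env != NULL && env[0] != '\0')
        return env;
    return nl_langinfo(CODESET);
}
```

Treating an empty $OUTPUT_CHARSET the same as an unset one is a choice made
for this sketch, not something the comment above prescribes.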
If the currently set locale uses the UTF-8 character encoding, then all standard
input/output/error communication, all file names, and all data in plaintext
files and pipes for which no other encoding is specified explicitly etc. should
be in UTF-8.
It is correct that the recommended way for an application to find out whether the
current locale uses UTF-8 is to use a test like
strcmp(nl_langinfo(CODESET), "UTF-8") == 0
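Note that nl_langinfo() only reflects LANG/LC_ALL once the program has called
setlocale(). A minimal sketch that wraps both steps (the name locale_is_utf8
is invented for this example) could look like:

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if the locale selected by the environment uses UTF-8, else 0.
 * Also returns 0 when setlocale() rejects the requested locale, because the
 * process then stays in whatever locale it was in before.
 * Hypothetical helper name. */
int locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return 0;
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Only LC_CTYPE is switched here since that is the category that determines the
character encoding; a program that also wants localized messages or number
formats would normally call setlocale(LC_ALL, "") instead.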
For details, please read
To test on the command line, which encoding you have set, use
If you set LANG=sv_SE.UTF-8 on a system where no "sv_SE.UTF-8" locale is
installed, then glibc will just silently stay in the "C" locale, unless the
program checked the return value of setlocale() properly. This is what happened
For instance SuSE Linux 7.1 installs by default the precompiled UTF-8 locales
de_DE.UTF-8 el_GR.UTF-8 en_GB.UTF-8 en_US.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8
hi_IN.UTF-8 ja_JP.UTF-8 ko_KR.UTF-8 mr_IN.UTF-8 ru_RU.UTF-8 vi_VN.UTF-8
in /usr/share/locale/. Use one of these.
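A program that wants to notice a missing locale instead of silently staying in
"C" has to look at what setlocale() returns. As a sketch only (the function
first_available_locale and the idea of passing a candidate list are
assumptions of this example), it could try a few names in order:

```c
#include <locale.h>
#include <stddef.h>

/* Tries each candidate in order and returns the first name that setlocale()
 * accepts, or NULL if none of them is installed. On success the process
 * locale has been switched to the returned candidate.
 * Hypothetical helper, not a glibc function. */
const char *first_available_locale(const char *const names[], size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (setlocale(LC_ALL, names[i]) != NULL)
            return names[i];
    }
    return NULL;
}
```

A caller might pass something like { "da_DK.UTF-8", "en_US.UTF-8" } and print
a warning when NULL comes back, rather than carrying on in the "C" locale
without telling the user.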
If your favourite locale is not among these (or your distribution still lacks
preinstalled UTF-8 locales), no problem:
Any non-root user can easily use for instance
localedef -v -c -i da_DK -f UTF-8 $HOME/local/locale/da_DK.UTF-8
to generate and activate a Danish UTF-8 locale. The root user can
easily add a new UTF-8 locale for all users via
localedef -v -c -i da_DK -f UTF-8 /usr/share/locale/da_DK.UTF-8
and if root wants to make da_DK.UTF-8 the system-wide default locale for every
user, then adding the line
into /etc/profile will do the trick.
If you start xterm (XFree86 4.0.2 or newer) after setting LANG to a UTF-8
locale, it will go automatically into UTF-8 mode.
If you have further questions on this matter, please consult the experts on the
linux-utf8 mailing list:
The bug report speaks about behaviour when the requested locale is not
present in the system, and so far glibc does not fall back to
"C" when it cannot find the proper charset:
char *l = setlocale(LC_ALL, "cs_CZ.UTF-8");
printf ("%s %s\n", l, nl_langinfo(CODESET));
gives cs_CZ.UTF-8 ISO-8859-2
on glibc 2.2.2 and
on glibc 2.1.3.
It has been changed in glibc a few minutes ago.
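For anyone who wants to repeat the two-line test above on their own glibc, a
slightly fuller sketch follows. The function describe_locale and its return
convention are made up here; unlike the fragment above it checks the
setlocale() return value, since passing a NULL pointer to printf's %s is not
safe.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

/* Switches to the named locale and writes "<locale> <codeset>" into buf.
 * Returns 0 on success, -1 if setlocale() rejected the name or the result
 * did not fit into buf. Hypothetical helper for reproducing the test. */
int describe_locale(const char *name, char *buf, size_t size)
{
    const char *l = setlocale(LC_ALL, name);
    int n;

    if (l == NULL)
        return -1;
    n = snprintf(buf, size, "%s %s", l, nl_langinfo(CODESET));
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Calling it with "cs_CZ.UTF-8" and printing buf when it returns 0 should show
which of the behaviours discussed here a given glibc version has.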