Bug 34176

Summary:	LANG/LC_* can not be used to indicate UTF-8 display-support to apps
Product:	[Retired] Red Hat Linux	Reporter:	Daniel Resare <noa-bugzilla-redhat>
Component:	glibc	Assignee:	Jakub Jelinek <jakub>
Status:	CLOSED NOTABUG	QA Contact:	Aaron Brown <abrown>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	7.0	CC:	fweimer
Target Milestone:	---
Target Release:	---
Hardware:	i386
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2001-03-30 22:34:19 UTC	Type:	---

Description Daniel Resare 2001-03-30 22:34:01 UTC

It seems that the recommended way for an application to find out whether it
is OK to output UTF-8 encoded characters is to use the nl_langinfo(CODESET)
call. If a user wants to communicate that her setup interprets UTF-8 data
correctly (for example, xterm -u8 is used), the logical thing to do would be
to set LANG to, for example, sv_SE.UTF-8. In theory, that is. It turns out
that glibc seems to discard the codeset information if the given locale is
not encoded in that codeset. By default a redhat-7 system returns ISO-8859-1
when nl_langinfo(CODESET) is called, even if LANG is set to sv_SE.UTF-8.

A workaround is to create a private locale with the right codeset using
'localedef -v -c -i sv_SE -f UTF-8 $HOME/locales/sv_SE.UTF-8'
and then set LOCPATH, for example like this:
[noa@nestor locales]$ LOCPATH=$HOME/locales LANG=sv_SE.UTF-8 ~/c/cs_test
codeset: UTF-8

Conclusion: the codeset part of LANG or LC_CTYPE/LC_ALL should be reflected in
what nl_langinfo(CODESET) returns, even if the locale in question is not
encoded in that codeset.
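
To make the mismatch visible, one could compare the codeset requested in the
locale name with the one nl_langinfo(CODESET) reports. The following is only a
sketch: locale_name_codeset is a made-up helper name, and it assumes the usual
language[_territory][.codeset][@modifier] form of locale names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of glibc): copy the codeset part of a
 * locale name such as "sv_SE.UTF-8" or "de_DE.ISO-8859-15@euro" into buf.
 * Returns 1 if a codeset was present and fit into buf, 0 otherwise. */
int locale_name_codeset(const char *name, char *buf, size_t buflen)
{
    const char *dot, *end;
    size_t len;

    if (name == NULL || buf == NULL || buflen == 0)
        return 0;
    dot = strchr(name, '.');
    if (dot == NULL)
        return 0;                 /* e.g. "sv_SE": no codeset requested */
    dot++;                        /* skip the '.' itself */
    end = strchr(dot, '@');       /* optional modifier, e.g. "@euro" */
    len = end ? (size_t)(end - dot) : strlen(dot);
    if (len == 0 || len >= buflen)
        return 0;
    memcpy(buf, dot, len);
    buf[len] = '\0';
    return 1;
}
```

A test program could pass getenv("LANG") to this helper and print the result
next to nl_langinfo(CODESET) to show whether the two agree.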

Comment 1 Jakub Jelinek 2001-04-05 17:10:25 UTC

Actually it is not.
nl_langinfo(CODESET) tells you the charset in which all other nl_langinfo
values are encoded. Ulrich Drepper said he'll actually change glibc today so
that it does not accept a locale setting like sv_SE.UTF-8 unless the sv_SE
locale is encoded in the UTF-8 character set (the other way around this would
be translating the locales on the fly, but that's expensive).
What you should basically do is first check the $OUTPUT_CHARSET variable
and, if it is not given, try nl_langinfo(CODESET).
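
A minimal sketch of that lookup order, assuming an empty $OUTPUT_CHARSET should
be treated like an unset one; output_charset is a made-up name, not an existing
glibc function:

```c
#include <langinfo.h>
#include <stdlib.h>

/* Hypothetical helper: return the charset the program should use for output.
 * $OUTPUT_CHARSET takes precedence; otherwise the codeset of the current
 * locale is used. The caller is expected to have called
 * setlocale(LC_ALL, "") beforehand so that the locale reflects LANG/LC_*. */
const char *output_charset(void)
{
    const char *env = getenv("OUTPUT_CHARSET");

    if (env != NULL && env[0] != '\0')
        return env;                 /* explicit override from the user */
    return nl_langinfo(CODESET);    /* fall back to the locale's codeset */
}
```

The returned pointer refers either to the environment or to locale data, so it
should be copied if it needs to survive a later setlocale() or setenv() call.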

Comment 2 Markus Kuhn 2001-04-05 21:28:17 UTC

If the currently set locale uses the UTF-8 character encoding, then all standard
input/output/error communication, all file names, and all data in plaintext
files and pipes for which no other encoding is specified explicitly etc. should
be in UTF-8.

It is correct that the recommended way for an application to find out whether the
current locale uses UTF-8 is to use a test like

  strcmp(nl_langinfo(CODESET), "UTF-8") == 0

For details, please read

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#activate
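
For illustration, the test above could be wrapped up roughly as follows. This is
only a sketch; locale_is_utf8 is a made-up name, and treating a failed
setlocale() as "not UTF-8" is an assumption of this example.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/* Hypothetical helper: adopt the locale named by the environment
 * (LC_ALL, LC_CTYPE, LANG) and report whether its encoding is UTF-8.
 * Returns 1 for UTF-8 and 0 otherwise, including when setlocale() fails. */
int locale_is_utf8(void)
{
    if (setlocale(LC_ALL, "") == NULL)
        return 0;   /* requested locale not available */
    return strcmp(nl_langinfo(CODESET), "UTF-8") == 0;
}
```

Without the setlocale() call the program stays in the "C" locale it started in,
and the test reports the codeset of that locale instead of the user's.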

To check on the command line which encoding you have set, use

  locale charmap

If you set LANG=sv_SE.UTF-8 on a system where no "sv_SE.UTF-8" locale is
installed, then glibc will just silently stay in the "C" locale, unless the
program checked the return value of setlocale() properly. This is what happened
to you.
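
Checking the return value could look roughly like this. It is a sketch under
the assumption that a warning on stderr is an acceptable reaction; init_locale
is a made-up name.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: try to activate the given locale for all categories.
 * Pass "" to take the locale from the environment (LANG, LC_*).
 * Returns 0 on success. On failure it prints a warning and returns -1;
 * setlocale() then leaves the current locale unchanged, which is the "C"
 * locale if nothing else was set since program start. */
int init_locale(const char *name)
{
    if (setlocale(LC_ALL, name) == NULL) {
        fprintf(stderr, "warning: cannot set locale \"%s\"; "
                        "keeping the current locale\n", name);
        return -1;
    }
    return 0;
}
```

A program would typically call init_locale("") once at the start of main() and
decide there whether to continue or to exit when it returns -1.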

For instance SuSE Linux 7.1 installs by default the precompiled UTF-8 locales

  de_DE.UTF-8 el_GR.UTF-8 en_GB.UTF-8 en_US.UTF-8 fa_IR.UTF-8 fr_FR.UTF-8
  hi_IN.UTF-8 ja_JP.UTF-8 ko_KR.UTF-8 mr_IN.UTF-8 ru_RU.UTF-8 vi_VN.UTF-8
  zh_CN.UTF-8 zh_TW.UTF-8

in /usr/share/locale/. Use one of these.

If your favourite locale is not among these (or your distribution still lacks
preinstalled UTF-8 locales), no problem:

Any non-root user can easily use for instance

  localedef -v -c -i da_DK -f UTF-8 $HOME/local/locale/da_DK.UTF-8
  export LOCPATH=$HOME/local/locale
  export LANG=da_DK.UTF-8

to generate and activate for instance a Danish UTF-8 locale. The root user can
easily add a new UTF-8 locale for all users via

  localedef -v -c -i da_DK -f UTF-8 /usr/share/locale/da_DK.UTF-8

and if root wants to make da_DK.UTF-8 the system-wide default locale for every
user, then adding the line

   export LANG=da_DK.UTF-8

into /etc/profile will do the trick. 

If you start xterm (XFree86 4.0.2 or newer) after setting LANG to a UTF-8
locale, it will go automatically into UTF-8 mode.

If you have further questions on this matter, please consult the experts on the
linux-utf8 mailing list:

  http://www.cl.cam.ac.uk/~mgk25/unicode.html#lists

Comment 3 Jakub Jelinek 2001-04-06 17:55:56 UTC

The bug report speaks about behaviour when the requested locale is not
present in the system, and so far glibc does not fall back to
"C" when it cannot find the proper charset:
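#include <locale.h>
#include <langinfo.h>
#include <stdio.h>

int main(void)
{
  /* Request a UTF-8 locale; setlocale() returns NULL if it is rejected. */
  char *l = setlocale(LC_ALL, "cs_CZ.UTF-8");
  /* Print the reported locale name next to the codeset actually in use. */
  printf ("%s %s\n", l ? l : "(null)", nl_langinfo(CODESET));
  return 0;
}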
gives cs_CZ.UTF-8 ISO-8859-2
on glibc 2.2.2 and
cs_CZ ISO-8859-2
on glibc 2.1.3.
It has been changed in glibc a few minutes ago.