243541 – default encoding is ascii, should be UTF-8, produces exceptions for i18n applications

Keywords:
Status:	CLOSED CANTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	python
Sub Component:
Version:	12
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	---
Assignee:	Dave Malcolm
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2007-06-09 16:05 UTC by John Dennis
Modified:	2010-01-22 16:15 UTC (History)
CC List:	3 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	557241 (view as bug list)
Environment:
Last Closed:	2010-01-22 16:15:20 UTC
Type:	---
Embargoed:
Dependent Products:

Description John Dennis 2007-06-09 16:05:43 UTC

When Python outputs unicode strings it automatically translates them into the
default system encoding. The default encoding is set in site.py and cannot be
overridden by the user; once set in site.py it is locked. In Fedora and RHEL
our default encoding is UTF-8. This is normally set via login scripts in
/etc/profile.d. The user may choose to override the system default. In both
instances the default language and encoding are exported via an environment
variable.

In site.py there is code to allow the default encoding to be set from the locale
information discussed above; however, this functionality is turned off and the
encoding is instead hardcoded to ascii. This is clearly wrong IMHO. A typical
consequence is that an i18n python application using unicode strings will fault
with encoding exceptions when it tries to output any of its unicode strings.
Output throws exceptions because the default encoding is ascii: internally
CPython converts the unicode string using the default codec (ascii), which of
course fails if the unicode string contains characters outside the ascii
character set, which is highly likely in non-Latin languages.
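
The failure described above can be reproduced by calling the codecs directly,
without touching the interpreter's default encoding. This is only a minimal
sketch, written for Python 3 (an assumption; the report concerns Python 2,
where the same conversion happens implicitly on output). The helper name
try_encode is made up for illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the exception name if the codec rejects the text."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


greek = "\u03b1\u03b2\u03b3"  # lower-case alpha, beta, gamma

ascii_result = try_encode(greek, "ascii")  # ascii cannot represent these characters
utf8_result = try_encode(greek, "utf-8")   # UTF-8 can represent every character here

print(ascii_result)
print(utf8_result)
```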

If the default encoding were UTF-8, as it should be to match the rest of our
environment, the encoding translations from Python's internal UCS-4 Unicode to
UTF-8 would succeed. I have personally tested and verified that this works.

Also, one should take into account that ascii is identical to UTF-8 by design
when the text is composed only of characters from the ascii character set.
Therefore applications which placed ascii strings into Python's unicode strings
will not see a regression. Applications which used i18n unicode strings
previously could only have worked correctly if they were manually encoding to
UTF-8 on every output call; they should also see no regression. Applications
which load unicode strings from translation catalogs would never have worked
correctly and will now work.
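
The claim that ascii text encodes identically under both codecs is easy to
check. A small sketch in Python 3 (an assumption, as the thread is about
Python 2); the variable names are illustrative only.

```python
plain = "hello, world"

# For pure-ascii text the two codecs produce byte-for-byte identical output.
same_for_sample = plain.encode("ascii") == plain.encode("utf-8")

# The same holds for each of the 128 ascii code points individually.
same_for_all_ascii = all(
    chr(cp).encode("ascii") == chr(cp).encode("utf-8") for cp in range(128)
)

print(same_for_sample, same_for_all_ascii)
```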

Note, the only ways existing applications could have worked correctly are:

1) They load unicode strings and manually convert to UTF-8 on output (a correct
default encoding removes the need for manual conversion on every output call).

2) They load their i18n strings from a message catalog in UTF-8 format. This is
typically specified as the codeset parameter in
gettext.bind_textdomain_codeset() or gettext.install(). In this case the strings
loaded from the catalog ARE NOT UNICODE (python has an explicit string type
called unicode, which in our builds is UCS-4); normal python strings are
represented as 'str' objects. When gettext is told to return strings via _()
using the UTF-8 codeset, python represents them as 'str', not 'unicode'; in
other words they are sequences of octets. On output the default encoding is not
applied because they are not unicode strings; rather, they are vanilla strings.
Thus output works in our environment because their entire lifetime in python is
as UTF-8.

However, there are many good reasons to work with i18n strings as unicode, not
byte sequences which happen to be represented as UTF-8 (e.g. can't count the
number of characters, can't concatenate, etc.). Thus applications should be able
to represent their i18n strings as unicode (internally as UCS-4) and output
correctly with correct translation to UTF-8 automatically applied by python, not
manually.
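
The difference between a unicode string and its UTF-8 octets shows up as soon
as you count or slice. A sketch in Python 3, where the text type is called str
and the octet type is called bytes (in the Python 2 discussed here they are
unicode and str respectively); the variable names are illustrative.

```python
text = "\u03b1\u03b2\u03b3"   # three characters: alpha, beta, gamma
raw = text.encode("utf-8")    # the same text as a sequence of octets

char_count = len(text)        # counts characters
byte_count = len(raw)         # counts octets; each of these letters takes two

# Slicing the octets at an arbitrary offset can cut a character in half,
# leaving bytes that are no longer valid UTF-8.
truncated = raw[:3]
try:
    truncated.decode("utf-8")
    slice_is_valid = True
except UnicodeDecodeError:
    slice_is_valid = False

print(char_count, byte_count, slice_is_valid)
```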

This is from site.py. Note the hardcoding of 'ascii'. If the first 'if 0:' test
allowed locale.getdefaultlocale() to be called, it would allow the default
encoding to be correctly set from the environment. site.py should be patched to
allow this.

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !
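
For comparison, here is a rough sketch of what the disabled locale branch
computes, written for Python 3 (an assumption). It substitutes
locale.getpreferredencoding(False) for the locale.getdefaultlocale() call in
the excerpt above, and it only reports the encoding name; it does not call
sys.setdefaultencoding(), which does not exist in Python 3. The function name
encoding_from_locale is hypothetical.

```python
import locale


def encoding_from_locale(fallback="ascii"):
    """Return the encoding name implied by the current locale, or the fallback."""
    # Passing False asks for the encoding without changing the process locale.
    name = locale.getpreferredencoding(False)
    return name if name else fallback


chosen = encoding_from_locale()
print(chosen)
```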

Comment 1 Bug Zapper 2008-05-14 12:58:12 UTC

This message is a reminder that Fedora 7 is nearing the end of life. Approximately 30 (thirty) days from now Fedora will stop maintaining and issuing updates for Fedora 7. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '7'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 7's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 7 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora please change the 'version' of this bug. If you are unable to change the version, please add a comment here and someone will do it for you.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete. If possible, it is recommended that you try the newest available Fedora distribution to see if your bug still exists.

Please read the Release Notes for the newest Fedora distribution to make sure it will meet your needs:
http://docs.fedoraproject.org/release-notes/

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 2 James Antill 2008-05-14 15:45:28 UTC

 There's just no way I'm going to diverge from upstream on this ... yes, i18n is
horrible to work with in python. I'm sure I hate it more than you do. However
there are some ways to work around it, and this is far from the only problem.

 Look at what recent yum does for my advice on what to do (basically override
the global defaults for encoding and error type).

 Feel free to re-ask for this from upstream, although everyone tells me that
python-3000 will be much better (I'm not holding my breath).

Comment 3 Dave Malcolm 2009-12-14 20:50:37 UTC

John: I spent some time reviewing this today; here are my notes:

Looking over the source history in upstream's Subversion:
  -  the site.py hook to set the default encoding from the locale was added on June 7th 2000 in rev 15634:
'Added support to set the default encoding of strings
at startup time to the values defined by the C locale...'
  - http://svn.python.org/view?view=rev&revision=15634

  - the code was disabled by default 5 weeks later on July 15th 2000 in rev 16374 by effbot (Fredrik Lundh):
-- changed default encoding to "ascii".  you can still change
   the default via site.py...:
http://svn.python.org/view?view=rev&revision=16374

  - and the code was optimized two months later on Sept 18th 2000 in rev 17513, to only set it if it's changed:
  http://svn.python.org/view?view=rev&revision=17513

Looking over upstream mailing list archives for this period:
[Python-Dev] changing the locale.py interface?: Fredrik Lundh <effbot> http://mail.python.org/pipermail/python-dev/2000-July/005827.html
followed by:
http://mail.python.org/pipermail/python-dev/2000-July/005954.html "ascii default encoding": http://mail.python.org/pipermail/python-dev/2000-July/006724.html
(unfortunately side-tracked into a debate of "deprecated" vs "depreciated"); I may have missed some of the discussion though.


The actual effect of calling sys.setdefaultencoding:
It is defined in Python/sysmodule.c; it calls PyUnicode_SetDefaultEncoding(encoding) on the string "encoding".
PyUnicode_SetDefaultEncoding is defined in Objects/unicodeobject.c; it has this code:
    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
then copies the encoding into the buffer: "unicode_default_encoding"; this buffer supplies the return value for PyUnicode_GetDefaultEncoding(), which is used in many places inside the unicode implementation, plus in bytearrayobject.c: bytearray_decode()
 and in stringobject.c: PyString_AsDecodedObject()
                        PyString_AsEncodedObject()
so it would seem that there's at least some risk in changing this setting.
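
The validation step quoted above has a Python-level counterpart in
codecs.lookup(), which can be used to see what "make sure the encoding is
valid" means in practice. This is a sketch in Python 3 (an assumption), not a
trace of the C code; the helper name is_known_encoding is made up.

```python
import codecs


def is_known_encoding(name):
    """Return True if the codec registry can resolve the given encoding name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


normalized = codecs.lookup("UTF8").name  # lookup also normalizes aliases
known = is_known_encoding("ascii")
unknown = is_known_encoding("no-such-encoding")

print(normalized, known, unknown)
```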

To add to the confusion, Py_InitializeEx sets up the encoding of each of stdout, stderr, stdin to the default locale encoding (UTF-8), _provided_ they are connected to a tty:
#0  PyFile_SetEncodingAndErrors (f=0xb7fc5020, enc=0x80edc28 "UTF-8", errors=0x0) at Objects/fileobject.c:458
#1  0x04fbdd49 in Py_InitializeEx (install_sigs=<value optimized out>) at Python/pythonrun.c:322
#2  0x04fbe29e in Py_Initialize () at Python/pythonrun.c:359
#3  0x04fc9886 in Py_Main (argc=<value optimized out>, argv=<value optimized out>) at Modules/main.c:512
#4  0x080485c7 in main (argc=<value optimized out>, argv=<value optimized out>) at Modules/python.c:23

which means that a simple case (printing lower case greek alpha, beta, gamma) works when run directly:
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"'
αβγ
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'
>>> sys.stderr.encoding
'UTF-8'

...but fails if you redirect it to a file or pipe it into "less":
python -c 'print u"\u03b1\u03b2\u03b3"' > foo.txt
[david@brick ~]$ python -c 'print u"\u03b1\u03b2\u03b3"' | less
Traceback (most recent call last):
  File "<string>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
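
The two cases differ only in which codec the stream object ends up using. A
sketch in Python 3 (an assumption) that uses in-memory streams as stand-ins
for a terminal (UTF-8) and a pipe (ascii); the function name emit is
illustrative and nothing here inspects a real tty.

```python
import io


def emit(text, encoding):
    """Write text through a stream with the given encoding.

    Returns the bytes that reached the underlying sink, or the exception
    name if the stream's codec could not encode the text.
    """
    sink = io.BytesIO()
    stream = io.TextIOWrapper(sink, encoding=encoding)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    return sink.getvalue()


greek = "\u03b1\u03b2\u03b3"
tty_like = emit(greek, "utf-8")   # stands in for stdout attached to a terminal
pipe_like = emit(greek, "ascii")  # stands in for stdout redirected or piped

print(tty_like)
print(pipe_like)
```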

Comment 4 Dave Malcolm 2010-01-06 22:28:34 UTC

I'm thinking about making this change, but considering the possible impact, it feels like it warrants a feature page (due to the amount of testing required).

I've started writing it up here:
https://fedoraproject.org/wiki/DaveMalcolm/PythonEncodingUsesSystemLocale

Comment 6 Dave Malcolm 2010-01-20 19:38:45 UTC

I've marked https://fedoraproject.org/wiki/Features/PythonEncodingUsesSystemLocale
as "FeatureReadyForWrangler".

I want to make this change soon to maximize testing, but we've been without a rawhide for a few days - I plan to do it once rawhide is up and testable again.

Comment 7 Dave Malcolm 2010-01-22 16:15:20 UTC

I raised this on the upstream mailing list (python-dev), along with a patch to change the tty/non-tty variation (http://bugs.python.org/issue7745).

Upstream strongly requested that I not make this change, and I'm going to honor that request; I've withdrawn the feature request mentioned in comment #6.

An (over) simplified summary is that the situation is what it is, and that consistency of behavior between different downstream distributions is more important than making changes at this point in the lifecycle of Python 2.  Upstream feel that attempting to print a <unicode> instance containing code points > U+007F to a standard stream can be wrong when that stream isn't a terminal, and would prefer that application code was explicit about the encoding to be used and thus fail (I'm not sure I agree with this, but I don't want to diverge from upstream).
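
One way to read "explicit about the encoding" is for the application to encode
text itself and hand bytes to the stream, so the result no longer depends on
what the stream happens to be connected to. A minimal sketch in Python 3 (an
assumption); write_encoded is a hypothetical helper and io.BytesIO stands in
for a real binary stream such as sys.stdout.buffer.

```python
import io


def write_encoded(binary_stream, text, encoding="utf-8"):
    """Encode text with an explicitly chosen codec and write the bytes."""
    data = text.encode(encoding)
    binary_stream.write(data)
    return len(data)


buf = io.BytesIO()
written = write_encoded(buf, "\u03b1\u03b2\u03b3\n")
print(written, buf.getvalue())
```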

Unfortunately, this can lead to hidden bugs.  One way of ensuring better consistency between the tty/non-tty development/deployment cases is to use the PYTHONIOENCODING environment variable.  This value overrides the default encoding of sys.std[in|out|err]; in pseudocode:

if PYTHONIOENCODING set:
    encoding = PYTHONIOENCODING
else:
    if tty:
        encoding = locale # UTF-8
    else:
        encoding = ascii

so that it uses the supplied value in both cases without having this tty/non-tty inconsistency.
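
The pseudocode above can be written as a small runnable function. This is a
sketch of the decision only, under the assumptions stated in this comment; it
is not the interpreter's actual startup code, the function and parameter names
are invented, and it ignores the optional ":errors" suffix that
PYTHONIOENCODING accepts.

```python
def stream_encoding(environ, is_tty, locale_encoding="UTF-8"):
    """Pick the standard-stream encoding the way the pseudocode describes."""
    override = environ.get("PYTHONIOENCODING")
    if override:
        # An explicit setting wins regardless of what the stream is attached to.
        return override
    # Without an override, the result depends on whether the stream is a tty.
    return locale_encoding if is_tty else "ascii"


on_terminal = stream_encoding({}, is_tty=True)
in_pipeline = stream_encoding({}, is_tty=False)
forced = stream_encoding({"PYTHONIOENCODING": "UTF-8"}, is_tty=False)

print(on_terminal, in_pipeline, forced)
```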

By setting PYTHONIOENCODING=ascii in the environment, you force Python to use ascii for the encoding of the standard streams, and thus any encoding errors that would otherwise only appear when the script is deployed as a daemon/cronjob will surface immediately during development.

(Alternatively, you could set PYTHONIOENCODING=UTF-8 during both development and deployment, but that assumption needs to be clearly stated in the application's documentation; I haven't tested this latter approach.)
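
Both settings can be tried from a test harness by running a child interpreter
with PYTHONIOENCODING set and its output captured through a pipe. The sketch
below is written for Python 3 (an assumption; the child's print call uses
Python 3 syntax) and the helper name run_with_ioencoding is made up. It shows
how to exercise the variable, and is not a statement about how any particular
Python 2 build behaves.

```python
import os
import subprocess
import sys

# The child program prints three Greek letters; the source itself is pure ascii.
SNIPPET = r"print('\u03b1\u03b2\u03b3')"


def run_with_ioencoding(value):
    """Run the snippet in a child interpreter with PYTHONIOENCODING=value."""
    env = dict(os.environ, PYTHONIOENCODING=value)
    proc = subprocess.run(
        [sys.executable, "-c", SNIPPET],
        env=env,
        stdout=subprocess.PIPE,  # a pipe, so the child's stdout is not a tty
        stderr=subprocess.PIPE,
    )
    return proc.returncode, proc.stdout, proc.stderr


ascii_run = run_with_ioencoding("ascii")
utf8_run = run_with_ioencoding("UTF-8")

print(ascii_run[0], utf8_run[0])
```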

Closing CANTFIX.
