Bug 1194473

Summary: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 4: invalid start byte [
Product: Red Hat Enterprise Linux 7 Reporter: Adam Miller <admiller>
Component: python Assignee: Python Maintainers <python-maint>
Status: CLOSED WONTFIX QA Contact: BaseOS QE - Apps <qe-baseos-apps>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.0 CC: admiller, bkabrda, bnater, mgoldman, smilner, ttomecek
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Last Closed: 2016-01-11 17:59:20 UTC Type: Bug

Description Adam Miller 2015-02-19 22:25:43 UTC
Description of problem:
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 4: invalid start byte [

Version-Release number of selected component (if applicable):
python-2.7.5-16.el7.x86_64

How reproducible:
Always.

Steps to Reproduce:
The steps to reproduce are mentioned in the "Notes" section here:
    https://github.com/maxamillion/docker-image-strip

I originally thought it was a bug in docker creating a malformed tarball or docker-py making a bad call, but neither panned out. The same code works in Python 3, so it looks as though it's a Python 2 issue.

Actual results:
Traceback (most recent call last):
  File "./dis", line 269, in <module>
    create_chroot(working_dir, from_chroot, image_layers)
  File "./dis", line 145, in create_chroot
    layer_tar.extractall(path=chrootdir)
  File "/usr/lib64/python2.7/tarfile.py", line 2041, in extractall
    for tarinfo in members:
  File "/usr/lib64/python2.7/tarfile.py", line 2471, in next
    tarinfo = self.tarfile.next()
  File "/usr/lib64/python2.7/tarfile.py", line 2319, in next
    tarinfo = self.tarinfo.fromtarfile(self)
  File "/usr/lib64/python2.7/tarfile.py", line 1242, in fromtarfile
    return obj._proc_member(tarfile)
  File "/usr/lib64/python2.7/tarfile.py", line 1264, in _proc_member
    return self._proc_pax(tarfile)
  File "/usr/lib64/python2.7/tarfile.py", line 1394, in _proc_pax
    value = value.decode("utf8")
  File "/usr/lib64/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 4: invalid start byte
[


Expected results:
Not a traceback

Comment 2 Adam Miller 2015-02-23 14:16:11 UTC
For reference, the centos7:httpd image used to reproduce this is built from the Dockerfile found here:

https://github.com/CentOS/CentOS-Dockerfiles/tree/master/httpd/centos7

Comment 3 Bohuslav "Slavek" Kabrda 2015-02-23 15:29:22 UTC
So it seems that this fails while UTF-8 decoding a PAX header value that is not valid UTF-8. Docker registry has a test that shows what this may look like [1].

Python 2 just assumes that PAX headers are UTF-8 and fails on a decoding error, while Python 3, by default, uses the "surrogateescape" error handler [2], which preserves undecodable bytes as lone surrogates and restores the original bytes on a subsequent str.encode() call with the same handler. We'll try to see if we can mimic this behaviour in Python 2, but I can't promise this is actually doable.
Note for self: it seems that docker registry (that runs on Python 2) has a test that expects this failure [3].

For now, an obvious workaround is using python33 from RHSCL.


[1] https://github.com/docker/docker-registry/blob/09e9e20cd1d3df3f1435e970c50d7d7f775045f6/tests/test_tarfile.py#L78
[2] https://docs.python.org/3.4/library/codecs.html#error-handlers
[3] https://github.com/docker/docker-registry/blob/09e9e20cd1d3df3f1435e970c50d7d7f775045f6/tests/test_tarfile.py#L23
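
A minimal sketch of the difference described in this comment, assuming Python 3. The byte string is hypothetical (it is not taken from the failing image); only the 0xc0 byte at position 4 mirrors the error message. The names raw, text and restored are illustrative.

```python
# Hypothetical 8-byte value; only the 0xc0 at index 4 mirrors the report.
raw = b"\x01\x00\x00\x02\xc0\x00\x00\x00"

# Strict decoding (what the traceback shows Python 2.7 doing) rejects it.
try:
    raw.decode("utf-8")
    strict_error = None
except UnicodeDecodeError as exc:
    strict_error = exc

# surrogateescape maps each undecodable byte to a lone surrogate
# (0xc0 becomes U+DCC0), so the original bytes can be restored on encode.
text = raw.decode("utf-8", "surrogateescape")
restored = text.encode("utf-8", "surrogateescape")

print(strict_error)
print(repr(text))
print(restored == raw)
```

The round trip only works when the same error handler is used for encoding; a plain text.encode("utf-8") would raise on the lone surrogate.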

Comment 4 Bohuslav "Slavek" Kabrda 2015-02-23 15:32:14 UTC
Additional info: the PAX header that causes this is called "SCHILY.xattr.security.capability".
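
For anyone who wants to poke at this without pulling the image, here is a rough sketch (Python 3, untested against the real layer) that hand-builds a tiny archive carrying such a header and reads it back. The value bytes, the file name "file.txt" and the helper names pax_record, build_archive and read_xattr are all made up for illustration; the real capability value will differ.

```python
import io
import tarfile

KEY = b"SCHILY.xattr.security.capability"
VALUE = b"\x01\x00\x00\x02\xc0\x00\x00\x00"  # hypothetical, not from a real image


def pax_record(key, value):
    """Build one pax record b"<len> <key>=<value>\\n"; <len> counts itself."""
    body = b" " + key + b"=" + value + b"\n"
    length = len(body)
    while len(str(length)) + len(body) != length:
        length = len(str(length)) + len(body)
    return str(length).encode("ascii") + body


def build_archive():
    buf = io.BytesIO()
    records = pax_record(KEY, VALUE)
    # USTAR_FORMAT keeps tarfile from generating its own pax header, so the
    # hand-built extended header is written as-is (no hdrcharset record).
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tf:
        header = tarfile.TarInfo("PaxHeader/file.txt")
        header.type = tarfile.XHDTYPE
        header.size = len(records)
        tf.addfile(header, io.BytesIO(records))

        member = tarfile.TarInfo("file.txt")
        payload = b"hello"
        member.size = len(payload)
        tf.addfile(member, io.BytesIO(payload))
    return buf.getvalue()


def read_xattr(data):
    # Python 3 opens archives with errors="surrogateescape" by default.
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:") as tf:
        member = tf.getmember("file.txt")
        return member.pax_headers[KEY.decode("ascii")]


value = read_xattr(build_archive())
print(repr(value))
print(value.encode("utf-8", "surrogateescape") == VALUE)
```

Going by the traceback in the description, Python 2.7 would be expected to fail in _proc_pax when reading the bytes returned by build_archive(), but this sketch itself is written for Python 3 only.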

Comment 5 Bohuslav "Slavek" Kabrda 2015-02-23 15:41:48 UTC
More info: Docker registry actually approaches this a bit differently, by monkeypatching the _proc_pax method of tarfile, and there's a whole bug opened around that [1]. I guess we could provide a monkeypatch that would work the same way if imported in RHEL 7 Python 2.7. We'll try to see if we have more options and if not, we'll probably go ahead with this.


[1] https://github.com/docker/docker-registry/pull/381

Comment 6 Bohuslav "Slavek" Kabrda 2015-05-04 12:43:48 UTC
After rethinking this carefully and discussing with other folks at python-maint, here's the status of this bug:

- As noted above, the tarfile module should not be patched directly, in order to preserve backwards compatibility and avoid any possible regressions.
- Adding a distro-specific monkeypatching module to Python's stdlib for this would not be wise. We want to stay as close to vanilla Python as possible and we should not add a divergent patch (one that would only work on RHEL).

There are two systematic ways to solve this that we can see:
1) We would like to encourage everyone to move off the system Python and use RHSCL if at all possible. RHSCL has Python 3.3, which would solve your problem as noted in comment 3. Adam, would using RHSCL work for you? If not, could you specify why? Assuming you can use it, I'll close this bug as wontfix.
2) We could theoretically create a standalone package with the monkeypatch and upload it to PyPI, then package it as an RPM for (Fedora and) RHEL. The advantage would be that the runtime environment of your package would also be reproducible on Python 2 on other platforms, simply by installing this package. Furthermore, it'd be available for other people with similar issues to use.

I consider 2) to be a secondary solution and would very much prefer to solve this by using RHSCL as noted in 1). Adam, is this possible for you? Thanks.

Comment 7 Marek Goldmann 2015-05-13 06:22:21 UTC
I'm seeing the same issue here: https://github.com/goldmann/docker-scripts/issues/13

Comment 8 Tomas Tomecek 2016-04-12 09:08:51 UTC
I've hit this too and created an upstream issue: http://bugs.python.org/issue26740

Comment 9 Steve Milner 2017-05-17 20:19:10 UTC
I believe we've hit this issue in https://bugzilla.redhat.com/show_bug.cgi?id=1451697