Bug 1339164
Summary: | HTTP Error" err="Cannot start container <hash>: [8] System error: read parent: connection reset by peer" statusCode=500 | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Laurent Rineau <laurent.rineau__fedora> | ||||
Component: | docker | Assignee: | Antonio Murdaca <amurdaca> | ||||
Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> | ||||
Severity: | high | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.2 | CC: | lsm5, lsu, mpatel, pep, stwalter | ||||
Target Milestone: | rc | Keywords: | Extras | ||||
Target Release: | --- | ||||||
Hardware: | Unspecified | ||||||
OS: | Unspecified | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | Bug Fix | |||||
Doc Text: |
Cause: when reading from the sync pipe between docker and libcontainer, a newline was left behind unread.
Consequence: containers failed to start with "error: read parent: connection reset by peer".
Fix: read all bytes from the sync pipe.
Result: containers can be started.
|
Story Points: | --- | ||||
Clone Of: | Environment: | ||||||
Last Closed: | 2016-06-23 16:18:57 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Attachments: |
|
I think the bug is from Docker, or systemd, but just in case, here is the Python script that starts the Docker containers: https://github.com/CGAL/cgal-testsuite-dockerfiles/blob/358e7e833297b1c3d2a0094e8f038320781ccd13/test_cgal.py

Fixed in docker-1.10. Could you try it out with docker-latest? Upstream issue: https://github.com/docker/docker/issues/14203 Otherwise the fix is https://github.com/opencontainers/runc/pull/515/commits/ddcee3cc2a2ffb3ab8c630fd62689fd14ce82e07 which could be backported to docker-1.9 (Mrunal could probably do it; there are a lot of conflicts after the container state refactor in libcontainer, which I don't know about).

Well, docker is in EPEL and RHEL Extras. I use the package from EPEL instead of installing Docker myself because I trust the EPEL packagers to be better than me at integrating docker correctly with the rest of the system (in particular with systemd, journald, and SELinux). So please fix the bug in RHEL and EPEL. For the purpose of testing, and to help you fix the bug, what would be a correct way to install docker-1.10 or later on my system without breaking the package management and the integration with systemd, journald, and SELinux? Is there an srpm that I could build locally? Or would it be better to try to install another version of docker in /usr/local/? In that case, I know how to deal with SELinux issues, but I am not sure of the procedure for the integration with systemd/journald. Can docker-1.9 and 1.10 share the same storage (an LVM volume in my case), if they are never run at the same time?

I can ONLY confirm that the patch is in docker-1.10.3-40.el7.x86_64. I could neither find a cgl account nor had enough resources to run the script (the script makes my VM run out of disk space). If anyone gets a chance to trigger the problem again, feel free to reopen the bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1274

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days |
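The Doc Text above attributes the error to data left unread on the sync pipe. The following is a minimal sketch of that mechanism, not the docker or runc code: it assumes the sync pipe behaves like a Unix stream socket pair, and the JSON message, the trailing newline, and the function name `child_sees` are hypothetical choices made for illustration.

```python
import json
import socket


def child_sees(drain_leftover: bool) -> str:
    """Report what the child end observes once the parent closes its end.

    Illustrative only: the message format is an assumption, not the
    actual protocol spoken over the docker/libcontainer sync pipe.
    """
    parent, child = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # Hypothetical sync message: a JSON document followed by a newline.
        payload = json.dumps({"pid": 1234}).encode()
        child.sendall(payload + b"\n")

        # The parent reads exactly the JSON bytes, so the newline stays queued.
        received = b""
        while len(received) < len(payload):
            chunk = parent.recv(len(payload) - len(received))
            if not chunk:
                raise RuntimeError("peer closed before the message was complete")
            received += chunk
        json.loads(received)

        if drain_leftover:
            # The "fix" in this sketch: consume the trailing byte before closing.
            parent.recv(1)

        parent.close()
        try:
            data = child.recv(1)
        except ConnectionResetError:
            return "reset"
        return "eof" if data == b"" else "data"
    finally:
        parent.close()
        child.close()


if __name__ == "__main__":
    print("leftover byte unread:", child_sees(drain_leftover=False))
    print("leftover byte drained:", child_sees(drain_leftover=True))
```

On Linux, closing a Unix stream socket that still has unread data queued is expected to surface as "connection reset by peer" on the other end, while a fully drained socket yields a plain end-of-file. Other kernels may behave differently, so treat the first result as platform-dependent.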
Created attachment 1160968: Extraction of logs from journalctl

Description of problem:

== TL;DR ==

One image is started every day, along with others, and randomly, on 4 days out of 20, the Docker logs said:

    level=error msg="HTTP Error" err="Cannot start container d48905e2cf7edf696e0bbc23a99ab93cb8d31a8055ea7b2c2f9b04e30a5a754a: [8] System error: read parent: connection reset by peer" statusCode=500

I attached the result of:

    sudo journalctl -b SYSLOG_IDENTIFIER=kernel \+ SYSLOG_IDENTIFIER=systemd \+ _EXE=/usr/bin/forward-journald --since '2016-05-24 00:19' --until '2016-05-24 00:20'

that is, the logs of docker/systemd/kernel around the last occurrence of the issue, today (with obfuscation of user id, hostname, and email address, each replaced by one random string). Docker was recently switched to debug mode to help with reporting this issue.

== Longer explanation ==

The CGAL open source project has a Python script that starts and stops containers to run CPU-intensive tests. Three containers run in parallel. Every day, about 20 different images are tested. In about four weeks, there have been four incidents that I cannot reproduce on demand: on most days all images were tested successfully, but on four days the error occurred.

I have tried several versions of Docker. Version docker-1.8.2-10.el7.centos.x86_64 is fine, but I got the issue with:

    docker-1.9.1-25.el7.centos.x86_64
    docker-1.9.1-40.el7.centos.x86_64

As you can see, my system is CentOS 7, not RHEL 7.

Version-Release number of selected component (if applicable):

I show here the latest tested version:

    cgal ~ $ docker version
    Client:
     Version: 1.9.1
     API version: 1.21
     Package version: docker-common-1.9.1-40.el7.centos.x86_64
     Go version: go1.4.2
     Git commit: ab77bde/1.9.1
     Built:
     OS/Arch: linux/amd64

    Server:
     Version: 1.9.1
     API version: 1.21
     Package version: docker-common-1.9.1-40.el7.centos.x86_64
     Go version: go1.4.2
     Git commit: ab77bde/1.9.1
     Built:
     OS/Arch: linux/amd64

    cgal ~ $ rpm -qa \*docker\*
    python-docker-py-1.7.2-1.el7.noarch
    docker-common-1.9.1-40.el7.centos.x86_64
    docker-forward-journald-1.9.1-40.el7.centos.x86_64
    docker-1.9.1-40.el7.centos.x86_64
    docker-selinux-1.9.1-40.el7.centos.x86_64

How reproducible:

I cannot reproduce it on demand. I just have to wait a few days.
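Because the failure is intermittent, it can help to scan saved journal output for the error line quoted in the description. The sketch below is one possible way to do that and assumes the log lines keep the exact wording shown in this report; the names `ERROR_RE` and `find_failed_starts` are hypothetical helpers, not part of docker or journalctl.

```python
import re
import sys

# Matches the docker daemon error quoted in this report and captures the
# container ID. The wording is taken from the report; other docker
# versions may phrase the message differently.
ERROR_RE = re.compile(
    r'err="Cannot start container (?P<cid>[0-9a-f]+): '
    r'\[8\] System error: read parent: connection reset by peer"'
)


def find_failed_starts(lines):
    """Return the container IDs of all failed starts found in the lines."""
    failed = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match:
            failed.append(match.group("cid"))
    return failed


if __name__ == "__main__":
    # Read log text from standard input, e.g. piped from journalctl.
    for cid in find_failed_starts(sys.stdin):
        print(cid)
```

Saved as a script, this could be fed the output of a journalctl query such as the one in the description, printing one container ID per failed start.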