Bug 1543575

Summary: Container deployment fails with oci runtime error applying cgroup configuration
Product: Red Hat Enterprise Linux 7 Reporter: Alan Pevec <apevec>
Component: docker Assignee: Daniel Walsh <dwalsh>
Status: CLOSED ERRATA QA Contact: atomic-bugs <atomic-bugs>
Severity: medium Docs Contact:
Priority: urgent    
Version: 7.4 CC: abregman, aburden, amurdaca, apevec, augol, bdobreli, ddarrah, derli, dprince, dwalsh, dyasny, emacchi, jcoufal, jeder, jlibosva, jschluet, lbezdick, lbopf, lfriedma, lsm5, m.andre, markmc, mbultel, mcornea, mpatel, ohochman, psuriset, racedoro, rhallise, sasha, sclewis, skatlapa, vichoudh, whayutin, yprokule
Target Milestone: rc Keywords: Extras, Triaged, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: docker-1.13.1-52.gitce62987.el7_4 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1523043 Environment:
Last Closed: 2018-03-07 09:51:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1523043    

Description Alan Pevec 2018-02-08 18:10:15 UTC
After investigation back in December in bz 1523043, this was traced down to the systemd/docker interaction.
We are now hitting it more frequently in the OpenStack upstream CI, and it is blocking the OSP 13 production chain.

+++ This bug was initially created as a clone of Bug #1523043 +++

TripleO CI is sometimes hitting errors when starting containers:

e.g.
 \"Error running ['docker', 'run', '--name', 'rabbitmq_image_tag', '--label', 'config_id=tripleo_step1', '--label', 'container_name=rabbitmq_image_tag', '--label', 'managed_by=paunch', '--label', 'config_data={\\"start_order\\": 1, \\"command\\": [\\"/bin/bash\\", \\"-c\\", \\"/usr/bin/docker tag \\'192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1\\' \\'192.168.24.1:8787/rhosp12/openstack-rabbitmq:pcmklatest\\'\\"], \\"user\\": \\"root\\", \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/dev/shm:/dev/shm:rw\\", \\"/etc/sysconfig/docker:/etc/sysconfig/docker:ro\\", \\"/usr/bin:/usr/bin:ro\\", \\"/var/run/docker.sock:/var/run/docker.sock:rw\\"], \\"image\\": \\"192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1\\", \\"detach\\": false, \\"net\\": \\"host\\"}', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/dev/shm:/dev/shm:rw', '--volume=/etc/sysconfig/docker:/etc/sysconfig/docker:ro', '--volume=/usr/bin:/usr/bin:ro', '--volume=/var/run/docker.sock:/var/run/docker.sock:rw', '192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1', '/bin/bash', '-c', \\"/usr/bin/docker tag '192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1' '192.168.24.1:8787/rhosp12/openstack-rabbitmq:pcmklatest'\\"]. [125]\", 
 \"/usr/bin/docker-current: Error response from daemon: invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\"process_linux.go:258: applying cgroup configuration for process caused \\\\\\\\"write /sys/fs/cgroup/pids/system.slice/docker-0642d71adf65f90fac83693d33be8857e9b1c4a5c69254357ea04fdeadf10c49.scope/cgroup.procs: no such device\\\\\\\\"\\\\"\\n\\".\", 

--- Additional comment from Mark McLoughlin on 2017-12-13 09:11:44 EST ---

Looks similar to https://github.com/openshift/origin/issues/16246

--- Additional comment from Vikas Choudhary on 2018-01-07 16:10:19 EST ---

Here is the detailed analysis of this issue:

https://github.com/openshift/origin/issues/16246#issuecomment-355852817

--- Additional comment from Emilien Macchi on 2018-01-30 13:19:47 EST ---

We are seeing a similar, if not the same, situation in the TripleO gate at this time:
https://bugs.launchpad.net/tripleo/+bug/1746298

The issue is critical: it makes our jobs fail randomly and it blocks the OSP13 production chain. Note that oci-register-machine is already disabled.

--- Additional comment from Vikas Choudhary on 2018-02-05 00:41:48 EST ---

There are two different races. The one related to the pids cgroup join, IMO, will not be worked around by disabling oci-register-machine.
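
For anyone triaging CI logs, here is a minimal sketch of how the pids cgroup join failure could be spotted in docker's stderr. The regular expression is an assumption based only on the error text quoted in the description, and the name `find_cgroup_race` is hypothetical; it is not part of docker, runc, or paunch.

```python
import re

# Assumed signature, derived only from the error quoted in the description:
# a failed write to a cgroup.procs file with "no such device".
CGROUP_RACE_RE = re.compile(
    r"applying cgroup configuration for process caused .*"
    r"write (?P<path>/sys/fs/cgroup/\S+/cgroup\.procs): no such device"
)


def find_cgroup_race(stderr_text):
    """Return the cgroup.procs path from the error, or None if absent."""
    match = CGROUP_RACE_RE.search(stderr_text)
    return match.group("path") if match else None
```

This only classifies a failure after the fact; it does not avoid the race, which still needs the runc fix.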

--- Additional comment from Alan Pevec on 2018-02-06 18:53:47 EST ---

(In reply to Vikas Choudhary from comment #49)
> Verified from the logs that you shared, docker in use is 1.12.6 and that is
> using runc which is at this commit:
> https://github.com/projectatomic/runc/commit/c5d311627d39439c5b1cc35c67a51c9c6ccda648

I also checked the latest 7.4-extras-pending build, docker-1.13.1-48.gitec9911e.el7_4, and the "Fix race against systemd" change is not included.

> Fix from opencontainers/runc,
> https://github.com/opencontainers/runc/pull/1683, is not there.  Therefore,
> as I said in the previous comment, to avoid this failure, the mentioned fix
> should be backported to projectatomic/runc

Where do we need to file an rhbz to get this into Extras quickly?
systemd bz 1532586 is approved for 7.4.z, but the next batch update is not until March 6th.

--- Additional comment from Daniel Walsh on 2018-02-07 10:02:18 EST ---

Antonio, can you see if we can get this patch backported to docker-runc?

Comment 8 Alan Pevec 2018-02-16 13:19:41 UTC
There seems to be an issue when deploying containerized TripleO, caused by the iptables rules added in Docker 1.13:

https://bugzilla.redhat.com/show_bug.cgi?id=1543580#c21

* Change the default `FORWARD` policy to `DROP` [#28257](https://github.com/docker/docker/pull/28257)

Comment 11 errata-xmlrpc 2018-03-07 09:51:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:0436