Bug 1514511 - systemd-machined can stop the docker.service
Summary: systemd-machined can stop the docker.service
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: oci-register-machine
Version: 7.4
Hardware: All
OS: Linux
Target Milestone: pre-dev-freeze
Target Release: 7.5
Assignee: Lokesh Mandvekar
QA Contact: atomic-bugs@redhat.com
Depends On:
Reported: 2017-11-17 16:03 UTC by Michele Baldessari
Modified: 2018-01-25 12:57 UTC
22 users

Fixed In Version: oci-register-machine-0-3.14.gitcd1e331.el7_4
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2018-01-25 12:57:03 UTC
Target Upstream Version:

Attachments (Terms of Use)
journal log from the last boot, where docker stopped unexpectedly (7.53 MB, application/x-bzip)
2017-11-22 09:21 UTC, Damien Ciabrini

System ID Private Priority Status Summary Last Updated
OpenStack gerrit 522388 0 None MERGED Set disable:true for oci-machine-register 2020-08-30 06:47:52 UTC
Red Hat Product Errata RHBA-2018:0196 0 normal SHIPPED_LIVE oci-register-machine bug fix update 2018-01-25 17:24:23 UTC

Description Michele Baldessari 2017-11-17 16:03:06 UTC
Description of problem:
It seems that under certain conditions (it's a frequently reproducible race; more on this below), after a container errors out, systemd-machined wrongly invokes a stop on the docker.service unit (instead of the usual docker-<UUID>.scope unit), thereby stopping docker and all other containers completely.

In RH OpenStack 12 we moved to docker containers for most of our services. Some of these containers are managed directly via docker. Some are managed by pacemaker which invokes docker to start them. We observed the following race after we restart some nodes:
1) Hard node reboot
2) At boot docker starts and it starts a bunch of containers (~50 or so) automatically (all of them have RestartPolicy “always” with unlimited maximum retries)
3) Sometimes, after the reboot, when a container stops it triggers the following [1] actions:
Nov 17 14:16:42 controller-2 systemd-machined[4873]: Machine c8a401499ab88f44796ced1216efc76c terminated. FOOBAR: Unit file docker.service
Nov 17 14:16:42 controller-2 systemd[1]: Stopping Docker Application Container Engine...

This causes a major disruption because the containerized OpenStack services will not start since docker has been stopped.

Sosreport can be found here: http://file.rdu.redhat.com/~mbaldess/systemd-docker/sosreport-controller-2-20171117144151.tar.bz2

Un-tarred sosreport files here:

[1] Note we installed a systemd package with the following debugging patch applied:
diff --git a/src/machine/machine.c b/src/machine/machine.c
index 223eb0f3641b..9cfc43a4d1b0 100644
--- a/src/machine/machine.c
+++ b/src/machine/machine.c
@@ -429,7 +429,7 @@ int machine_stop(Machine *m) {
                            "NAME=%s", m->name,
                            "LEADER="PID_FMT, m->leader,
-                           LOG_MESSAGE("Machine %s terminated.", m->name),
+                           LOG_MESSAGE("Machine %s terminated. FOOBAR: Unit file %s", m->name, m->unit),
         /* Kill cgroup */

Version-Release number of selected component (if applicable):

NB: in the sosreport you will see the systemd-219-42.fml.1.el7_4.4.x86_64 package, which only has [1] applied on top of the original package.

How reproducible:
Very often, but not all the time.

Comment 3 Lukáš Nykrýn 2017-11-21 13:05:12 UTC
This is just a hunch, but I think that this might be a bug in docker.

When I start docker I see that it uses RegisterMachine dbus method:
"RegisterMachine() is similar to CreateMachine(), however only registers a machine, but does not create a scope unit for it. The caller's unit will be registered instead."

So the unit is determined from the cgroup of the leader process.

I don't know how this is implemented in docker, so let's reassign this to them, with a question:
Could there be a race between docker starting the leader process, creating a scope and registering the process in machined?

Comment 4 Daniel Walsh 2017-11-21 13:37:39 UTC
The register machine code is called from runc, outside of docker. It should happen after PID 1 has been created inside the container but before it is running. Could this be happening because docker is not using systemd for managing the cgroup, so machined gets confused?

Comment 5 Lukáš Nykrýn 2017-11-21 13:53:24 UTC
To be more precise.
libcontainer calls StartTransientUnit method
oci-register-machine calls RegisterMachine method

Could there be a race between those two calls?

Comment 6 Michal Sekletar 2017-11-21 14:27:07 UTC
StartTransientUnit is asynchronous. The return value is a DBus object path, and one needs to subscribe to DBus signal notifications for that object. Once the job is completed, systemd sends out a signal with the result. StartTransientUnit from go-systemd actually allows waiting for job completion (by supplying a channel on which you later wait for the message containing the job result [1]).

However, libcontainer currently doesn't pass any channel into StartTransientUnit [2]. Hence code called after StartTransientUnit races with systemd, and systemd-machined may still see the container PID as part of docker.service, because the transient unit doesn't exist yet and the PID has not yet been migrated into the new cgroup.

[1] https://github.com/coreos/go-systemd/blob/master/dbus/methods.go#L136
[2] https://github.com/docker/libcontainer/blob/master/cgroups/systemd/apply_systemd.go#L201

Comment 7 Marian Krcmarik 2017-11-21 14:28:43 UTC
This may NOT be related, but I can often see some containers exit or fail to start with something like:
Nov 18 01:57:57 controller-1 dockerd-current[2664]: time="2017-11-18T01:57:56.990514935Z" level=error msg="Create container failed with error: invalid header field value \"oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:258: applying cgroup configuration for process caused \\\\\\\"open /sys/fs/cgroup/pids/system.slice/docker-20152b32aea39235a8a0d2ed7b3559456c3e3aa6ccd201128ec9f5bccdb69006.scope/cgroup.procs: no such file or directory\\\\\\\"\\\"\\n\""

The docker service is up and running in such cases; sometimes I see what is described in the bug description - the docker service is stopped, so all containers are stopped.

Just mentioning since cgroups were mentioned.

Comment 8 Daniel Walsh 2017-11-21 14:45:34 UTC
Mrunal, could this be a bug in runc?

Comment 10 Damien Ciabrini 2017-11-22 09:19:54 UTC
We had a recurrence today in our environment, so I'll add some details probably worth mentioning:

* Our machines are VMs, configured with debug logs enabled in docker and systemd
[root@controller-1 ~]# cat /etc/docker/daemon.json 
[root@controller-1 ~]# cat /etc/systemd/system.conf | grep debug

* Like previously, one of the controllers was rebooted ungracefully, via "echo c > /proc/sysrq-trigger"

* On restart, one container (75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b) 
seemed to have caused the stop of docker service.

I don't know if it's worth noting, but during the restart libcontainer complained in the journal about the client being "out of sync":

Nov 21 23:31:18 controller-1 dockerd-current[2688]: time="2017-11-21T23:31:18.805758013Z" level=debug msg="Loaded container 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b"
Nov 21 23:31:19 controller-1 dockerd-current[2688]: time="2017-11-21T23:31:19.318214148Z" level=warning msg="libcontainerd: client is out of sync, restore was called on a fully synced contai
Nov 21 23:31:19 controller-1 dockerd-current[2688]: time="2017-11-21T23:31:19.318729253Z" level=warning msg="failed to cleanup ipc mounts:\nfailed to umount /var/lib/docker/containers/75568d
Nov 21 23:31:22 controller-1 dockerd-current[2688]: time="2017-11-21T23:31:22.202612037Z" level=info msg="Removing stale sandbox 78f3d3aee2d232ceef89da0756ab58be14e326ba48c30b3698678950c70d7
Nov 21 23:31:26 controller-1 dockerd-current[2688]: time="2017-11-21T23:31:26.633063633Z" level=debug msg="Starting container 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b
Nov 21 23:31:28 controller-1 dockerd-current[2688]: time="2017-11-21T23:31:28.268646949Z" level=debug msg="createSpec: cgroupsPath: system.slice:docker:75568d917f5be14168ef2b961825e834118e79
Nov 21 23:31:29 controller-1 systemd[1]: Started docker container 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.
Nov 21 23:31:29 controller-1 systemd[1]: Starting docker container 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.
Nov 21 23:31:30 controller-1 oci-register-machine[5433]: 2017/11/21 23:31:30 Register machine: prestart 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b 5216 /var/lib/docker/
Nov 21 23:31:30 controller-1 systemd-machined[4105]: New machine 75568d917f5be14168ef2b961825e834.

* Then, when the aforementioned container had had too many consecutive restart failures, systemd got called to stop what is obviously the wrong service (docker itself):

Nov 21 23:33:18 controller-1 dockerd-current[2688]: time="2017-11-21T23:33:18.037365080Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-process\", Id:\
Nov 21 23:33:18 controller-1 dockerd-current[2688]: time="2017-11-21T23:33:18.037473326Z" level=debug msg="libcontainerd: event unhandled: type:\"start-process\" id:\"75568d917f5be14168ef2b9
Nov 21 23:33:18 controller-1 dockerd-current[2688]: time="2017-11-21T23:33:18.239694897Z" level=debug msg="containerd: process exited" id=75568d917f5be14168ef2b961825e834118e795296457d65eb9d
Nov 21 23:33:18 controller-1 dockerd-current[2688]: time="2017-11-21T23:33:18.240014402Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"75568d91
Nov 21 23:33:28 controller-1 systemd[1]: Stopping Docker Application Container Engine...
Nov 21 23:33:28 controller-1 dockerd-current[2688]: time="2017-11-21T23:33:28.173239468Z" level=debug msg="stopping 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b"
Nov 21 23:33:29 controller-1 dockerd-current[2688]: time="2017-11-21T23:33:28.173298775Z" level=debug msg="Sending 15 to 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b"
Nov 21 23:33:30 controller-1 oci-register-machine[22832]: 2017/11/21 23:33:30 Register machine: poststop 75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b 0 /var/lib/docker/ov
Nov 21 23:33:29 controller-1 systemd[1]: Got cgroup empty notification for: /system.slice/docker-75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.scope
Nov 21 23:33:29 controller-1 systemd[1]: docker-75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.scope: cgroup is empty
Nov 21 23:33:29 controller-1 systemd[1]: docker-75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.scope changed running -> dead
Nov 21 23:33:29 controller-1 systemd[1]: Collecting docker-75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.scope
Nov 21 23:33:30 controller-1 systemd[1]: Failed to load configuration for docker-75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.scope: No such file or directory

* The last log seems interesting: there's no scope file for docker-75568d917f5be14168ef2b961825e834118e795296457d65eb9de5092ebcd36b.scope in /var/run/systemd/machines.

Instead, the only entries there for that machine are:

[root@controller-1 ~]# ls -l /var/run/systemd/machines/ | grep 75568d917f5be14168ef2b961825e834
-rw-r--r--. 1 root root 326 Nov 22 00:00 75568d917f5be14168ef2b961825e834
lrwxrwxrwx. 1 root root  32 Nov 22 00:00 unit:docker.service -> 75568d917f5be14168ef2b961825e834

so this "unit:docker.service" symlink is probably the reason why systemd is stopping docker instead of the failing container.

I'm attaching the journal of the machine in case it is of any value.

Comment 11 Damien Ciabrini 2017-11-22 09:21:18 UTC
Created attachment 1357319 [details]
journal log from the last boot, where docker stopped unexpectedly

Comment 18 Daniel Walsh 2017-11-22 18:44:32 UTC
Simple fix:

Edit /etc/oci-register-machine and change the disable flag to "true".
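For reference, the workaround amounts to flipping one key in the config file. The exact layout below is an assumption based on the YAML-style config the package ships; adjust it to whatever your installed version actually contains:

```yaml
# /etc/oci-register-machine (hypothetical layout)
# Setting disable to true makes the prestart/poststop hooks no-ops,
# so containers are no longer registered with systemd-machined.
disable: true
```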

Comment 27 Dan Prince 2017-11-22 20:58:26 UTC
I've pushed an upstream fix to puppet-tripleo here for OSP12:

https://review.openstack.org/522388 Set disable:true for oci-machine-register

Comment 28 Dan Prince 2017-11-22 21:31:30 UTC
Also pushed the fix downstream for OSP12 testing:

https://code.engineering.redhat.com/gerrit/124023 Set disable:true for oci-machine-register

Comment 29 Fabio Massimo Di Nitto 2017-11-23 03:47:07 UTC
Sorry for the newbie question here, but aside from working around this race condition, what does changing the default imply in terms of system behavior? There has to be a reason why we register the containers with systemd-machined in the first place.

Comment 30 Lukáš Nykrýn 2017-11-23 09:17:44 UTC
Also, the proper fix here should not be that hard. You just need to make the start of the scope synchronous, or wait for it in oci-register-machine. Michal already posted the relevant parts of the code here.

Comment 32 Daniel Walsh 2017-11-23 11:40:39 UTC
Register machine is mainly used for systemd-based containers, so you can see the log/journal information on the host without entering the container.

Lukáš, Mrunal will work on the proper fix, but I believe it has to go into runc, which means taking it through upstream. This bug has existed for a while, and for people who do not need RegisterMachine, it is easy to turn off while we go through the upstream process.

We should clone this bug on runc to make sure the true fix gets in.

Comment 33 Red Hat Bugzilla Rules Engine 2017-11-23 14:39:34 UTC
This bugzilla has been removed from the release since it has not been triaged, and needs to be reviewed for targeting another release.

Comment 41 errata-xmlrpc 2018-01-25 12:57:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

