Bug 1467919

Summary: running docker containers prevents processes from using real-time scheduling when restarted
Product: Red Hat Enterprise Linux 7 Reporter: Damien Ciabrini <dciabrin>
Component: docker-latest Assignee: Mrunal Patel <mpatel>
Status: CLOSED NOTABUG QA Contact: atomic-bugs <atomic-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 7.4 CC: amurdaca, bhu, chjones, dciabrin, dwalsh, fdeutsch, fdinitto, hhuang, imcleod, jeckersb, jhonce, jpokorny, lars, lsm5, mpatel, oblaut, riel, rscarazz, sasha, tcrider, ushkalim
Target Milestone: rc Keywords: Extras
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1476214 (view as bug list) Environment:
Last Closed: 2017-07-27 23:04:23 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1415556, 1476214    
Attachments:
cgroup state before running docker (flags: none)
cgroup state right after first container has run (flags: none)
cgroup state after corosync has been restarted (flags: none)

Description Damien Ciabrini 2017-07-05 13:58:28 UTC
Description of problem:

We have a cluster manager process in RHEL - corosync - which is running with SCHED_RR priority.

[root@overcloud-controller-1 ~]# chrt -p $(pidof corosync)
pid 15720's current scheduling policy: SCHED_RR
pid 15720's current scheduling priority: 99

It seems that as soon as docker is used to run a container, cgroup properties are updated, and as a side effect, any subsequent restart of the corosync service prevents the corosync process from using SCHED_RR.
 
How reproducible:

Always

Steps to Reproduce:

Before running docker on the host:

[root@overcloud-controller-1 ~]# ls /sys/fs/cgroup/cpu
cgroup.clone_children  cgroup.procs          cpuacct.stat   cpuacct.usage_percpu  cpu.cfs_quota_us  cpu.rt_runtime_us  cpu.stat           release_agent
cgroup.event_control   cgroup.sane_behavior  cpuacct.usage  cpu.cfs_period_us     cpu.rt_period_us  cpu.shares         notify_on_release  tasks
[root@overcloud-controller-1 ~]# systemd-cgls > pre-docker-run.txt

1. Enable the docker daemon and run a container

[root@overcloud-controller-1 ~]# systemctl start docker
[root@overcloud-controller-1 ~]# docker pull centos
Using default tag: latest
Trying to pull repository registry.access.redhat.com/centos ... 
Trying to pull repository docker.io/library/centos ... 
latest: Pulling from docker.io/library/centos
d5e46245fe40: Pull complete 
Digest: sha256:aebf12af704307dfa0079b3babdca8d7e8ff6564696882bcb5d11f1d461f9ee9
[root@overcloud-controller-1 ~]# docker run -it centos /bin/true

2. Notice how /sys/fs/cgroup/cpu now has some systemd slices defined:

[root@overcloud-controller-1 ~]# ls /sys/fs/cgroup/cpu
cgroup.clone_children  cgroup.procs          cpuacct.stat   cpuacct.usage_percpu  cpu.cfs_quota_us  cpu.rt_runtime_us  cpu.stat       notify_on_release  system.slice  user.slice
cgroup.event_control   cgroup.sane_behavior  cpuacct.usage  cpu.cfs_period_us     cpu.rt_period_us  cpu.shares         machine.slice  release_agent      tasks
[root@overcloud-controller-1 ~]# systemd-cgls > post-docker-run.txt

But the running corosync process is not impacted yet 
[root@overcloud-controller-1 ~]# chrt -p $(pidof corosync)
pid 15720's current scheduling policy: SCHED_RR
pid 15720's current scheduling priority: 99

3. Restart corosync and see how it cannot request SCHED_RR any longer

[root@overcloud-controller-1 ~]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@overcloud-controller-1 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-1 ~]# chrt -p $(pidof corosync)
pid 294076's current scheduling policy: SCHED_OTHER
pid 294076's current scheduling priority: 0
[root@overcloud-controller-1 ~]# systemd-cgls > post-docker-run-svc-restart.txt


Actual results:
corosync is restarted, but without real time scheduling priority

Expected results:
real time scheduling priority request should be honoured.

Additional info:
Attached dump of systemd-cgls
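
For anyone reproducing this, the chrt check above can be complemented by looking at which cpu cgroup the process ended up in. This is only a minimal sketch, assuming the cgroup v1 layout shown in this report; the function name cpu_cgroup_of and its optional file argument are made up for illustration and are not part of any tool mentioned here.

```shell
# Print the cgroup path of the cpu controller for a given PID (cgroup v1).
# The optional second argument is a hypothetical override that exists only
# to make the function easy to test; it defaults to /proc/<pid>/cgroup.
cpu_cgroup_of() {
    pid=$1
    file=${2:-/proc/$pid/cgroup}
    # Each line looks like "ID:controller[,controller...]:path".
    awk -F: '{
        n = split($2, ctrl, ",")
        for (i = 1; i <= n; i++)
            if (ctrl[i] == "cpu") {
                # Strip the first two fields and print the remaining path.
                sub(/^[^:]*:[^:]*:/, "")
                print
                exit
            }
    }' "$file"
}

# Example (on the affected host):
#   cpu_cgroup_of "$(pidof corosync)"
#   chrt -p "$(pidof corosync)"
```

A result of "/" would mean the process is still in the root cpu cgroup; a path under system.slice would mean it was placed in one of the newly created groups.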

Comment 2 Damien Ciabrini 2017-07-05 14:07:35 UTC
Created attachment 1294628 [details]
cgroup state before running docker

Comment 3 Damien Ciabrini 2017-07-05 14:09:14 UTC
Created attachment 1294629 [details]
cgroup state right after first container has run

Comment 4 Damien Ciabrini 2017-07-05 14:10:43 UTC
Created attachment 1294630 [details]
cgroup state after corosync has been restarted

Comment 5 Antonio Murdaca 2017-07-05 14:13:46 UTC
Please provide the docker version. I assume it's docker-latest; the docker package has no RT support, as far as I can tell.

As for docker-latest (1.13.1) - I've already backported and fixed some patches to make RT cgroups honored:

https://github.com/projectatomic/docker/commit/007734e8cbc82e27d2a1b995147bce408b45fcce
https://github.com/projectatomic/docker/commit/2cd61f169108bb872e9bb02216f2ac7acdf80d9e
https://github.com/projectatomic/containerd/commit/d9fed2210bbeae0deb2a0b6ad7e0360d53facb59

and there's still also a PR upstream:

https://github.com/moby/moby/pull/33731

For RHEL, we're going to rebuild soon and the fix will probably land in 1.13.1 docker-latest as part of 7.4 GA (probably...)

Comment 6 Antonio Murdaca 2017-07-05 14:14:42 UTC
For the broad issue, docker (1.12.6) doesn't support RT cgroups - that's why you actually need docker-latest (when patched)

Comment 7 Damien Ciabrini 2017-07-05 14:19:37 UTC
Antonio,
Oh sorry, right... I'm experiencing the issue with docker-1.12.6-39.1.git6ffd653.el7.x86_64

Comment 8 Chris Jones 2017-07-05 16:48:15 UTC
@Antonio: Are those patches about allowing containers to run with RT scheduling? To be clear, the corosync process Damien is talking about, is running outside a container on the host. It will end up being containerised eventually in OSP, but it is very likely to still need to be running on the host for OSP12, with RT scheduling.

Comment 9 Daniel Walsh 2017-07-06 10:27:12 UTC
If this works in docker-latest, then we should reassign there.

Comment 10 Jan Pokorný [poki] 2017-07-06 11:49:44 UTC
Isn't this a dupe of [bug 1425354]?

Comment 12 Chris Jones 2017-07-06 12:35:27 UTC
@Jan: That one does seem similar, although it looks like they ended up not being able to reproduce the bug with a system Docker package?

Comment 13 Lars Kellogg-Stedman 2017-07-06 12:49:05 UTC
I don't know if this helps, but the underlying issue seems to be that running docker results in the creation of new cpu cgroups that apply to services started by systemd.  Before starting docker:

  # ls /sys/fs/cgroup/cpu
  cgroup.clone_children  cgroup.sane_behavior  cpuacct.usage_percpu  cpu.rt_period_us   cpu.stat           tasks
  cgroup.event_control   cpuacct.stat          cpu.cfs_period_us     cpu.rt_runtime_us  notify_on_release
  cgroup.procs           cpuacct.usage         cpu.cfs_quota_us      cpu.shares         release_agent

But after starting docker, there are cpu controllers for systemd's system.slice and user.slice:

  # systemctl start docker
  # ls /sys/fs/cgroup/cpu
  cgroup.clone_children  cpuacct.stat          cpu.cfs_quota_us   cpu.stat           tasks
  cgroup.event_control   cpuacct.usage         cpu.rt_period_us   notify_on_release  user.slice
  cgroup.procs           cpuacct.usage_percpu  cpu.rt_runtime_us  release_agent
  cgroup.sane_behavior   cpu.cfs_period_us     cpu.shares         system.slice

(Note that /sys/fs/cgroup/cpu now contains 'system.slice' and 'user.slice' directories, which in turn apply to services started by systemd)

It's not necessary to boot a container to trigger this issue.  Simply starting docker is sufficient.

I was able to hack around the problem by creating /etc/systemd/system/corosync.service.d/realtime.conf containing:

  [Service]
  ExecStartPost=/bin/sh -c "echo $MAINPID > /sys/fs/cgroup/cpu/tasks"
  ExecStartPost=/bin/chrt -r -p 99 $MAINPID

This explicitly moves corosync back into the root cgroup and then sets the scheduling priority. In older versions of systemd there were explicit configuration directives that could accomplish the same thing (see https://www.freedesktop.org/wiki/Software/systemd/MyServiceCantGetRealtime/), but these are no longer available.
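
For reference, a sketch of how that drop-in could be installed. The file contents are copied from the lines above; the helper name write_realtime_dropin and its directory argument are hypothetical and only there to make the steps explicit.

```shell
# Write the drop-in described above into the given directory
# (default: the corosync drop-in directory named in this comment).
write_realtime_dropin() {
    dropin_dir=${1:-/etc/systemd/system/corosync.service.d}
    mkdir -p "$dropin_dir" || return 1
    # Quoted heredoc delimiter: $MAINPID must reach systemd unexpanded.
    cat > "$dropin_dir/realtime.conf" <<'EOF'
[Service]
ExecStartPost=/bin/sh -c "echo $MAINPID > /sys/fs/cgroup/cpu/tasks"
ExecStartPost=/bin/chrt -r -p 99 $MAINPID
EOF
}

# Afterwards, as root:
#   write_realtime_dropin
#   systemctl daemon-reload
#   then restart corosync (e.g. pcs cluster stop / pcs cluster start)
```

The daemon-reload step is needed so that systemd picks up the new drop-in before the next restart.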

Comment 14 Lars Kellogg-Stedman 2017-07-06 13:25:26 UTC
Lennart's recommended solution to the general problem is https://bugzilla.redhat.com/show_bug.cgi?id=1229700, which boils down to "disable CONFIG_RT_GROUP_SCHED because the semantics are insaaaaaaaane".

Comment 15 Rik van Riel 2017-07-06 15:56:21 UTC
What kernel version is showing this bug?

The realtime kernel already disables CONFIG_RT_GROUP_SCHED

Comment 16 Damien Ciabrini 2017-07-07 12:56:04 UTC
@Antonio, just as a confirmation, we also checked that our issue happens on docker-latest:
[root@overcloud-controller-0 ~]# rpm -qa docker-latest
docker-latest-1.13.1-19.1.git19ea2d3.el7.x86_64

[root@overcloud-controller-0 ~]# chrt -p $(pidof corosync)                                                                                                                                                         
pid 20894's current scheduling policy: SCHED_RR
pid 20894's current scheduling priority: 99
[root@overcloud-controller-0 ~]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-0 ~]# chrt -p $(pidof corosync)                                                                                                                                                         
pid 111320's current scheduling policy: SCHED_OTHER
pid 111320's current scheduling priority: 0

So we still experience the issue with docker-latest on RHEL-7.4: whenever a new process is started on the host, it cannot get real time priority any longer.

Comment 17 Damien Ciabrini 2017-07-10 12:03:46 UTC
(In reply to Rik van Riel from comment #15)
> What kernel version is showing this bug?
> 
> The realtime kernel already disables CONFIG_RT_GROUP_SCHED

Kernel used for our OpenStack use case:

[root@overcloud-controller-0 ~]# uname -a
Linux overcloud-controller-0.novalocal 3.10.0-691.el7.x86_64 #1 SMP Thu Jun 29 10:30:04 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux

[root@overcloud-controller-0 ~]# grep CONFIG_RT_GROUP_SCHED /boot/config-3.10.0-691.el7.x86_64
CONFIG_RT_GROUP_SCHED=y

Comment 18 Chris Jones 2017-07-12 15:15:01 UTC
I think a slightly shorter version of Lars' workaround would be:

echo 950000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us

Comment 19 Lars Kellogg-Stedman 2017-07-12 15:35:10 UTC
Chris: I am pretty sure that won't work.

The service will get its own child control group under system.slice, which will have no runtime budget.  And you can't just modify the one cgroup; from the kernel docs (https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt):

> By default all bandwidth is assigned to the root group and new groups get the
> period from /proc/sys/kernel/sched_rt_period_us and a run time of 0. If you
> want to assign bandwidth to another group, reduce the root group's bandwidth
> and assign some or all of the difference to another group.

So you would need to at least modify the root rt_runtime_us allocation.

An alternative solution that doesn't involve moving the service into the root cgroup could look like:

[Service]
ExecStartPre=/bin/sh -c 'echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/corosync.service/cpu.rt_runtime_us'

This knocks 400000us off the root runtime allocation, then assigns 200000us to system.slice and passes that same 200000us down to its child group, corosync.service.

This takes advantage of the fact that systemd creates the cgroups before calling ExecStartPre, which means we can set things up before the service actually starts.
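
To check the result, one could print cpu.rt_runtime_us at every level from the root down to the service's cgroup. A minimal sketch, assuming the cgroup v1 mount point used above; show_rt_budget and its optional mount-point argument are hypothetical names introduced for illustration.

```shell
# Print cpu.rt_runtime_us for each cgroup from the root down to the given
# relative cgroup path, e.g. system.slice/corosync.service.
# The body runs in a subshell so the IFS change does not leak out.
show_rt_budget() (
    rel=$1
    base=${2:-/sys/fs/cgroup/cpu,cpuacct}
    dir=$base
    shown=
    printf '/ %s\n' "$(cat "$dir/cpu.rt_runtime_us")"
    IFS=/   # split the relative path on "/"
    for part in $rel; do
        [ -n "$part" ] || continue
        dir=$dir/$part
        shown=$shown/$part
        printf '%s %s\n' "$shown" "$(cat "$dir/cpu.rt_runtime_us")"
    done
)

# Example:
#   show_rt_budget system.slice/corosync.service
```

Per the kernel documentation quoted above, new groups start with a run time of 0, so a 0 at any level below the root points at the group that still lacks a real-time budget.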

Comment 20 Jan Pokorný [poki] 2017-07-12 16:17:00 UTC
(just noticed
https://lists.freedesktop.org/archives/systemd-devel/2017-July/039306.html,
thanks for opening that discussion, Lars)

Comment 21 Fabio Massimo Di Nitto 2017-07-13 03:34:26 UTC
(In reply to Lars Kellogg-Stedman from comment #19)

> This takes advantage of the fact that systemd creates the cgroups before
> calling ExecStartPre, which means we can set things up before the service
> actually starts.

The problem here is that this approach becomes a per-application workaround, and we don't know how many applications are affected by the change in host behavior.

Comment 22 Chris Jones 2017-07-26 10:57:49 UTC
Do we know what it is about starting Docker that causes systemd to change the layout of cgroups and move things out of the root slice? I'm wondering if that can somehow be avoided, mitigating this entire issue.

Comment 23 Daniel Walsh 2017-07-26 11:40:47 UTC
Mrunal, do you have any idea?

Comment 24 Raoul Scarazzini 2017-07-26 17:22:54 UTC
To analyze this issue in depth, we compared OSP11 and OSP12, first checking which processes might be affected by it, i.e. the processes that run with the RR scheduler.
These are the results:

In OSP11 (kernel 3.10.0-693.el7.x86_64 CONFIG_RT_GROUP_SCHED=y):

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [R]R
  16926 RR      99 corosync

In OSP12 (kernel 3.10.0-693.el7.x86_64 CONFIG_RT_GROUP_SCHED=y):

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [R]R
 20396 RR      99 corosync

So the only process using the RR scheduler is corosync. I have set a needinfo on this bug for Udi from QE to help check this on a near-production environment, since these were "simple" deployments with 3 controllers and one compute node, so without storage, for example.

Another thing we noticed is this one: 

In OSP11, we can change the scheduler of a different process, such as haproxy, to RR:

[heat-admin@overcloud-controller-1 ~]$ ps -eo pid,class,rtprio,command --sort=+class | grep [h]aproxy
  31051 TS       - /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
  95580 TS       - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 31053
  95582 TS       - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 31053
[heat-admin@overcloud-controller-1 ~]$ chrt -p 31051
pid 31051's current scheduling policy: SCHED_OTHER
pid 31051's current scheduling priority: 0
[root@overcloud-controller-1 ~]# chrt -r -v -p 99 31051
pid 31051's current scheduling policy: SCHED_OTHER
pid 31051's current scheduling priority: 0
pid 31051's new scheduling policy: SCHED_RR
pid 31051's new scheduling priority: 99

In OSP12 this is not possible:

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [h]aproxy
 42510 TS       - /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
119235 TS       - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 42512
119237 TS       - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 42512
[root@overcloud-controller-0 ~]# chrt -p 42510
pid 42510's current scheduling policy: SCHED_OTHER
pid 42510's current scheduling priority: 0
[root@overcloud-controller-0 ~]# chrt -r -v -p 99 42510
pid 42510's current scheduling policy: SCHED_OTHER
pid 42510's current scheduling priority: 0
chrt: failed to set pid 42510's policy: Operation not permitted

Or at least, it is possible JUST for the corosync process:

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [c]orosync
 20396 RR      99 corosync
[root@overcloud-controller-0 ~]# chrt -r -v -p 50 20396
pid 20396's current scheduling policy: SCHED_RR
pid 20396's current scheduling priority: 99
pid 20396's new scheduling policy: SCHED_RR
pid 20396's new scheduling priority: 50
[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [c]orosync
 20396 RR      50 corosync

since it already uses the RR scheduler. But if we restart it, we hit the problem described earlier in this bug again:

[root@overcloud-controller-0 ~]# systemctl restart corosync
[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [c]orosync
124922 TS       - corosync
[root@overcloud-controller-0 ~]# chrt -r -v -p 50 124922
pid 124922's current scheduling policy: SCHED_OTHER
pid 124922's current scheduling priority: 0
chrt: failed to set pid 124922's policy: Operation not permitted

I understand this is not very helpful, since it basically demonstrates what we already knew, but maybe this information is useful somehow.
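
As a side note for anyone repeating this survey: the grep [R]R filter matches "RR" anywhere on the line, including inside command lines. A slightly stricter sketch that filters on the class column only, and also includes SCHED_FIFO tasks (class FF in ps), might look like the following; the function name filter_rt_tasks is made up for illustration.

```shell
# Filter `ps` output down to real-time tasks by looking only at the
# scheduling class column (RR = SCHED_RR, FF = SCHED_FIFO).
filter_rt_tasks() {
    awk '$2 == "RR" || $2 == "FF"'
}

# Example:
#   ps -eo pid,class,rtprio,command | filter_rt_tasks
```

This expects the class to be the second column, as in the ps invocations above.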

Comment 25 Mrunal Patel 2017-07-27 21:34:53 UTC
Do you have to be in the root cgroup? Can you move to a different custom slice?

Comment 26 Ian McLeod 2017-07-27 23:04:23 UTC
(In reply to Chris Jones from comment #22)
> Do we know what it is about starting Docker that causes systemd to change
> the layout of cgroups and move things out of the root slice? I'm wondering
> if that can somehow be avoided, mitigating this entire issue.

I talked this over with Mrunal a bit this afternoon.  For background, have a look at Lennart's comments here:

https://lists.freedesktop.org/archives/systemd-devel/2017-July/039210.html

In short, the presence of these additional slices is expected behavior if CPU accounting is enabled by a systemd service.  Once this happens, newly started or restarted services end up in the system slice instead of the root slice.

We don't enable CPU Accounting explicitly in the docker unit file, but we _do_ request it from systemd via the systemd cgroupdriver in docker/runc.  Some relevant source bits:

https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L249

And the related Docker documentation:

https://docs.docker.com/engine/reference/commandline/dockerd/#options-for-the-runtime

It's not clear why you haven't seen this until now but it does not seem to be a bug or a regression in our docker package.  This is a feature.  It allows per-container resource control via cgroups.  We cannot disable or remove this behavior.  (We actually had a blocking bug opened against docker for a situation where setting these values wasn't working.)

As mentioned in the systemd thread and in the systemd documentation, it's also possible to trigger this behavior by adding CPUAccounting=True to any other active service on the system, or to the system-wide default.

Again, this is a supported and valid option in the systemd that we ship, with well documented behavior.
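
A quick way to tell whether a host is already in this state is to look for the system.slice directory under the cpu controller, as in the listings earlier in this bug. This is a minimal sketch only; the function name cpu_slices_present and its optional mount-point argument are hypothetical.

```shell
# Succeed if the cpu controller hierarchy contains a system.slice
# directory, i.e. services are no longer all kept in the root cpu cgroup.
cpu_slices_present() {
    base=${1:-/sys/fs/cgroup/cpu}
    [ -d "$base/system.slice" ]
}

# Example:
#   if cpu_slices_present; then
#       echo "cpu slices exist; restarted services may lose RT scheduling"
#   fi
#
# The per-unit setting can be inspected with:
#   systemctl show -p CPUAccounting <unit>
```

Checking the directory rather than a single unit's setting covers the case where any unit on the system, or the system-wide default, has enabled CPU accounting.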

With all this in mind, I think we need to move forward with one of the workarounds discussed above.  Lars also helpfully fleshed out an option later on in the systemd thread:

https://lists.freedesktop.org/archives/systemd-devel/2017-July/039353.html

I've been told this is quite pressing.  To move the conversation forward, I'm going to close this bug against docker-latest as NOTABUG.  Feel free to reverse this and continue discussion if need be.

Comment 27 Chris Jones 2017-07-28 10:13:19 UTC
@Ian: I think the main reason we haven't hit this before is because the OpenStack product hasn't been containerised before, so it's pretty unlikely Docker would have been started on a controller node and triggered the cgroup changes.

I'm going to leave this issue closed, since there isn't any work to be done here in docker, and we'll clone it for corosync to implement a workaround.

Thanks!