Bug 1467919
Summary: running docker containers prevents processes from using real-time scheduling when restarted

Product: Red Hat Enterprise Linux 7
Component: docker-latest
Version: 7.4
Status: CLOSED NOTABUG
Severity: high
Priority: high
Reporter: Damien Ciabrini <dciabrin>
Assignee: Mrunal Patel <mpatel>
QA Contact: atomic-bugs <atomic-bugs>
CC: amurdaca, bhu, chjones, dciabrin, dwalsh, fdeutsch, fdinitto, hhuang, imcleod, jeckersb, jhonce, jpokorny, lars, lsm5, mpatel, oblaut, riel, rscarazz, sasha, tcrider, ushkalim
Target Milestone: rc
Keywords: Extras
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Clones: 1476214 (view as bug list)
Bug Blocks: 1415556, 1476214
Last Closed: 2017-07-27 23:04:23 UTC
Type: Bug
Description (Damien Ciabrini, 2017-07-05 13:58:28 UTC)

Created attachment 1294628 [details]: cgroup state before running docker
Created attachment 1294629 [details]: cgroup state right after first container has run
Created attachment 1294630 [details]: cgroup state after corosync has been restarted
Please provide the docker version. I must assume it's docker-latest; the docker package has no RT support AFAICT. As for docker-latest (1.13.1), I've already backported and fixed some patches to make RT cgroups honored:

https://github.com/projectatomic/docker/commit/007734e8cbc82e27d2a1b995147bce408b45fcce
https://github.com/projectatomic/docker/commit/2cd61f169108bb872e9bb02216f2ac7acdf80d9e
https://github.com/projectatomic/containerd/commit/d9fed2210bbeae0deb2a0b6ad7e0360d53facb59

There's also still a PR upstream: https://github.com/moby/moby/pull/33731

For RHEL, we're going to rebuild soon and the fix will probably land in the 1.13.1 docker-latest as part of 7.4 GA (probably...). For the broader issue, docker (1.12.6) doesn't support RT cgroups; that's why you actually need docker-latest (when patched).

Antonio, oh sorry, right... I'm experiencing the issue with docker-1.12.6-39.1.git6ffd653.el7.x86_64.

@Antonio: Are those patches about allowing containers to run with RT scheduling? To be clear, the corosync process Damien is talking about is running outside a container, on the host. It will end up being containerised eventually in OSP, but it is very likely to still need to run on the host for OSP12, with RT scheduling. If this works in docker-latest, then we should reassign there.

Isn't this a dupe of [bug 1425354]?

@Jan: That one does seem similar, although it looks like they ended up not being able to reproduce the bug with a system Docker package?

I don't know if this helps, but the underlying issue seems to be that running docker results in the creation of new cpu cgroups that apply to services started by systemd.
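One way to see this from a running process's perspective is to check which cpu cgroup it landed in by reading /proc/&lt;pid&gt;/cgroup. A minimal sketch, assuming the cgroup v1 layout used on RHEL 7; the helper function is ours, not from the bug:

```python
def cpu_cgroup_path(proc_cgroup_text):
    """Return the cpu controller's cgroup path from /proc/<pid>/cgroup text.

    Each cgroup v1 line looks like "hierarchy-id:controller-list:path",
    e.g. "4:cpu,cpuacct:/system.slice/corosync.service".
    """
    for line in proc_cgroup_text.splitlines():
        fields = line.split(":", 2)
        # the controller list is comma-separated, so match "cpu" exactly
        if len(fields) == 3 and "cpu" in fields[1].split(","):
            return fields[2]
    return None

# A process in the root cpu cgroup ("/") keeps access to the root RT
# budget; one under system.slice does not, because a new group's
# cpu.rt_runtime_us defaults to 0.
before = "4:cpu,cpuacct:/\n1:name=systemd:/system.slice/corosync.service"
after = "4:cpu,cpuacct:/system.slice/corosync.service"
print(cpu_cgroup_path(before))  # "/"
print(cpu_cgroup_path(after))   # "/system.slice/corosync.service"
```

On a live system one would feed it `open("/proc/%d/cgroup" % pid).read()`; a result other than "/" for corosync matches the failure described in this bug.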
Before starting docker:

# ls /sys/fs/cgroup/cpu
cgroup.clone_children  cgroup.sane_behavior  cpuacct.usage_percpu  cpu.rt_period_us   cpu.stat           tasks
cgroup.event_control   cpuacct.stat          cpu.cfs_period_us     cpu.rt_runtime_us  notify_on_release
cgroup.procs           cpuacct.usage         cpu.cfs_quota_us      cpu.shares         release_agent

But after starting docker, there are cpu controllers for systemd's system.slice and user.slice:

# systemctl start docker
# ls /sys/fs/cgroup/cpu
cgroup.clone_children  cpuacct.stat          cpu.cfs_quota_us   cpu.stat           tasks
cgroup.event_control   cpuacct.usage         cpu.rt_period_us   notify_on_release  user.slice
cgroup.procs           cpuacct.usage_percpu  cpu.rt_runtime_us  release_agent
cgroup.sane_behavior   cpu.cfs_period_us     cpu.shares         system.slice

(Note that /sys/fs/cgroup/cpu now contains 'system.slice' and 'user.slice' directories, which in turn apply to services started by systemd.)

It's not necessary to boot a container to trigger this issue; simply starting docker is sufficient.

I was able to hack around the problem by creating /etc/systemd/system/corosync.service.d/realtime.conf containing:

[Service]
ExecStartPost=/bin/sh -c "echo $MAINPID > /sys/fs/cgroup/cpu/tasks"
ExecStartPost=/bin/chrt -r -p 99 $MAINPID

This explicitly moves corosync back into the root cgroup and then sets the scheduling priority. In older versions of systemd there were explicit configuration directives that could accomplish the same thing (see https://www.freedesktop.org/wiki/Software/systemd/MyServiceCantGetRealtime/), but these are no longer available.

Lennart's recommended solution to the general problem is https://bugzilla.redhat.com/show_bug.cgi?id=1229700, which boils down to "disable CONFIG_RT_GROUP_SCHED because the semantics are insaaaaaaaane".

What kernel version is showing this bug?
The realtime kernel already disables CONFIG_RT_GROUP_SCHED.

@Antonio, just as a confirmation, we also checked that our issue happens on docker-latest:

[root@overcloud-controller-0 ~]# rpm -qa docker-latest
docker-latest-1.13.1-19.1.git19ea2d3.el7.x86_64
[root@overcloud-controller-0 ~]# chrt -p $(pidof corosync)
pid 20894's current scheduling policy: SCHED_RR
pid 20894's current scheduling priority: 99
[root@overcloud-controller-0 ~]# pcs cluster stop
Stopping Cluster (pacemaker)...
Stopping Cluster (corosync)...
[root@overcloud-controller-0 ~]# pcs cluster start
Starting Cluster...
[root@overcloud-controller-0 ~]# chrt -p $(pidof corosync)
pid 111320's current scheduling policy: SCHED_OTHER
pid 111320's current scheduling priority: 0

So we still experience the issue with docker-latest on RHEL-7.4: whenever a new process is started on the host, it can no longer get real-time priority.

(In reply to Rik van Riel from comment #15)
> What kernel version is showing this bug?
> The realtime kernel already disables CONFIG_RT_GROUP_SCHED

Kernel used for our OpenStack use case:

[root@overcloud-controller-0 ~]# uname -a
Linux overcloud-controller-0.novalocal 3.10.0-691.el7.x86_64 #1 SMP Thu Jun 29 10:30:04 EDT 2017 x86_64 x86_64 x86_64 GNU/Linux
[root@overcloud-controller-0 ~]# grep CONFIG_RT_GROUP_SCHED /boot/config-3.10.0-691.el7.x86_64
CONFIG_RT_GROUP_SCHED=y

I think a slightly shorter version of Lars' workaround would be:

echo 950000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us

Chris: I am pretty sure that won't work. The service will get its own child control group under system.slice, which will have no runtime budget. And you can't just modify the one cgroup; from the kernel docs (https://www.kernel.org/doc/Documentation/scheduler/sched-rt-group.txt):

> By default all bandwidth is assigned to the root group and new groups get the
> period from /proc/sys/kernel/sched_rt_period_us and a run time of 0.
> If you want to assign bandwidth to another group, reduce the root group's
> bandwidth and assign some or all of the difference to another group.

So you would need to at least modify the root rt_runtime_us allocation. An alternative solution that doesn't involve moving the service into the root cgroup could look like:

[Service]
ExecStartPre=/bin/sh -c 'echo 550000 > /sys/fs/cgroup/cpu,cpuacct/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/cpu.rt_runtime_us'
ExecStartPre=/bin/sh -c 'echo 200000 > /sys/fs/cgroup/cpu,cpuacct/system.slice/corosync.service/cpu.rt_runtime_us'

This knocks 400000us off the root runtime allocation and then splits the remainder between system.slice and corosync.service. This takes advantage of the fact that systemd creates the cgroups before calling ExecStartPre, which means we can set things up before the service actually starts.

(Just noticed https://lists.freedesktop.org/archives/systemd-devel/2017-July/039306.html; thanks for opening that discussion, Lars.)

(In reply to Lars Kellogg-Stedman from comment #19)
> This takes advantage of the fact that systemd creates the cgroups before
> calling ExecStartPre, which means we can set things up before the service
> actually starts.

The problem here is that this approach becomes a per-application workaround, and we don't know how many applications are affected by the change in host behavior.

Do we know what it is about starting Docker that causes systemd to change the layout of cgroups and move things out of the root slice? I'm wondering if that can somehow be avoided, mitigating this entire issue. Mrunal, do you have any idea?

With the intent of analyzing this issue in depth, we made a comparison between OSP11 and OSP12, to first check which processes might be affected by the issue, i.e. the processes that run with the RR scheduler.
These are the results.

In OSP11 (kernel 3.10.0-693.el7.x86_64, CONFIG_RT_GROUP_SCHED=y):

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [R]R
16926 RR 99 corosync

In OSP12 (kernel 3.10.0-693.el7.x86_64, CONFIG_RT_GROUP_SCHED=y):

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [R]R
20396 RR 99 corosync

So basically the only process with the RR scheduler is corosync, but I set a needinfo in this bug to Udi from QE just to help me check this on a near-to-production environment, since these were "simple" deployments with three controllers and one compute, so without, for example, storage.

Another thing we noticed: in OSP11, if we want to change the scheduler of a different process, like haproxy, to RR, we can do it:

[heat-admin@overcloud-controller-1 ~]$ ps -eo pid,class,rtprio,command --sort=+class | grep [h]aproxy
31051 TS - /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
95580 TS - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 31053
95582 TS - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 31053
[heat-admin@overcloud-controller-1 ~]$ chrt -p 31051
pid 31051's current scheduling policy: SCHED_OTHER
pid 31051's current scheduling priority: 0
[root@overcloud-controller-1 ~]# chrt -r -v -p 99 31051
pid 31051's current scheduling policy: SCHED_OTHER
pid 31051's current scheduling priority: 0
pid 31051's new scheduling policy: SCHED_RR
pid 31051's new scheduling priority: 99

In OSP12 this is not possible:

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [h]aproxy
42510 TS - /usr/sbin/haproxy-systemd-wrapper -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid
119235 TS - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 42512
119237 TS - /usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /run/haproxy.pid -Ds -sf 42512
[root@overcloud-controller-0 ~]# chrt
-p 42510
pid 42510's current scheduling policy: SCHED_OTHER
pid 42510's current scheduling priority: 0
[root@overcloud-controller-0 ~]# chrt -r -v -p 99 42510
pid 42510's current scheduling policy: SCHED_OTHER
pid 42510's current scheduling priority: 0
chrt: failed to set pid 42510's policy: Operation not permitted

Or at least, it is possible JUST for the corosync process:

[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [c]orosync
20396 RR 99 corosync
[root@overcloud-controller-0 ~]# chrt -r -v -p 50 20396
pid 20396's current scheduling policy: SCHED_RR
pid 20396's current scheduling priority: 99
pid 20396's new scheduling policy: SCHED_RR
pid 20396's new scheduling priority: 50
[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [c]orosync
20396 RR 50 corosync

since it already has the RR scheduler. But if we restart it, we run into the problem described earlier in the bug again:

[root@overcloud-controller-0 ~]# systemctl restart corosync
[root@overcloud-controller-0 ~]# ps -eo pid,class,rtprio,command --sort=+class | grep [c]orosync
124922 TS - corosync
[root@overcloud-controller-0 ~]# chrt -r -v -p 50 124922
pid 124922's current scheduling policy: SCHED_OTHER
pid 124922's current scheduling priority: 0
chrt: failed to set pid 124922's policy: Operation not permitted

I understand this is not very helpful, since it basically demonstrates what we already knew, but maybe this info can be somehow useful.

Do you have to be in the root cgroup? Can you move to a different custom slice?

(In reply to Chris Jones from comment #22)
> Do we know what it is about starting Docker that causes systemd to change
> the layout of cgroups and move things out of the root slice? I'm wondering
> if that can somehow be avoided, mitigating this entire issue.

I talked this over with Mrunal a bit this afternoon.
For background, have a look at Lennart's comments here: https://lists.freedesktop.org/archives/systemd-devel/2017-July/039210.html

In short, the presence of these additional slices is expected behavior once CPU accounting is enabled by any systemd service. Once this happens, newly started or restarted services end up in the system slice instead of the root slice. We don't enable CPU accounting explicitly in the docker unit file, but we _do_ request it from systemd via the systemd cgroupdriver in docker/runc. Some relevant source bits:

https://github.com/opencontainers/runc/blob/master/libcontainer/cgroups/systemd/apply_systemd.go#L249

And the related Docker documentation:

https://docs.docker.com/engine/reference/commandline/dockerd/#options-for-the-runtime

It's not clear why you haven't seen this until now, but it does not seem to be a bug or a regression in our docker package. This is a feature: it allows per-container resource control via cgroups. We cannot disable or remove this behavior. (We actually had a blocking bug opened against docker for a situation where setting these values wasn't working.)

As mentioned in the systemd thread and in the systemd documentation, it's also possible to trigger this behavior by adding CPUAccounting=True to any other active service on the system, or to the system-wide default. Again, this is a supported and valid option in the systemd that we ship, with well-documented behavior.

With all this in mind, I think we need to move forward with one of the workarounds discussed above. Lars also helpfully fleshed out an option later on in the systemd thread: https://lists.freedesktop.org/archives/systemd-devel/2017-July/039353.html

I've been told this is quite pressing. To move the conversation forward, I'm going to close this bug against docker-latest as NOTABUG. Feel free to reverse this and continue discussion if need be.
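Whichever workaround is adopted, the rt_runtime_us values written into the hierarchy must respect the admission rule quoted earlier from sched-rt-group.txt: a group's children cannot claim more combined runtime than the group itself holds. A small sketch of that rule; the helper is hypothetical, not part of any of the workarounds:

```python
def rt_budget_ok(parent_runtime_us, child_runtimes_us):
    """Admission rule from sched-rt-group.txt: the combined RT runtime of
    a cgroup's children may not exceed the runtime of the parent group."""
    return sum(child_runtimes_us) <= parent_runtime_us

# The ExecStartPre workaround: the root group is reduced to 550000us,
# 200000us is delegated to system.slice, and all of system.slice's
# budget is delegated on to corosync.service.
assert rt_budget_ok(550000, [200000])   # system.slice fits under root
assert rt_budget_ok(200000, [200000])   # corosync.service fits under system.slice

# Over-committing a slice would be rejected by the kernel with -EBUSY.
assert not rt_budget_ok(200000, [150000, 100000])
```

This is also why the one-line variant (writing only to system.slice) is not enough: the per-service child group underneath still defaults to a runtime of 0, so the service itself gets no RT time.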
@Ian: I think the main reason we haven't hit this before is that the OpenStack product hasn't been containerised before, so it's pretty unlikely Docker would have been started on a controller node and triggered the cgroup changes.

I'm going to leave this issue closed, since there isn't any work to be done here in docker, and we'll clone it for corosync to implement a workaround. Thanks!
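As the closing comments note, docker is not required to trigger the cgroup change at all; enabling CPU accounting on any active unit has the same effect. A hypothetical drop-in (the unit name is illustrative only) would reproduce it:

```ini
# /etc/systemd/system/someservice.service.d/accounting.conf
# (hypothetical unit; any active service on the system will do)
[Service]
CPUAccounting=true
```

After a daemon-reload and a restart of that service, /sys/fs/cgroup/cpu gains the system.slice and user.slice directories, and subsequently restarted services land in them, losing access to the root RT budget just as described in this bug.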