Bug 1139223
Summary: /sys/fs/cgroup/cpu,cpuacct/machine.slice disappears after systemctl daemon-reload
Product: Red Hat Enterprise Linux 7
Reporter: Erik van Pienbroek <erik-fedora>
Component: systemd
Assignee: systemd-maint
Status: CLOSED ERRATA
QA Contact: Branislav Blaškovič <bblaskov>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 7.0
CC: agiorgio, alitke, ar1666, bblaskov, berrange, bmcclain, dfediuck, dyuan, equinox-redhatbugz, fabrice, fkrska, gfa, igeorgex, jdenemar, jsvarova, kchamart, lagarcia, lnykryn, mdorman, mhayden, msekleta, ovasik, psklenar, rbalakri, redhat.bugzilla, redhat, smolin, systemd-maint-list, toni.peltonen
Target Milestone: rc
Keywords: ZStream
Target Release: ---
Flags: agiorgio: needinfo-
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: systemd-208-20.el7
Doc Type: Bug Fix
Doc Text:
Prior to this update, cgroup trees with multiple writers were handled insufficiently, and as a consequence the systemd daemon removed cgroups created by the libvirtd daemon. To fix this bug, a new option, Delegate, which controls the creation of cgroup sub-hierarchies and turns on resource delegation, has been implemented. This option defaults to "on" for machine scopes created by the systemd-machined daemon. As a result, cgroups created by libvirtd are no longer removed.
Story Points: ---
Clone Of:
Clones: 1178848 1179715 (view as bug list)
Environment:
Last Closed: 2015-03-05 11:11:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host: ---
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1035038, 1156399, 1175234, 1178848, 1179715
Attachments:
Description Erik van Pienbroek 2014-09-08 12:35:11 UTC
Note for testing: the cgroup does not appear to get deleted immediately after 'systemctl daemon-reload'. There seems to be a delay of at least 60 seconds, sometimes longer, before it gets deleted.

I can reproduce this on RHEL-7 with systemd-208-12.el7.x86_64, but I cannot reproduce it on Fedora 20 with systemd-208-22.fc20.x86_64. So in my testing at least, this bug seems to be RHEL-specific in some way.

I can reproduce the problem on RHEL7 with systemd-208-11.el7_0.2.x86_64.

*** Bug 1140215 has been marked as a duplicate of this bug. ***

Cgroups disappear after some (unrelated) service is restarted after daemon-reload (cf. linked bug 1140215). Also, this is a SECURITY ISSUE because resource constraints get un-applied(!!). And it's made particularly bad by the fact that there is no initial notice of this: applied limits just disappear without any message being printed, and it only becomes visible when you try to perform operations!

This bug needs more investigation, moving against 7.2.

I have a similar issue (bug 1158154) where libvirt creates cgroups under /sys/fs/cgroup/ and they disappear immediately. Libvirt shows the creation of the cgroups as successful in its logs, but then they're gone from the /sys/fs/cgroup directory when you look for them. My environment is a bit different, though. I'm on F21 with systemd 216.

According to Lennart's answer this probably needs to be fixed in libvirt: http://lists.freedesktop.org/archives/systemd-devel/2014-October/024208.html

How will fixing it in libvirt solve the issue that occurs when using only lxc without libvirt? (I'm seeing the issue on pure lxc.) I am not sure what the proper component is, please feel free to reassign this to the correct one.

AFAICS, it's a systemd bug: systemd shouldn't touch cgroups it didn't create, including removing processes from such cgroups. If the admin assigned a process to a different cgroup, that's the admin's decision and should be honored by systemd. I can't edit the component value, btw (no permissions, I assume).

I reassigned this to systemd, since David is correct. If I create a virtual server via virsh, it functions properly until I cause a systemd reload. There's something in that code path that is wiping the CPU share directories. One workaround we've found is to avoid issuing "systemctl daemon-reload" at all costs when virtual servers are running. Of course, this makes installing new system daemons a tricky (and hazardous!) operation.

Have you read the mentioned email? We also thought that this should be fixed on the systemd side, and Michal wrote a patch for that, but this is unacceptable for upstream, and Lennart mentioned how this could be solved on the other side.

I just read the link, and that's a very interesting solution. He's essentially saying that RHEL should take action to prevent the systemd cleanup behavior from occurring in the first place.

I did read the link before asking "how does fixing libvirt resolve the issue for lxc?" There is no "other side" to fix; the "other side" is my brain, because as an admin I occasionally create cgroups to put stuff in when I want to see how much CPU, memory & IO a particular group of processes uses. What would be the systemd-approved way of doing that, for me as an admin? Maybe libvirt & lxc can use that?

> What would be the systemd-approved way of doing that, for me as an admin?
> Maybe libvirt & lxc can use that?

As mentioned there, if you are creating a cgroup, make sure that there is always a process in it, and expect that if there is none, the cgroup could be cleaned away.
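As a minimal sketch of that advice (the cpuacct hierarchy matches the session below; the group name "mygroup" and the use of a placeholder task are assumptions for illustration, not taken from this report), an admin-created group can be kept permanently non-empty:

    # create the group in one hierarchy and park a long-lived placeholder task in it
    mkdir /sys/fs/cgroup/cpuacct/mygroup
    sleep infinity &                                  # GNU sleep accepts "infinity"
    echo $! > /sys/fs/cgroup/cpuacct/mygroup/tasks
    # per the advice above, still expect that a group left empty may be cleaned away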
(In reply to Lukáš Nykrýn from comment #16)
> > What would be the systemd-approved way of doing that, for me as an admin?
> > Maybe libvirt & lxc can use that?
>
> As mentioned there, if you are creating a cgroup, make sure that there is
> always a process in it, and expect that if there is none, the cgroup could
> be cleaned away.

There seems to have been a misunderstanding: systemd is touching non-empty cgroups as well. I can't find a reference to the cgroup being empty in the original report.

Sample session:

    [root@ειδωλον]: ~ # systemctl --version
    systemd 216
    +PAM -AUDIT +SELINUX +IMA -APPARMOR +SMACK -SYSVINIT +LIBCRYPTSETUP +GCRYPT -GNUTLS +ACL +XZ -LZ4 +SECCOMP +BLKID -ELFUTILS +KMOD +IDN
    [root@ειδωλον]: ~ # cd /sys/fs/cgroup/cpuacct
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct # mkdir test
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct # cd test
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # echo $$
    312384
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # echo $$ > tasks
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # cat tasks
    312384
    312427
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # systemctl daemon-reload
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # cat tasks
    312384
    312474
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # systemctl restart apache2.service
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test # cat tasks
    [root@ειδωλον]: /sys/fs/cgroup/cpuacct/test #

It only seems to happen on the first systemd command after a daemon-reload.

Aaaah, there seem to be two closely related but not identical bugs here.

#1: systemd deletes cgroups that look like they were created by systemd, but in fact weren't (e.g. machine.slice/... for libvirt).
#2: systemd reassigns processes to other cgroups, even if they've been explicitly assigned to a cgroup somewhere else outside systemd's hierarchy.

I would guess it happens through the same mechanism, systemd somewhere redoing the entire cgroup hierarchy after a reload...

David, I have seen #1 in practice several times across ~200 servers running Fedora 21 and systemd 216 (and 217), but I haven't seen #2. I've had hundreds of VMs actively running and I come back later to find the majority of them missing cgroups that were set by libvirt during the VMs' startup.

(In reply to Major Hayden from comment #20)
> David,
>
> I have seen #1 in practice several times across ~200 servers running Fedora
> 21 and systemd 216 (and 217), but I haven't seen #2. I've had hundreds of
> VMs actively running and I come back later to find the majority of them
> missing cgroups that were set by libvirt during the VMs' startup.

Well, yeah, to see #2 you need to use a cgroup whose name doesn't start with machine.slice, i.e. doesn't fit the systemd naming scheme. (My terminal session above demonstrates the steps to reproduce it.) What does systemd do if a cgroup under machine.slice is non-empty? Does it only delete empty ones and leave used ones untouched?

David -- correct. Here's an example: https://gist.github.com/major/a8d6a068bd1fdceef97c

A freshly rebooted VM looks fine. I can query systemd and it shows that the VM is up. However, if I let it sit for a few minutes, some cgroups will be silently removed, INCLUDING block device ACLs applied in /sys/fs/cgroup/devices. Removing cgroups that contain devices.allow/devices.deny seems like a serious security problem.

I haven't done extensive testing with hundreds of VMs, but in my experience I've only seen the CPU share directories vanish after a systemd reload. If I just leave the system alone, they tend to hang around.
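To make the two failure modes above easier to observe on a libvirt host, a rough check along these lines (the qemu process match and the controller paths are assumptions about a typical RHEL 7 setup, not taken from this report) shows which cgroups a running guest still sits in and whether its devices-controller directory survived:

    # pick the PID of a running qemu guest (the match string is hypothetical)
    pid=$(pgrep -f qemu-kvm | head -n1)
    # one line per controller, showing the cgroup the process currently belongs to
    cat /proc/$pid/cgroup
    # the guest's scope directory should still be listed under the devices controller
    ls /sys/fs/cgroup/devices/machine.slice/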
After digging in with strace tonight, I'm seeing this on every system I have with VMs running: https://gist.github.com/major/3b97dfdc6e78868b97ae

In the first few lines, systemd is attempting to remove the blkio, memory, and devices cgroups for an actively running virtual machine. For some reason, it can't remove them on the first try, but it is able to remove them later on (see the last three lines of the gist). That's causing issues with OpenStack's nova-volume, and it also removes cgroups which are critical to the security of the system. I'm still sifting through the systemd code to find the source of the aggressive cgroup trimming.

The problem might be related to systemd-machined stopping at some point after the VM starts. The cgroups are removed with cg_trim() within 5 minutes. On a Fedora 20 server running systemd-208-22, systemd-machined stays running as long as a virtual machine is running on the host.

Does anyone know if this is being actively worked on? This is a blocker for me getting to the Juno release of OpenStack on el7. With the libvirt cgroups being removed, we can't do quota enforcement or metrics collection. There is this patch ( http://lists.freedesktop.org/archives/systemd-devel/2014-September/023276.html ), but it only works on systemd >= 210. el7 has 208, so I can't just easily add the patch and rebuild from the srpm. Thanks.

It is still in development. The mentioned patch was not accepted, and this was supposed to be fixed by the new Delegate option[1]. But unfortunately this is not working properly[2].

[1] http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=
[2] http://lists.freedesktop.org/archives/systemd-devel/2014-November/025607.html

Our issue is similar to comment 18. Cgroups disappear after systemd is reloaded and some other systemd service has an action performed against it. Here are the logs [1]: you can see that one guest is active, instance-00000631. I execute the commands in [2] and I immediately start seeing I/O errors, and the cgroups have been deleted. Someone else mentioned in a similar bug that they are seeing SIGTERMs in systemd-machined and to disable the watchdog. The watchdog is not enabled for systemd-machined [3], and we are not seeing any errors around that service/dbus when the cgroups are removed.

[1] https://gist.github.com/krislindgren/68ea90079ff10d704c63#file-log-L100-L137
[2] https://gist.github.com/krislindgren/68ea90079ff10d704c63#file-commands
[3] https://gist.github.com/krislindgren/68ea90079ff10d704c63#file-systemd-machined-service

I have a backport of the upstream fix ready. There is still one issue which will need to be fixed; the following line appears in the journal:

    Failed to reset devices.list on /machine.slice: Invalid argument

I am not sure yet how important this is. Probably not much, since I have been running the patched version on my workstation for a week or so, and apart from the mentioned log line in the journal everything works as expected.

I'd be happy to take a copy of said fix and apply it to our internal systems.

Created attachment 970602 [details]
0001-core-introduce-new-Delegate-yes-no-property-controll
Created attachment 970603 [details]
0002-core-don-t-migrate-PIDs-for-units-that-may-contain-s
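The Delegate= setting introduced by these patches is the documented systemd.resource-control(5) option referenced above; per the doc text it defaults to "on" for machine scopes created by systemd-machined. As a rough sketch only (the unit name is hypothetical), a service that manages its own cgroup sub-hierarchy could be marked as delegated with a drop-in:

    # hypothetical drop-in for a service that writes its own cgroup sub-hierarchy
    mkdir -p /etc/systemd/system/my-container-manager.service.d
    cat > /etc/systemd/system/my-container-manager.service.d/delegate.conf <<'EOF'
    [Service]
    Delegate=yes
    EOF
    systemctl daemon-reload    # pick up the new drop-in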
Here are test rpms: https://msekleta.fedorapeople.org/systemd-1139223/

I'm getting dependency errors with the above RPMs. Is there something special I need to do in order to install them? This is on a minimal x86_64 RHEL7 VM.

I just did some testing with the RPMs mentioned in comment 34, and a quick test has shown that it resolves the disappearing cgroups issue for me. Here's what I've done:
- Installed the RPMs from comment 34
- Restarted vdsm and oVirt services
- Restarted all VMs where vdsm was complaining about missing cgroups
- Verified that the system logs are clean (no more errors about missing cgroups)
- Performed a systemctl daemon-reload
- Waited for a couple of minutes
- Verified the system logs again, they're still clean

(In reply to Anthony Giorgio from comment #35)
> I'm getting dependency errors with the above RPMs. Is there something
> special I need to do in order to install them? This is on a minimal x86_64
> RHEL7 VM.

It shouldn't be necessary to do anything in order to install them. Can you post the error message generated by the dependency errors?

Looks like I was missing the glib2-devel RPM. Once I installed that, the test RPMs installed properly. I'm going to run some test scenarios now and see if our problem is resolved.

The test RPMs look reasonable after a quick smoke test on my x86_64 VM. I did the following:
* Started a KVM virtual machine via virsh
* Performed a "systemctl daemon-reload"
* Ran some test cases that use libvirt to get and set CPU shares on the VM

All of the above appear to work, but I'm seeing the following message in my journal:

    libvirtd[1189]: End of file while reading data: Input/output error

I'm not sure what exact operation is causing this yet.

(In reply to Anthony Giorgio from comment #40)
> The test RPMs look reasonable after a quick smoke test on my x86_64 VM. I
> did the following:
>
> * Started a KVM virtual machine via virsh
> * Performed a "systemctl daemon-reload"
> * Ran some test cases that use libvirt to get and set CPU shares on the VM
>
> All of the above appear to work, but I'm seeing the following message in my
> journal:
>
> libvirtd[1189]: End of file while reading data: Input/output error
>
> I'm not sure what exact operation is causing this yet.

Note that some of the issues appear only when you do "systemctl restart <other service>" while containers are running. I.e. during your third step, you could run a loop that restarts some unrelated service every second or so.

Okay then, I ran the following:

    while true; do systemctl status jexec.service; sleep 1; done

Then I ran through my test cases. I was able to get and set CPU shares for my VM without issue. I checked my journal, and I saw the message mentioned above:

    Failed to reset devices.list on /machine.slice: Invalid argument

Also, I only see the message about the libvirt EOF when I run my test case that gets/sets CPU shares.

I just tried the s390x packages on a zLinux test system, and I see similar behavior. I am able to successfully get/set CPU shares, and the same journal messages are present.

> Then I ran through my test cases. I was able to get and set CPU shares for
> my VM without issue. I checked my journal, and I saw the message mentioned
> above:
>
> Failed to reset devices.list on /machine.slice: Invalid argument

Yep, we also need http://cgit.freedesktop.org/systemd/systemd/commit/?id=714e2e1d56b97dcf2ebae2d0447b48f21e38a600 but this is only a cosmetic issue.
I don't think it deserves a respin for 7.1, especially in such a late phase, but it should be mentioned in the release notes. Let's deal with that message in a separate bug: https://bugzilla.redhat.com/show_bug.cgi?id=1178848

Old package:

    :: [ 13:43:45 ] :: Package versions:
    :: [ 13:43:45 ] ::   systemd-208-11.el7.x86_64
    :: [ BEGIN ] :: Running 'virsh create tester.xml'
    Domain kvm1 created from tester.xml
    :: [ PASS ] :: Command 'virsh create tester.xml' (Expected 0, got 0)
    :: [ BEGIN ] :: Running 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice'
    cgroup.clone_children  cgroup.procs  cpuacct.usage  cpu.cfs_period_us  cpu.rt_period_us  cpu.shares  machine-qemu\x2dkvm1.scope  tasks
    cgroup.event_control  cpuacct.stat  cpuacct.usage_percpu  cpu.cfs_quota_us  cpu.rt_runtime_us  cpu.stat  notify_on_release
    :: [ PASS ] :: Command 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice' (Expected 0, got 0)
    :: [ BEGIN ] :: Running 'systemctl daemon-reload'
    :: [ PASS ] :: Command 'systemctl daemon-reload' (Expected 0, got 0)
    :: [ BEGIN ] :: Running 'systemctl restart chronyd'
    :: [ PASS ] :: Command 'systemctl restart chronyd' (Expected 0, got 0)
    :: [ 13:43:48 ] :: This directory should be still available
    :: [ BEGIN ] :: Running 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice'
    ls: cannot access /sys/fs/cgroup/cpu,cpuacct/machine.slice: No such file or directory
    :: [ FAIL ] :: Command 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice' (Expected 0, got 2)

New package:

    :: [ LOG ] :: Package versions:
    :: [ LOG ] ::   systemd-208-20.el7.x86_64
    :: [ BEGIN ] :: Running 'virsh create tester.xml'
    Domain kvm1 created from tester.xml
    :: [ PASS ] :: Command 'virsh create tester.xml' (Expected 0, got 0)
    :: [ BEGIN ] :: Running 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice'
    cgroup.clone_children  cgroup.procs  cpuacct.usage  cpu.cfs_period_us  cpu.rt_period_us  cpu.shares  machine-qemu\x2dkvm1.scope  tasks
    cgroup.event_control  cpuacct.stat  cpuacct.usage_percpu  cpu.cfs_quota_us  cpu.rt_runtime_us  cpu.stat  notify_on_release
    :: [ PASS ] :: Command 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice' (Expected 0, got 0)
    :: [ BEGIN ] :: Running 'systemctl daemon-reload'
    :: [ PASS ] :: Command 'systemctl daemon-reload' (Expected 0, got 0)
    :: [ BEGIN ] :: Running 'systemctl restart chronyd'
    :: [ PASS ] :: Command 'systemctl restart chronyd' (Expected 0, got 0)
    :: [ 13:49:39 ] :: This directory should be still available
    :: [ BEGIN ] :: Running 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice'
    cgroup.clone_children  cgroup.procs  cpuacct.usage  cpu.cfs_period_us  cpu.rt_period_us  cpu.shares  machine-qemu\x2dkvm1.scope  tasks
    cgroup.event_control  cpuacct.stat  cpuacct.usage_percpu  cpu.cfs_quota_us  cpu.rt_runtime_us  cpu.stat  notify_on_release
    :: [ PASS ] :: Command 'ls /sys/fs/cgroup/cpu,cpuacct/machine.slice' (Expected 0, got 0)

Any idea when these packages will show up in the service stream?

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0509.html