Bug 1789824
Summary: | cpu.shares for existing scopes under machine.slice are reset to default when creating a new scope after daemon-reload | ||||||
---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Ben Arblaster <ben> | ||||
Component: | libvirt | Assignee: | Pavel Hrdina <phrdina> | ||||
Status: | CLOSED WONTFIX | QA Contact: | yisun | ||||
Severity: | unspecified | Docs Contact: | |||||
Priority: | high | ||||||
Version: | 7.7 | CC: | ahadas, ben, d.konrad, dyuan, gveitmic, jsuchane, lmen, marcandre.lureau, mkalinin, msekleta, neil, nkoenig, phrdina, systemd-maint-list, xuzhang, yalzhang, yanglei0385 | ||||
Target Milestone: | rc | Keywords: | Triaged | ||||
Target Release: | --- | ||||||
Hardware: | x86_64 | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | |||||||
: | 1798463 (view as bug list) | Environment: | |||||
Last Closed: | 2021-03-03 12:47:30 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 1798463, 1798464, 1934484 | ||||||
Attachments: |
|
It is possible that libvirt sets the CPU shares manually and not through systemd APIs. In which case reset of the value on daemon-reload would be expected. Can you please call systemctl show <machine-scope> command after step (1) and post the results here? Machine scope after step 1 as requested # systemctl show 'machine-qemu\x2d464\x2dtest1.scope' TimeoutStopUSec=1min 30s Result=success Slice=machine.slice ControlGroup=/machine.slice/machine-qemu\x2d464\x2dtest1.scope MemoryCurrent=34344960 TasksCurrent=6 Delegate=yes CPUAccounting=no CPUShares=18446744073709551615 StartupCPUShares=18446744073709551615 CPUQuotaPerSecUSec=infinity BlockIOAccounting=no BlockIOWeight=18446744073709551615 StartupBlockIOWeight=18446744073709551615 MemoryAccounting=no MemoryLimit=18446744073709551615 DevicePolicy=auto TasksAccounting=no TasksMax=18446744073709551615 KillMode=control-group KillSignal=15 SendSIGKILL=yes SendSIGHUP=no Id=machine-qemu\x2d464\x2dtest1.scope Names=machine-qemu\x2d464\x2dtest1.scope Requires=machine.slice Conflicts=shutdown.target Before=virt-guest-shutdown.target shutdown.target After=machine.slice libvirtd.service Description=Virtual Machine qemu-464-test1 LoadState=loaded ActiveState=active SubState=running FragmentPath=/run/systemd/system/machine-qemu\x2d464\x2dtest1.scope DropInPaths=/run/systemd/system/machine-qemu\x2d464\x2dtest1.scope.d/50-After-libvirtd\x2eservice.conf /run/systemd/system/machine-qemu\x2d464\x2dtest1 UnitFileState=static UnitFilePreset=disabled InactiveExitTimestamp=Mon 2020-01-13 14:01:31 UTC InactiveExitTimestampMonotonic=508938827993 ActiveEnterTimestamp=Mon 2020-01-13 14:01:31 UTC ActiveEnterTimestampMonotonic=508938827993 ActiveExitTimestampMonotonic=0 InactiveEnterTimestampMonotonic=0 CanStart=yes CanStop=yes CanReload=no CanIsolate=no StopWhenUnneeded=no RefuseManualStart=no RefuseManualStop=no AllowIsolate=no DefaultDependencies=yes OnFailureJobMode=replace IgnoreOnIsolate=yes IgnoreOnSnapshot=yes NeedDaemonReload=no JobTimeoutUSec=0 JobTimeoutAction=none ConditionResult=yes AssertResult=yes ConditionTimestamp=Mon 2020-01-13 14:01:31 UTC ConditionTimestampMonotonic=508938817901 AssertTimestamp=Mon 2020-01-13 14:01:31 UTC AssertTimestampMonotonic=508938817902 Transient=yes (In reply to Michal Sekletar from comment #2) > It is possible that libvirt sets the CPU shares manually and not through > systemd APIs. In which case reset of the value on daemon-reload would be > expected. This appears to be the case. If I use 'virsh schedinfo' to modify the cpu shares of a running VM, I don't see any dbus messages between libvirt and systemd and I can see that the mtime of the relevant file under the cgroupfs mount gets updated. The output of 'systemctl show <machine-scope>' is the same before and after. # ls -lh '/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares' -rw-r--r--. 1 root root 0 Jan 13 14:18 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares # virsh schedinfo test1 --live --set cpu_shares=4096 Scheduler : posix cpu_shares : 4096 vcpu_period : 100000 vcpu_quota : -1 emulator_period: 100000 emulator_quota : -1 global_period : 100000 global_quota : -1 iothread_period: 100000 iothread_quota : -1 # ls -lh '/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares' -rw-r--r--. 1 root root 0 Jan 13 14:21 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares I think this needs to be fixed on libvirt side then. (In reply to Michal Sekletar from comment #5) > I think this needs to be fixed on libvirt side then. That's not true, from comment 3 it is clear that the Delegate is enabled which means that libvirt can write to cgroups files directly and is the only single-writer [1]. This sounds more like a bug in systemd where the Delegation is not honored in the daemon-reload case. Michal, if you agree please move this BZ back to systemd. [1] <>https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/ Actually, most up2date documentation on cgroup delegation lives here, https://systemd.io/CGROUP_DELEGATION/ Note 1) in Delegation section, 1) systemd won’t fiddle with your sub-tree of the cgroup tree anymore. It won’t change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore. Also note Dos and Dont's section, specifically, 🚫 Never write to any of the attributes of a cgroup systemd created for you. It’s systemd’s private property. You are welcome to manipulate the attributes of cgroups you created in your own delegated sub-tree, but the cgroup tree of systemd itself is out of limits for you. It’s fine to read from any attribute you like however. That’s totally OK and welcome. Hence if you register the machine and systemd creates the scope unit for you then cgroup properties of the scope itself are off-limits to you. Only properties of your own sub-cgroups that you create *below* the scope can be modified directly. Based on the Comment 7 and our discussion with Michal the issue is in fact in libvirt. We use the delegation incorrectly the whole time systemd exists and we need to rewrite it in libvirt. This affects upstream libvirt as well. Hi, is there any activity on this? Is there a workaround to tell systemd to not touch the cgroup cpu.shares to tolerate libvirts changes? This issue is creates a lot of steal time for our VMs (as one would expect). There are a lot of tickets for this but not a lot to see in them: https://bugzilla.redhat.com/show_bug.cgi?id=1798463 https://bugzilla.redhat.com/show_bug.cgi?id=1798464 Best regards Dennis No activity yet, it will require some redesigning of libvirt internals for cgroups. I'm not aware of any workaround from systemd side. I'll look into it ASAP, but it will take some time to backport it to RHEL-7. The other bugs are trackers to make sure that it is backported to RHEL-8 as well. I'm currently trying to wrap my head around https://systemd.io/CGROUP_DELEGATION/. Shouldn't it be possible to delegate to the cgroup controller used by qemu? After creating the VM one could manually delegate via the scope to avoid changes from systemd. Based on the following excerpt from above: ... By turning on the Delegate= property for a scope or service you get a few guarantees: 1. systemd won’t fiddle with your sub-tree of the cgroup tree anymore. It won’t change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore. ... After some more reading I also see that there's the need to rewrite libvirts handling of cgroups created by systemd. I think a possible workaround could be to implement a qemu hook that explicitly sets the systemd cpushares/cpuweight to match what libvirt sets directly (and thereby not using the systemd utilities). Will try getting that to work in a few days and report back if that's working. One thing puzzles me though: How was the current implementation working until a few month anyway? My understanding is this should have been broken from the beginning on. It is possible to work around this issue with a qemu hook that uses 'systemctl set-property' to let systemd know about the shares assigned to a VM. I reproduced the error with the steps above. After setting the shares explicitly (e.g. systemctl set-property $vm_scope CPUShares=$shares) it doesn't reset the value as one would expect. That's basically what we will have to do in libvirt, use dbus to talk to systemd to set some of the values via systemd and also add another directory to the cgroups topology where we are allowed to do whatever we need. Hi, is there any news about this issue? I encountered a similar bug. When I set a cpu.share in XML and start VM, the value in /sys/fs/cgroup/cpu/cpu.shares is correct. However, after "sytemctl daemon-reload"(also a new ssh would trigger), it will be changed into previous value. I guess that's the same problem. The newest version(20201204) of libvirt vircgourpv1.c/vircgourpv2.c use virCgroupSetValueU64() to set cgroup property. Should we register another CALLBACK to call "systemctl setproperty"? Or releated dbus api? (In reply to Yanglei from comment #17) > Hi, is there any news about this issue? I encountered a similar bug. > > When I set a cpu.share in XML and start VM, the value in > /sys/fs/cgroup/cpu/cpu.shares is correct. > However, after "sytemctl daemon-reload"(also a new ssh would trigger), it > will be changed into previous value. I guess that's the same problem. > > The newest version(20201204) of libvirt vircgourpv1.c/vircgourpv2.c use > virCgroupSetValueU64() to set cgroup property. > Should we register another CALLBACK to call "systemctl setproperty"? Or > releated dbus api? /sys/fs/cgroup/cpu/<vm scope>/cpu.shares sorry its a slip of the pen. Fixed in upstream: 6a1f5e8a4f vircgroup: correctly free nested virCgroupPtr 85099c3393 tests: add cgroup nested tests 184245f53b vircgroup: introduce nested cgroup to properly work with systemd badc2bcc73 vircgroup: introduce virCgroupV1Exists and virCgroupV2Exists 382fa15cde vircgroupv2: move task into cgroup before enabling controllers 5f56dd7c83 vircgroupv1: refactor virCgroupV1DetectPlacement 9c1693eff4 vircgroup: use DBus call to systemd for some APIs d3fb774b1e virsystemd: introduce virSystemdGetMachineUnitByPID 385704d5a4 virsystemd: introduce virSystemdGetMachineByPID a51147d906 virsystemd: export virSystemdHasMachined Based on internal discussion I'm closing this BZ as wontfix. In RHEL-7 we are now in maintenance support 2 phase and this issue was not classify to be critical enough. There is an existing workaround by using libvirt hooks [1] where the script can call `systemctl set-property` to set the value in systemd to prevent `systemctl daemon-reload` resetting that value. [1] <https://libvirt.org/hooks.html> |
Created attachment 1651301 [details] libvirt XML for test VM Description of problem: The cgroups cpu.shares limits applied to existing scopes (e.g. machine-qemu*, systemd-nspawn*) under /sys/fs/cgroup/cpu/machine.slice/ are reset to the default value (of 1024) when next creating a new scope via machinectl or libvirt after performing a systemctl daemon-reload. However, if you manually create a new scope under machine.slice and change the cpu.shares to a different value, but don't allocate a process to it that is registered with systemd-machined, then the value of cpu.shares is preserved. This affects both (QEMU/KVM) VMs created via libvirt and nspawn containers created via machinectl. Version-Release number of selected component (if applicable): systemd-219-67.el7_7.2 libvirt-4.5.0-23.el7_7.3 How reproducible: Consistently reproducible. Steps to Reproduce: 1. virsh create test1.xml 2. systemctl daemon-reload 3. virsh create test2.xml 4. systemctl daemon-reload 5. virsh create test3.xml (See attached libvirt XML definition) Actual results: After (1): grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:2048 After (3): grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:1024 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:2048 After (5): grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:1024 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:1024 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d151\x2dtest3.scope/cpu.shares:2048 Expected results: grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:2048 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:2048 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d151\x2dtest3.scope/cpu.shares:2048 Additional info: This seems similar in scope to the runc bug detailed in #1455071 and the systemd bug detailed in #1139223. I've confirmed that the Delegate= directive is correctly set and applied to the existing machine scopes under machine.slice: # cat /run/systemd/system/machine-qemu\\x2d148\\x2dtest1.scope.d/50-Delegate.conf [Scope] Delegate=yes # systemctl show "machine-qemu\\x2d148\\x2dtest1.scope" | grep Delegate Delegate=yes