Bug 1789824

Summary: cpu.shares for existing scopes under machine.slice are reset to default when creating a new scope after daemon-reload
Product: Red Hat Enterprise Linux 7 Reporter: Ben Arblaster <ben>
Component: libvirtAssignee: Pavel Hrdina <phrdina>
Status: CLOSED WONTFIX QA Contact: yisun
Severity: unspecified Docs Contact:
Priority: high    
Version: 7.7CC: ahadas, ben, d.konrad, dyuan, gveitmic, jsuchane, lmen, marcandre.lureau, mkalinin, msekleta, neil, nkoenig, phrdina, systemd-maint-list, xuzhang, yalzhang, yanglei0385
Target Milestone: rcKeywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1798463 (view as bug list) Environment:
Last Closed: 2021-03-03 12:47:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1798463, 1798464, 1934484    
Attachments:
Description Flags
libvirt XML for test VM none

Description Ben Arblaster 2020-01-10 14:09:45 UTC
Created attachment 1651301 [details]
libvirt XML for test VM

Description of problem:

The cgroups cpu.shares limits applied to existing scopes (e.g. machine-qemu*, systemd-nspawn*) under /sys/fs/cgroup/cpu/machine.slice/ are reset to the default value (of 1024) when next creating a new scope via machinectl or libvirt after performing a systemctl daemon-reload.

However, if you manually create a new scope under machine.slice and change the cpu.shares to a different value, but don't allocate a process to it that is registered with systemd-machined, then the value of cpu.shares is preserved.

This affects both (QEMU/KVM) VMs created via libvirt and nspawn containers created via machinectl.


Version-Release number of selected component (if applicable):

systemd-219-67.el7_7.2
libvirt-4.5.0-23.el7_7.3


How reproducible:

Consistently reproducible.


Steps to Reproduce:

1. virsh create test1.xml
2. systemctl daemon-reload
3. virsh create test2.xml
4. systemctl daemon-reload
5. virsh create test3.xml

(See attached libvirt XML definition)


Actual results:

After (1):

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:2048

After (3):

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:1024
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:2048

After (5):

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:1024
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:1024
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d151\x2dtest3.scope/cpu.shares:2048


Expected results:

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:2048
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:2048
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d151\x2dtest3.scope/cpu.shares:2048


Additional info:

This seems similar in scope to the runc bug detailed in #1455071 and the systemd bug detailed in #1139223.

I've confirmed that the Delegate= directive is correctly set and applied to the existing machine scopes under machine.slice:

# cat /run/systemd/system/machine-qemu\\x2d148\\x2dtest1.scope.d/50-Delegate.conf
[Scope]
Delegate=yes

# systemctl show "machine-qemu\\x2d148\\x2dtest1.scope" | grep Delegate
Delegate=yes

Comment 2 Michal Sekletar 2020-01-13 13:34:11 UTC
It is possible that libvirt sets the CPU shares manually and not through systemd APIs. In which case reset of the value on daemon-reload would be expected. 

Can you please call systemctl show <machine-scope> command after step (1) and post the results here?

Comment 3 Neil Wilson 2020-01-13 14:07:56 UTC
Machine scope after step 1 as requested

# systemctl show 'machine-qemu\x2d464\x2dtest1.scope' 
TimeoutStopUSec=1min 30s
Result=success
Slice=machine.slice
ControlGroup=/machine.slice/machine-qemu\x2d464\x2dtest1.scope
MemoryCurrent=34344960
TasksCurrent=6
Delegate=yes
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
BlockIOAccounting=no
BlockIOWeight=18446744073709551615
StartupBlockIOWeight=18446744073709551615
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=machine-qemu\x2d464\x2dtest1.scope
Names=machine-qemu\x2d464\x2dtest1.scope
Requires=machine.slice
Conflicts=shutdown.target
Before=virt-guest-shutdown.target shutdown.target
After=machine.slice libvirtd.service
Description=Virtual Machine qemu-464-test1
LoadState=loaded
ActiveState=active
SubState=running
FragmentPath=/run/systemd/system/machine-qemu\x2d464\x2dtest1.scope
DropInPaths=/run/systemd/system/machine-qemu\x2d464\x2dtest1.scope.d/50-After-libvirtd\x2eservice.conf /run/systemd/system/machine-qemu\x2d464\x2dtest1
UnitFileState=static
UnitFilePreset=disabled
InactiveExitTimestamp=Mon 2020-01-13 14:01:31 UTC
InactiveExitTimestampMonotonic=508938827993
ActiveEnterTimestamp=Mon 2020-01-13 14:01:31 UTC
ActiveEnterTimestampMonotonic=508938827993
ActiveExitTimestampMonotonic=0
InactiveEnterTimestampMonotonic=0
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=yes
IgnoreOnSnapshot=yes
NeedDaemonReload=no
JobTimeoutUSec=0
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Mon 2020-01-13 14:01:31 UTC
ConditionTimestampMonotonic=508938817901
AssertTimestamp=Mon 2020-01-13 14:01:31 UTC
AssertTimestampMonotonic=508938817902
Transient=yes

Comment 4 Ben Arblaster 2020-01-13 14:31:48 UTC
(In reply to Michal Sekletar from comment #2)
> It is possible that libvirt sets the CPU shares manually and not through
> systemd APIs. In which case reset of the value on daemon-reload would be
> expected. 

This appears to be the case. If I use 'virsh schedinfo' to modify the cpu shares of a running VM, I don't see any dbus messages between libvirt and systemd and I can see that the mtime of the relevant file under the cgroupfs mount gets updated. The output of 'systemctl show <machine-scope>' is the same before and after.

# ls -lh '/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares'
-rw-r--r--. 1 root root 0 Jan 13 14:18 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares

# virsh schedinfo test1 --live --set cpu_shares=4096
Scheduler      : posix
cpu_shares     : 4096
vcpu_period    : 100000
vcpu_quota     : -1
emulator_period: 100000
emulator_quota : -1
global_period  : 100000
global_quota   : -1
iothread_period: 100000
iothread_quota : -1

# ls -lh '/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares'
-rw-r--r--. 1 root root 0 Jan 13 14:21 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares

Comment 5 Michal Sekletar 2020-01-14 12:48:33 UTC
I think this needs to be fixed on libvirt side then.

Comment 6 Pavel Hrdina 2020-01-14 13:40:21 UTC
(In reply to Michal Sekletar from comment #5)
> I think this needs to be fixed on libvirt side then.

That's not true, from comment 3 it is clear that the Delegate is enabled which means that libvirt can write to cgroups files directly and is the only single-writer [1].

This sounds more like a bug in systemd where the Delegation is not honored in the daemon-reload case.

Michal, if you agree please move this BZ back to systemd.

[1] <>https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

Comment 7 Michal Sekletar 2020-01-16 09:28:00 UTC
Actually, most up2date documentation on cgroup delegation lives here,

https://systemd.io/CGROUP_DELEGATION/

Note 1) in Delegation section,

1) systemd won’t fiddle with your sub-tree of the cgroup tree anymore. It won’t change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore.

Also note Dos and Dont's section, specifically,

🚫 Never write to any of the attributes of a cgroup systemd created for you. It’s systemd’s private property. You are welcome to manipulate the attributes of cgroups you created in your own delegated sub-tree, but the cgroup tree of systemd itself is out of limits for you. It’s fine to read from any attribute you like however. That’s totally OK and welcome.

Hence if you register the machine and systemd creates the scope unit for you then cgroup properties of the scope itself are off-limits to you. Only properties of your own sub-cgroups that you create *below* the scope can be modified directly.

Comment 8 Pavel Hrdina 2020-03-04 13:55:04 UTC
Based on the Comment 7 and our discussion with Michal the issue is in fact in libvirt.  We use the delegation incorrectly the whole time systemd exists and we need to rewrite it in libvirt.  This affects upstream libvirt as well.

Comment 9 Dennis Konrad 2020-03-06 13:18:33 UTC
Hi,

is there any activity on this? Is there a workaround to tell systemd to not touch the cgroup cpu.shares to tolerate libvirts changes?
This issue is creates a lot of steal time for our VMs (as one would expect).

There are a lot of tickets for this but not a lot to see in them:

https://bugzilla.redhat.com/show_bug.cgi?id=1798463
https://bugzilla.redhat.com/show_bug.cgi?id=1798464

Best regards
Dennis

Comment 10 Pavel Hrdina 2020-03-06 13:42:52 UTC
No activity yet, it will require some redesigning of libvirt internals for cgroups.  I'm not aware of any workaround from systemd side.  I'll look into it ASAP, but it will take some time to backport it to RHEL-7.  The other bugs are trackers to make sure that it is backported to RHEL-8 as well.

Comment 11 Dennis Konrad 2020-03-06 13:53:55 UTC
I'm currently trying to wrap my head around https://systemd.io/CGROUP_DELEGATION/. 

Shouldn't it be possible to delegate to the cgroup controller used by qemu? After creating the VM one could manually delegate via the scope to avoid changes from systemd.
Based on the following excerpt from above:

...
By turning on the Delegate= property for a scope or service you get a few guarantees:

1. systemd won’t fiddle with your sub-tree of the cgroup tree anymore. It won’t change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore.
...

Comment 12 Dennis Konrad 2020-03-06 19:33:53 UTC
After some more reading I also see that there's the need to rewrite libvirts handling of cgroups created by systemd.

I think a possible workaround could be to implement a qemu hook that explicitly sets the systemd cpushares/cpuweight to match what libvirt sets directly (and thereby not using the systemd utilities).

Will try getting that to work in a few days and report back if that's working.

One thing puzzles me though: How was the current implementation working until a few month anyway? My understanding is this should have been broken from the beginning on.

Comment 13 Dennis Konrad 2020-03-09 10:28:23 UTC
It is possible to work around this issue with a qemu hook that uses 'systemctl set-property' to let systemd know about the shares assigned to a VM.

I reproduced the error with the steps above. After setting the shares explicitly (e.g. systemctl set-property $vm_scope CPUShares=$shares) it doesn't reset the value as one would expect.

Comment 14 Pavel Hrdina 2020-03-10 13:12:42 UTC
That's basically what we will have to do in libvirt, use dbus to talk to systemd to set some of the values via systemd and also add another directory to the cgroups topology where we are allowed to do whatever we need.

Comment 17 Yanglei 2020-12-04 11:22:01 UTC
Hi, is there any news about this issue? I encountered a similar bug.

When I set a cpu.share in XML and start VM, the value in /sys/fs/cgroup/cpu/cpu.shares is correct.
However, after "sytemctl daemon-reload"(also a new ssh would trigger), it will be changed into previous value. I guess that's the same problem.

The newest version(20201204) of libvirt vircgourpv1.c/vircgourpv2.c use virCgroupSetValueU64() to set cgroup property.
Should we register another CALLBACK to call "systemctl setproperty"? Or releated dbus api?

Comment 18 Yanglei 2020-12-04 11:30:06 UTC
(In reply to Yanglei from comment #17)
> Hi, is there any news about this issue? I encountered a similar bug.
> 
> When I set a cpu.share in XML and start VM, the value in
> /sys/fs/cgroup/cpu/cpu.shares is correct.
> However, after "sytemctl daemon-reload"(also a new ssh would trigger), it
> will be changed into previous value. I guess that's the same problem.
> 
> The newest version(20201204) of libvirt vircgourpv1.c/vircgourpv2.c use
> virCgroupSetValueU64() to set cgroup property.
> Should we register another CALLBACK to call "systemctl setproperty"? Or
> releated dbus api?

/sys/fs/cgroup/cpu/<vm scope>/cpu.shares
sorry its a slip of the pen.

Comment 19 Pavel Hrdina 2021-02-18 14:08:44 UTC
Fixed in upstream:

6a1f5e8a4f vircgroup: correctly free nested virCgroupPtr
85099c3393 tests: add cgroup nested tests
184245f53b vircgroup: introduce nested cgroup to properly work with systemd
badc2bcc73 vircgroup: introduce virCgroupV1Exists and virCgroupV2Exists
382fa15cde vircgroupv2: move task into cgroup before enabling controllers
5f56dd7c83 vircgroupv1: refactor virCgroupV1DetectPlacement
9c1693eff4 vircgroup: use DBus call to systemd for some APIs
d3fb774b1e virsystemd: introduce virSystemdGetMachineUnitByPID
385704d5a4 virsystemd: introduce virSystemdGetMachineByPID
a51147d906 virsystemd: export virSystemdHasMachined

Comment 27 Pavel Hrdina 2021-03-03 12:47:30 UTC
Based on internal discussion I'm closing this BZ as wontfix.

In RHEL-7 we are now in maintenance support 2 phase and this issue was not classify to be critical enough.

There is an existing workaround by using libvirt hooks [1] where the script can call `systemctl set-property` to set the value in systemd to prevent `systemctl daemon-reload` resetting that value.

[1] <https://libvirt.org/hooks.html>