RHEL Engineering is moving the tracking of its product development work on RHEL 6 through RHEL 9 to Red Hat Jira (issues.redhat.com). If you're a Red Hat customer, please continue to file support cases via the Red Hat customer portal. If you're not, please head to the "RHEL project" in Red Hat Jira and file new tickets here. Individual Bugzilla bugs in the statuses "NEW", "ASSIGNED", and "POST" are being migrated throughout September 2023. Bugs of Red Hat partners with an assigned Engineering Partner Manager (EPM) are migrated in late September as per pre-agreed dates. Bugs against components "kernel", "kernel-rt", and "kpatch" are only migrated if still in "NEW" or "ASSIGNED". If you cannot log in to RH Jira, please consult article #7032570. That failing, please send an e-mail to the RH Jira admins at rh-issues@redhat.com to troubleshoot your issue as a user management inquiry. The email creates a ServiceNow ticket with Red Hat. Individual Bugzilla bugs that are migrated will be moved to status "CLOSED", resolution "MIGRATED", and set with "MigratedToJIRA" in "Keywords". The link to the successor Jira issue will be found under "Links", have a little "two-footprint" icon next to it, and direct you to the "RHEL project" in Red Hat Jira (issue links are of type "https://issues.redhat.com/browse/RHEL-XXXX", where "X" is a digit). This same link will be available in a blue banner at the top of the page informing you that that bug has been migrated.
Bug 1789824 - cpu.shares for existing scopes under machine.slice are reset to default when creating a new scope after daemon-reload
Summary: cpu.shares for existing scopes under machine.slice are reset to default when ...
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.7
Hardware: x86_64
OS: Linux
high
unspecified
Target Milestone: rc
: ---
Assignee: Pavel Hrdina
QA Contact: yisun
URL:
Whiteboard:
Depends On:
Blocks: 1798463 1798464 1934484
TreeView+ depends on / blocked
 
Reported: 2020-01-10 14:09 UTC by Ben Arblaster
Modified: 2021-03-03 12:48 UTC (History)
17 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1798463 (view as bug list)
Environment:
Last Closed: 2021-03-03 12:47:30 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
libvirt XML for test VM (2.50 KB, text/plain)
2020-01-10 14:09 UTC, Ben Arblaster
no flags Details

Description Ben Arblaster 2020-01-10 14:09:45 UTC
Created attachment 1651301 [details]
libvirt XML for test VM

Description of problem:

The cgroups cpu.shares limits applied to existing scopes (e.g. machine-qemu*, systemd-nspawn*) under /sys/fs/cgroup/cpu/machine.slice/ are reset to the default value (of 1024) when next creating a new scope via machinectl or libvirt after performing a systemctl daemon-reload.

However, if you manually create a new scope under machine.slice and change the cpu.shares to a different value, but don't allocate a process to it that is registered with systemd-machined, then the value of cpu.shares is preserved.

This affects both (QEMU/KVM) VMs created via libvirt and nspawn containers created via machinectl.


Version-Release number of selected component (if applicable):

systemd-219-67.el7_7.2
libvirt-4.5.0-23.el7_7.3


How reproducible:

Consistently reproducible.


Steps to Reproduce:

1. virsh create test1.xml
2. systemctl daemon-reload
3. virsh create test2.xml
4. systemctl daemon-reload
5. virsh create test3.xml

(See attached libvirt XML definition)


Actual results:

After (1):

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:2048

After (3):

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:1024
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:2048

After (5):

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:1024
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:1024
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d151\x2dtest3.scope/cpu.shares:2048


Expected results:

grep . /sys/fs/cgroup/cpu/machine.slice/*/cpu.shares
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d148\x2dtest1.scope/cpu.shares:2048
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d150\x2dtest2.scope/cpu.shares:2048
/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d151\x2dtest3.scope/cpu.shares:2048


Additional info:

This seems similar in scope to the runc bug detailed in #1455071 and the systemd bug detailed in #1139223.

I've confirmed that the Delegate= directive is correctly set and applied to the existing machine scopes under machine.slice:

# cat /run/systemd/system/machine-qemu\\x2d148\\x2dtest1.scope.d/50-Delegate.conf
[Scope]
Delegate=yes

# systemctl show "machine-qemu\\x2d148\\x2dtest1.scope" | grep Delegate
Delegate=yes

Comment 2 Michal Sekletar 2020-01-13 13:34:11 UTC
It is possible that libvirt sets the CPU shares manually and not through systemd APIs. In which case reset of the value on daemon-reload would be expected. 

Can you please call systemctl show <machine-scope> command after step (1) and post the results here?

Comment 3 Neil Wilson 2020-01-13 14:07:56 UTC
Machine scope after step 1 as requested

# systemctl show 'machine-qemu\x2d464\x2dtest1.scope' 
TimeoutStopUSec=1min 30s
Result=success
Slice=machine.slice
ControlGroup=/machine.slice/machine-qemu\x2d464\x2dtest1.scope
MemoryCurrent=34344960
TasksCurrent=6
Delegate=yes
CPUAccounting=no
CPUShares=18446744073709551615
StartupCPUShares=18446744073709551615
CPUQuotaPerSecUSec=infinity
BlockIOAccounting=no
BlockIOWeight=18446744073709551615
StartupBlockIOWeight=18446744073709551615
MemoryAccounting=no
MemoryLimit=18446744073709551615
DevicePolicy=auto
TasksAccounting=no
TasksMax=18446744073709551615
KillMode=control-group
KillSignal=15
SendSIGKILL=yes
SendSIGHUP=no
Id=machine-qemu\x2d464\x2dtest1.scope
Names=machine-qemu\x2d464\x2dtest1.scope
Requires=machine.slice
Conflicts=shutdown.target
Before=virt-guest-shutdown.target shutdown.target
After=machine.slice libvirtd.service
Description=Virtual Machine qemu-464-test1
LoadState=loaded
ActiveState=active
SubState=running
FragmentPath=/run/systemd/system/machine-qemu\x2d464\x2dtest1.scope
DropInPaths=/run/systemd/system/machine-qemu\x2d464\x2dtest1.scope.d/50-After-libvirtd\x2eservice.conf /run/systemd/system/machine-qemu\x2d464\x2dtest1
UnitFileState=static
UnitFilePreset=disabled
InactiveExitTimestamp=Mon 2020-01-13 14:01:31 UTC
InactiveExitTimestampMonotonic=508938827993
ActiveEnterTimestamp=Mon 2020-01-13 14:01:31 UTC
ActiveEnterTimestampMonotonic=508938827993
ActiveExitTimestampMonotonic=0
InactiveEnterTimestampMonotonic=0
CanStart=yes
CanStop=yes
CanReload=no
CanIsolate=no
StopWhenUnneeded=no
RefuseManualStart=no
RefuseManualStop=no
AllowIsolate=no
DefaultDependencies=yes
OnFailureJobMode=replace
IgnoreOnIsolate=yes
IgnoreOnSnapshot=yes
NeedDaemonReload=no
JobTimeoutUSec=0
JobTimeoutAction=none
ConditionResult=yes
AssertResult=yes
ConditionTimestamp=Mon 2020-01-13 14:01:31 UTC
ConditionTimestampMonotonic=508938817901
AssertTimestamp=Mon 2020-01-13 14:01:31 UTC
AssertTimestampMonotonic=508938817902
Transient=yes

Comment 4 Ben Arblaster 2020-01-13 14:31:48 UTC
(In reply to Michal Sekletar from comment #2)
> It is possible that libvirt sets the CPU shares manually and not through
> systemd APIs. In which case reset of the value on daemon-reload would be
> expected. 

This appears to be the case. If I use 'virsh schedinfo' to modify the cpu shares of a running VM, I don't see any dbus messages between libvirt and systemd and I can see that the mtime of the relevant file under the cgroupfs mount gets updated. The output of 'systemctl show <machine-scope>' is the same before and after.

# ls -lh '/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares'
-rw-r--r--. 1 root root 0 Jan 13 14:18 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares

# virsh schedinfo test1 --live --set cpu_shares=4096
Scheduler      : posix
cpu_shares     : 4096
vcpu_period    : 100000
vcpu_quota     : -1
emulator_period: 100000
emulator_quota : -1
global_period  : 100000
global_quota   : -1
iothread_period: 100000
iothread_quota : -1

# ls -lh '/sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares'
-rw-r--r--. 1 root root 0 Jan 13 14:21 /sys/fs/cgroup/cpu/machine.slice/machine-qemu\x2d419\x2dtest1.scope/cpu.shares

Comment 5 Michal Sekletar 2020-01-14 12:48:33 UTC
I think this needs to be fixed on libvirt side then.

Comment 6 Pavel Hrdina 2020-01-14 13:40:21 UTC
(In reply to Michal Sekletar from comment #5)
> I think this needs to be fixed on libvirt side then.

That's not true, from comment 3 it is clear that the Delegate is enabled which means that libvirt can write to cgroups files directly and is the only single-writer [1].

This sounds more like a bug in systemd where the Delegation is not honored in the daemon-reload case.

Michal, if you agree please move this BZ back to systemd.

[1] <>https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface/

Comment 7 Michal Sekletar 2020-01-16 09:28:00 UTC
Actually, most up2date documentation on cgroup delegation lives here,

https://systemd.io/CGROUP_DELEGATION/

Note 1) in Delegation section,

1) systemd won’t fiddle with your sub-tree of the cgroup tree anymore. It won’t change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore.

Also note Dos and Dont's section, specifically,

🚫 Never write to any of the attributes of a cgroup systemd created for you. It’s systemd’s private property. You are welcome to manipulate the attributes of cgroups you created in your own delegated sub-tree, but the cgroup tree of systemd itself is out of limits for you. It’s fine to read from any attribute you like however. That’s totally OK and welcome.

Hence if you register the machine and systemd creates the scope unit for you then cgroup properties of the scope itself are off-limits to you. Only properties of your own sub-cgroups that you create *below* the scope can be modified directly.

Comment 8 Pavel Hrdina 2020-03-04 13:55:04 UTC
Based on the Comment 7 and our discussion with Michal the issue is in fact in libvirt.  We use the delegation incorrectly the whole time systemd exists and we need to rewrite it in libvirt.  This affects upstream libvirt as well.

Comment 9 Dennis Konrad 2020-03-06 13:18:33 UTC
Hi,

is there any activity on this? Is there a workaround to tell systemd to not touch the cgroup cpu.shares to tolerate libvirts changes?
This issue is creates a lot of steal time for our VMs (as one would expect).

There are a lot of tickets for this but not a lot to see in them:

https://bugzilla.redhat.com/show_bug.cgi?id=1798463
https://bugzilla.redhat.com/show_bug.cgi?id=1798464

Best regards
Dennis

Comment 10 Pavel Hrdina 2020-03-06 13:42:52 UTC
No activity yet, it will require some redesigning of libvirt internals for cgroups.  I'm not aware of any workaround from systemd side.  I'll look into it ASAP, but it will take some time to backport it to RHEL-7.  The other bugs are trackers to make sure that it is backported to RHEL-8 as well.

Comment 11 Dennis Konrad 2020-03-06 13:53:55 UTC
I'm currently trying to wrap my head around https://systemd.io/CGROUP_DELEGATION/. 

Shouldn't it be possible to delegate to the cgroup controller used by qemu? After creating the VM one could manually delegate via the scope to avoid changes from systemd.
Based on the following excerpt from above:

...
By turning on the Delegate= property for a scope or service you get a few guarantees:

1. systemd won’t fiddle with your sub-tree of the cgroup tree anymore. It won’t change attributes of any cgroups below it, nor will it create or remove any cgroups thereunder, nor migrate processes across the boundaries of that sub-tree as it deems useful anymore.
...

Comment 12 Dennis Konrad 2020-03-06 19:33:53 UTC
After some more reading I also see that there's the need to rewrite libvirts handling of cgroups created by systemd.

I think a possible workaround could be to implement a qemu hook that explicitly sets the systemd cpushares/cpuweight to match what libvirt sets directly (and thereby not using the systemd utilities).

Will try getting that to work in a few days and report back if that's working.

One thing puzzles me though: How was the current implementation working until a few month anyway? My understanding is this should have been broken from the beginning on.

Comment 13 Dennis Konrad 2020-03-09 10:28:23 UTC
It is possible to work around this issue with a qemu hook that uses 'systemctl set-property' to let systemd know about the shares assigned to a VM.

I reproduced the error with the steps above. After setting the shares explicitly (e.g. systemctl set-property $vm_scope CPUShares=$shares) it doesn't reset the value as one would expect.

Comment 14 Pavel Hrdina 2020-03-10 13:12:42 UTC
That's basically what we will have to do in libvirt, use dbus to talk to systemd to set some of the values via systemd and also add another directory to the cgroups topology where we are allowed to do whatever we need.

Comment 17 Yanglei 2020-12-04 11:22:01 UTC
Hi, is there any news about this issue? I encountered a similar bug.

When I set a cpu.share in XML and start VM, the value in /sys/fs/cgroup/cpu/cpu.shares is correct.
However, after "sytemctl daemon-reload"(also a new ssh would trigger), it will be changed into previous value. I guess that's the same problem.

The newest version(20201204) of libvirt vircgourpv1.c/vircgourpv2.c use virCgroupSetValueU64() to set cgroup property.
Should we register another CALLBACK to call "systemctl setproperty"? Or releated dbus api?

Comment 18 Yanglei 2020-12-04 11:30:06 UTC
(In reply to Yanglei from comment #17)
> Hi, is there any news about this issue? I encountered a similar bug.
> 
> When I set a cpu.share in XML and start VM, the value in
> /sys/fs/cgroup/cpu/cpu.shares is correct.
> However, after "sytemctl daemon-reload"(also a new ssh would trigger), it
> will be changed into previous value. I guess that's the same problem.
> 
> The newest version(20201204) of libvirt vircgourpv1.c/vircgourpv2.c use
> virCgroupSetValueU64() to set cgroup property.
> Should we register another CALLBACK to call "systemctl setproperty"? Or
> releated dbus api?

/sys/fs/cgroup/cpu/<vm scope>/cpu.shares
sorry its a slip of the pen.

Comment 19 Pavel Hrdina 2021-02-18 14:08:44 UTC
Fixed in upstream:

6a1f5e8a4f vircgroup: correctly free nested virCgroupPtr
85099c3393 tests: add cgroup nested tests
184245f53b vircgroup: introduce nested cgroup to properly work with systemd
badc2bcc73 vircgroup: introduce virCgroupV1Exists and virCgroupV2Exists
382fa15cde vircgroupv2: move task into cgroup before enabling controllers
5f56dd7c83 vircgroupv1: refactor virCgroupV1DetectPlacement
9c1693eff4 vircgroup: use DBus call to systemd for some APIs
d3fb774b1e virsystemd: introduce virSystemdGetMachineUnitByPID
385704d5a4 virsystemd: introduce virSystemdGetMachineByPID
a51147d906 virsystemd: export virSystemdHasMachined

Comment 27 Pavel Hrdina 2021-03-03 12:47:30 UTC
Based on internal discussion I'm closing this BZ as wontfix.

In RHEL-7 we are now in maintenance support 2 phase and this issue was not classify to be critical enough.

There is an existing workaround by using libvirt hooks [1] where the script can call `systemctl set-property` to set the value in systemd to prevent `systemctl daemon-reload` resetting that value.

[1] <https://libvirt.org/hooks.html>


Note You need to log in before you can comment on or make changes to this bug.