Bug 1956453

Summary: systemd-run sets incorrect value to /sys/fs/cgroup/*/cgroup.subtree_control with cgroupv2 on RHEL9
Product: Red Hat Enterprise Linux 9
Reporter: Troy Wilson <trwilson>
Component: systemd
Assignee: systemd-maint
Status: CLOSED NOTABUG
QA Contact: Frantisek Sumsal <fsumsal>
Severity: unspecified
Priority: unspecified
Version: 9.0
CC: dtardon, llong, systemd-maint-list
Target Milestone: beta
Target Release: ---
Hardware: x86_64
OS: Linux
Last Closed: 2021-05-04 16:32:11 UTC
Type: Bug

Description Troy Wilson 2021-05-03 17:51:53 UTC
Description of problem:
When using systemd-run on RHEL9 with cgroupv2, systemd will put the value "cpu" into the /sys/fs/cgroup/*/cgroup.subtree_control files, which has the effect of destroying any existing v2 cgroup that is configured.  I don't know if systemd-run should or should not be putting anything in /sys/fs/cgroup/*/cgroup.subtree_control, but if it does, the keyword for cgroupv2 is "cpuset", not "cpu".  When the keyword "cpuset" is not present in the /sys/fs/cgroup/*/cgroup.subtree_control files, the other files used to control the cpusets no longer exist (cpuset.cpus, cpuset.cpus.partition etc).

Version-Release number of selected component (if applicable):
RHEL9: RHEL-9.0.0-20210428.3
Systemd: systemd 247 (v247.3-2.el9)

How reproducible:
100%

Steps to Reproduce:
1. Configure a v2 cgroup that includes cpusets
2. Invoke systemd-run with the --slice= argument pointing at the configured v2 cgroup

Actual results:
This was captured during a run started via 'systemd-run --slice=user-1000.slice .....'
Prior to the run, the cgroup was configured as below in 'Expected results'.

cat /sys/fs/cgroup/cgroup.subtree_control
cpu memory pids
cat /sys/fs/cgroup/system.slice/cpuset.cpus
cat: /sys/fs/cgroup/system.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/cpuset.cpus
cat: /sys/fs/cgroup/user.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
cat: /sys/fs/cgroup/user.slice/cpuset.cpus.partition: No such file or directory
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
cpu memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
cat: /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
cat: /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition: No such file or directory

Expected results:
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/system.slice/cpuset.cpus

cat /sys/fs/cgroup/user.slice/cpuset.cpus
4-39
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
root
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
5-39
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
root

Additional info:

Comment 1 David Tardon 2021-05-04 08:59:22 UTC
(In reply to Troy Wilson from comment #0)
> Description of problem:
> When using systemd-run on RHEL9 with cgroupv2, systemd will put the value
> "cpu" into the /sys/fs/cgroup/*/cgroup.subtree_control files, which has the
> effect of destroying any existing v2 cgroup that is configured.  I don't
> know if systemd-run should or should not be putting anything in
> /sys/fs/cgroup/*/cgroup.subtree_control,

systemd-run just forwards the command (together with supplied options) to systemd, which starts it as a transient unit.

> but if it does, the keyword for
> cgroupv2 is "cpuset", not "cpu".

https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v2.rst#cpu

> Steps to Reproduce:
> 1. Configure a v2 cgroup that includes cpusets

How did you configure it?

Comment 2 Troy Wilson 2021-05-04 15:08:14 UTC
>> but if it does, the keyword for
>> cgroupv2 is "cpuset", not "cpu".
>
> https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v2.rst#cpu

I stated that poorly and was thinking only in terms of cgroups, sorry.  I used the cpuset controller (https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v2.rst#cpuset) to configure a cgroup for user-1000.slice and then started a workload in that slice using systemd-run.  It looks like systemd adds the cpu controller to cgroup.subtree_control if I specify a CPUQuota, but it also seems to remove the cpuset controller, which I wouldn't expect.

>> Steps to Reproduce:
>> 1. Configure a v2 cgroup that includes cpusets
>
> How did you configure it?

My system has 40 CPUs, I assign 4-39 to user.slice and then 5-39 to user-1000.slice

     echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
     echo 4-39 > /sys/fs/cgroup/user.slice/cpuset.cpus
     echo "root" > /sys/fs/cgroup/user.slice/cpuset.cpus.partition
     echo "+cpuset" > /sys/fs/cgroup/user.slice/cgroup.subtree_control
     echo 5-39 > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
     echo root > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition

Here is an example of what I am seeing.

[root@fedora ~]# ./setup.sh
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
echo 4-39 > /sys/fs/cgroup/user.slice/cpuset.cpus
echo "root" > /sys/fs/cgroup/user.slice/cpuset.cpus.partition
echo "+cpuset" > /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo 5-39 > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
echo root > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
[root@fedora ~]# ./state.sh 
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/system.slice/cpuset.cpus

cat /sys/fs/cgroup/user.slice/cpuset.cpus
4-39
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
root
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
5-39
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
root
[root@fedora ~]# systemd-run --slice=user-1000.slice --property=CPUQuota=3500% sleep 10
Running as unit: run-r27a8c0ed364f4345b34eccc4e4871bee.service
[root@fedora ~]# ./state.sh 
cat /sys/fs/cgroup/cgroup.subtree_control
cpu memory pids    <--- if I specify a CPUQuota, "cpu" replaces "cpuset", which destroys the cgroup (this is collected during the 10 seconds while the sleep runs)
cat /sys/fs/cgroup/system.slice/cpuset.cpus
cat: /sys/fs/cgroup/system.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/cpuset.cpus
cat: /sys/fs/cgroup/user.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
cat: /sys/fs/cgroup/user.slice/cpuset.cpus.partition: No such file or directory
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
cpu memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
cat: /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
cat: /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition: No such file or directory
[root@fedora ~]# ./state.sh 
cat /sys/fs/cgroup/cgroup.subtree_control
memory pids    <--- after the sleep has completed, both the "cpu" and "cpuset" keywords are gone
cat /sys/fs/cgroup/system.slice/cpuset.cpus
cat: /sys/fs/cgroup/system.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/cpuset.cpus
cat: /sys/fs/cgroup/user.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
cat: /sys/fs/cgroup/user.slice/cpuset.cpus.partition: No such file or directory
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
cat: /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus: No such file or directory
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
cat: /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition: No such file or directory
[root@fedora ~]# 
[root@fedora ~]# 
[root@fedora ~]# ./setup.sh
echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
echo 4-39 > /sys/fs/cgroup/user.slice/cpuset.cpus
echo "root" > /sys/fs/cgroup/user.slice/cpuset.cpus.partition
echo "+cpuset" > /sys/fs/cgroup/user.slice/cgroup.subtree_control
echo 5-39 > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
echo root > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
[root@fedora ~]# ./state.sh 
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/system.slice/cpuset.cpus

cat /sys/fs/cgroup/user.slice/cpuset.cpus
4-39
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
root
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
5-39
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
root
[root@fedora ~]# systemd-run --slice=user-1000.slice  sleep 10
Running as unit: run-rd3cd4df89d4d41dd9784be1e4956a844.service
[root@fedora ~]# ./state.sh
cat /sys/fs/cgroup/cgroup.subtree_control
cpuset memory pids    <--- if I invoke systemd-run without CPUQuota, the configured cgroup stays intact
cat /sys/fs/cgroup/system.slice/cpuset.cpus

cat /sys/fs/cgroup/user.slice/cpuset.cpus
4-39
cat /sys/fs/cgroup/user.slice/cpuset.cpus.partition
root
cat /sys/fs/cgroup/user.slice/cgroup.subtree_control
cpuset memory pids
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
5-39
cat /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition
root
[root@fedora ~]#
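
The state.sh helper used in the transcript above is not included in the report. A minimal sketch of what such a helper might look like follows; the function names and the CGROUP_ROOT variable are assumptions made for illustration, not the reporter's actual script.

```shell
#!/bin/sh
# Hypothetical stand-in for state.sh; the reporter's actual script is not shown.
# CGROUP_ROOT is an assumed override for pointing the sketch at a scratch directory.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# Echo the command, then run it, mirroring the "cat <path>" lines in the transcript.
show() {
    echo "cat $1"
    cat "$1"
}

# Dump the subtree_control and cpuset files inspected in this report.
print_state() {
    show "$CGROUP_ROOT/cgroup.subtree_control"
    show "$CGROUP_ROOT/system.slice/cpuset.cpus"
    show "$CGROUP_ROOT/user.slice/cpuset.cpus"
    show "$CGROUP_ROOT/user.slice/cpuset.cpus.partition"
    show "$CGROUP_ROOT/user.slice/cgroup.subtree_control"
    show "$CGROUP_ROOT/user.slice/user-1000.slice/cpuset.cpus"
    show "$CGROUP_ROOT/user.slice/user-1000.slice/cpuset.cpus.partition"
}
```

Calling print_state before, during, and after a systemd-run invocation would give the three snapshots compared above. A missing cpuset.* file shows up as a "No such file or directory" error from cat, which is the signal that the cpuset controller is no longer enabled in the parent's cgroup.subtree_control.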

Comment 3 David Tardon 2021-05-04 16:32:11 UTC
(In reply to Troy Wilson from comment #2)
> >> Steps to Reproduce:
> >> 1. Configure a v2 cgroup that includes cpusets
> >
> > How did you configure it?
> 
> My system has 40 CPUs, I assign 4-39 to user.slice and then 5-39 to
> user-1000.slice
> 
>      echo "+cpuset" > /sys/fs/cgroup/cgroup.subtree_control
>      echo 4-39 > /sys/fs/cgroup/user.slice/cpuset.cpus
>      echo "root" > /sys/fs/cgroup/user.slice/cpuset.cpus.partition
>      echo "+cpuset" > /sys/fs/cgroup/user.slice/cgroup.subtree_control
>      echo 5-39 > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus
>      echo root > /sys/fs/cgroup/user.slice/user-1000.slice/cpuset.cpus.partition

Manual modification of cgroups owned by systemd is not supported. If you want to manage your own subhierarchy, use delegation (https://systemd.io/CGROUP_DELEGATION/).
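
Delegation covers the case of managing a private subhierarchy. For the narrower goal in this report (keeping a workload on CPUs 5-39 while also applying a CPUQuota), one possible alternative, which is not proposed anywhere in this thread, is to let systemd write the cpuset settings itself through the AllowedCPUs= resource-control property. The sketch below only assembles such a systemd-run command line. The run_in_slice function and the DRY_RUN variable are assumptions of this example, and it is untested whether this fits the reporter's setup; in particular it does not set cpuset.cpus.partition.

```shell
#!/bin/sh
# Sketch: ask systemd to apply the CPU restriction instead of writing cpuset.* by hand.
# AllowedCPUs= and CPUQuota= are systemd resource-control properties.
# run_in_slice and DRY_RUN are hypothetical names used only in this example.

# run_in_slice SLICE CPUS QUOTA COMMAND [ARGS...]
run_in_slice() {
    slice="$1"; cpus="$2"; quota="$3"
    shift 3
    # Rebuild the argument list as a full systemd-run invocation.
    set -- systemd-run --slice="$slice" \
        --property=AllowedCPUs="$cpus" \
        --property=CPUQuota="$quota" "$@"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: print the command instead of running it.
        echo "$@"
    else
        "$@"
    fi
}
```

With the values from comment 2, `run_in_slice user-1000.slice 5-39 3500% sleep 10` prints the command by default; setting DRY_RUN=0 would execute it.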

Comment 4 Waiman Long 2021-05-06 14:52:04 UTC
(In reply to David Tardon from comment #1)
> (In reply to Troy Wilson from comment #0)
> > Description of problem:
> > When using systemd-run on RHEL9 with cgroupv2, systemd will put the value
> > "cpu" into the /sys/fs/cgroup/*/cgroup.subtree_control files, which has the
> > effect of destroying any existing v2 cgroup that is configured.  I don't
> > know if systemd-run should or should not be putting anything in
> > /sys/fs/cgroup/*/cgroup.subtree_control,
> 
> systemd-run just forwards the command (together with supplied options) to
> systemd, which starts it as a transient unit.
> 
>  but if it does, the keyword for
> > cgroupv2 is "cpuset", not "cpu".
> 
> https://github.com/torvalds/linux/blob/master/Documentation/admin-guide/cgroup-v2.rst#cpu
> 
> > Steps to Reproduce:
> > 1. Configure a v2 cgroup that includes cpusets
> 
> How did you configure it?

The way cgroup v2 works is that you have to enable a specific controller level by level. Writing "+cpuset" to a cgroup's cgroup.subtree_control, for example, enables the cpuset controller in that cgroup's children. The grandchildren, however, won't have cpuset enabled. Each child has to enable it in its own cgroup.subtree_control to allow the grandchildren to use cpuset.

Generally speaking, you can enable all the controllers except one in all the cgroups. The exception is the cpu controller, because a nested cpu controller hierarchy causes some performance degradation, so care must be taken when enabling the cpu controller. Engineers upstream are trying to fix this problem, but it will probably take a while.

-Longman
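
The level-by-level enabling described in comment 4 can be sketched as a small shell function. This only illustrates the kernel interface: per comment 3, writing into cgroups owned by systemd is not supported, so the sketch is meant for a delegated subtree or a scratch directory. The function name and the CGROUP_ROOT variable are assumptions of this example.

```shell
#!/bin/sh
# CGROUP_ROOT is an assumed override so the sketch can be tried on a scratch directory.
CGROUP_ROOT="${CGROUP_ROOT:-/sys/fs/cgroup}"

# enable_controller_down CONTROLLER RELATIVE_PATH   (hypothetical helper)
# Writes "+CONTROLLER" into cgroup.subtree_control of the root and of every
# ancestor of RELATIVE_PATH, top down. The target's own subtree_control is left
# alone, since only its ancestors decide which controllers the target gets.
enable_controller_down() {
    controller="$1"
    target="$2"
    current="$CGROUP_ROOT"
    # Split the relative path on "/" into positional parameters.
    old_ifs="$IFS"
    IFS="/"
    set -- $target
    IFS="$old_ifs"
    for component in "$@"; do
        echo "+$controller" > "$current/cgroup.subtree_control" || return 1
        current="$current/$component"
    done
}
```

For example, `enable_controller_down cpuset user.slice/user-1000.slice` performs the same two subtree_control writes as the setup commands in comment 2: one at the root and one in user.slice.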