Bug 1946801

Summary: Support cpuset.sched_load_balance by changing default CPUset directory structure
Product: Red Hat Enterprise Linux 9 Reporter: Andrew Theurer <atheurer>
Component: kernel    Assignee: Waiman Long <llong>
kernel sub component: Control Groups QA Contact: Chunyu Hu <chuhu>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: unspecified CC: adrian.fernandez-pello.fernandez, ailan, arozansk, atheurer, berrange, bhu, bmichalo, bperkins, browsell, cye, dcain, dgonyier, dhellmann, djdumas, dphillip, dshaks, eelena, egallen, fbaudin, fdeutsch, fherrman, fromani, jinqi, jlelli, jmario, jskarvad, kchamart, krister, lhuang, llong, longman, mpatel, mschuppe, msivak, mtosatti, nilal, nnarang, pauld, phrdina, rkhan, rolove, rphillips, sgordon, smalleni, trwilson, vcowan, william.caban, williams, xuzhang, yalzhang
Version: 9.0    Keywords: Triaged, ZStream
Target Milestone: beta   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-5.14.0-182.el9 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2161105 2161106 (view as bug list) Environment:
Last Closed: 2023-05-09 07:55:37 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version: 6.1
Embargoed:
Bug Depends On:    
Bug Blocks: 2161105, 2161106    

Description Andrew Theurer 2021-04-06 21:49:49 UTC
Description of problem:

Neither isolcpus=domain,<cpus> nor removal of SCHED_DOMAIN_BALANCE from sched-domains is possible on RHEL9, because the more recent kernel does not have these features.

These features are 100% necessary for Telco customers who require very low latency.  Not having this capability in RHEL9 risks low adoption.

In order to provide no-scheduler-balance capability in RHEL9, we need a restructuring of CPUset directory configuration.  While this can be done manually by the user, I believe the default CPUset directory structure should accommodate this capability in order to make this more user-friendly and to have a consistent directory structure which products like Openstack and Openshift can rely on.

Due to the nature of how the load-balance feature works[1], it is important to have the upper-most CPUset directory with cpuset.sched_load_balance=0.  In doing this, we need to provide a sub-directory for a new CPUset which has cpuset.sched_load_balance=1 for general purpose work (anything run by systemd).  By default, all system processes go in this sub-CPUset, and in both of these CPUsets, the cpuset.cpus = all-cpus.  For reference in the BZ, we will call this the "non-isolated CPUset".

In order to facilitate load-balancing disablement and CPU isolation, a third CPUset is created, the "isolated CPUset", directly below the first (upper-most) CPUset.  This third CPUset is at the same "level" as the "non-isolated CPUset", but it has cpuset.sched_load_balance=0.  As CPUs are needed for isolated workloads, they are removed from the non-isolated CPUset and added to the isolated CPUset.  To isolate CPUs individually for multiple workloads, multiple subdirectories are created under the isolated CPUset, and exclusive CPUs are allocated to each workload's CPUset.  Each workload's CPUset can have cpuset.sched_load_balance=0 or 1 depending on the workload's requirement.

Agreement on this directory structure is very important, as we potentially need multiple products to work with this structure, sometimes at the same time (systemd + Openstack-or-Openshift).  Both Openstack and Openshift need to support runtimes (VMs, containers, etc) having a shared CPU resource and a non-shared CPU resource, as well as a non-shared resource with no load balancing for ultra-low latency.  To do this, components from these products (Nova/libvirt, kubelet/crio) need to understand the concept of the two different CPUset directories and manipulate them in such a way that they play nicely together.

So, for RHEL, we need some support to make this happen, but before a full commitment, we need consensus from the other products/components that this will work.  Once we can agree on a new base directory structure for this, then we need to find where in RHEL this is best implemented.  For that reason, the components for this BZ may need to be changed.

Furthermore, once we agree this is something we will do, we should clone this BZ for each component that needs work to support the new structure.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpusets.html?highlight=cpuset#what-is-sched-load-balance

Example of the new default CPUset directory structure:

/sys/fs/cgroup/cpuset/:
cpuset.cpus: <all-cpus>
cpuset.sched_load_balance: 0
tasks:  <should only have per-cpu kernel threads>
./non-isolated/:
    cpuset.cpus: <housekeeping + any unused-isolated-cpus>
    cpuset.sched_load_balance: 1
    ./systemd/:
        cpuset.cpus: <matches systemd's CPUAffinity or housekeeping + any unused-isolated-cpus?>
        tasks: all system services
    ./kubepods.burstable/:
        cpuset.cpus: <housekeeping + any unused-isolated-cpus>
    ./kubepods.besteffort/:
        cpuset.cpus: <housekeeping + any unused-isolated-cpus>
    ./libvirt-shared-VMs/
./isolated/:
    cpuset.sched_load_balance: 0
    cpuset.cpus: <all used-isolated-cpus>
    ./kubepods-guaranteed/:
        ./gpod-1/:
            cpuset.cpus: <this pod’s cpu allocation from isolated-cpus>
            cpuset.sched_load_balance: 0 or 1
    ./libvirt-ded-cpu/:
        ./ded-vm-1/:
            cpuset.cpus: <this VM's cpu allocation from isolated-cpus>
            cpuset.sched_load_balance: 0 or 1



kubelet/container-runtime reqs:
-create new CPUsets for /usr/bin/pod under non-isolated CPUset
-create new CPUsets for burstable/best-effort pods in non-isolated CPUset
-create new CPUsets for guaranteed pods under isolated CPUset
-adjust cpuset.cpus in non-isolated and isolated CPUsets as guaranteed pods increase/decrease
nova/libvirt reqs:
-create new CPUsets for shared-cpu-VMs under non-isolated CPUset
-create new CPUsets for guaranteed-cpu VMs' vcpu-threads in isolated CPUset
-create new CPUsets for guaranteed-cpu VMs' emulator-threads in either isolated or non-isolated CPUset depending on emulator-thread policy.
 -Might use common CPUset for some VMs' emulator threads if user requests emulator thread-pool (although this might be more easily managed as a cpu-count for the emulator-thread pool)
-insert vhost-net threads in appropriate CPUset, non-isolated for shared-VMs and isolated (do we increase cpus?) for dedicated-cpu-VMs.
-adjust cpuset.cpus in non-isolated and isolated CPUsets as dedicated-cpu-VMs increase/decrease
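
A rough shell sketch of the proposed default structure (cgroup v1 cpuset assumed mounted at /sys/fs/cgroup/cpuset; the directory names and the 16-CPU ranges are illustrative only, not a committed layout):

  cd /sys/fs/cgroup/cpuset
  echo 0 > cpuset.sched_load_balance                  # upper-most CPUset: no balancing

  mkdir non-isolated                                  # general-purpose CPUset, balanced
  cat cpuset.mems > non-isolated/cpuset.mems
  echo 0-15 > non-isolated/cpuset.cpus                # housekeeping + unused isolated cpus
  echo 1 > non-isolated/cpuset.sched_load_balance

  mkdir isolated                                      # isolated CPUset, not balanced
  cat cpuset.mems > isolated/cpuset.mems
  echo 0 > isolated/cpuset.sched_load_balance

  # later, when a workload needs dedicated cpus 8-15:
  echo 0-7 > non-isolated/cpuset.cpus
  echo 8-15 > isolated/cpuset.cpus
  mkdir isolated/workload-1
  cat isolated/cpuset.mems > isolated/workload-1/cpuset.mems
  echo 8-11 > isolated/workload-1/cpuset.cpus
  echo 0 > isolated/workload-1/cpuset.sched_load_balance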

Comment 1 Daniel Berrangé 2021-04-07 10:16:39 UTC
From the libvirt side the following points occur to me on first reading...

The cpuset.sched_load_balance attribute doesn't exist in cgroups v2, only in v1.  I found some postings with patches adding sched_load_balance to v2, but AFAICT they aren't merged: https://lkml.org/lkml/2018/3/21/530
Since v1 is the legacy mode, it feels like a bad plan to be designing and implementing a new solution based on something that only exists in v1.  Is there a plan to fix this gap before RHEL-9?

The suggested hierarchy ignores the existing normal practice for cgroups layout in a systemd world. Out of the box it will set up a top level with three subtrees - user.slice, system.slice and machine.slice. IIUC, from a systemd POV these are just suggested defaults and can be changed; however, I think it would be a bad idea to change them, because this is the layout that all documentation refers to, and thus what both users and applications will reasonably expect to see.

libvirt  will put all VMs under machine.slice unless the mgmt application takes explicit steps to request they be placed elsewhere. So if machine.slice doesn't exist, this may well cause problems. Note also, that while cgroups v1 allows different controllers to setup distinct process hierarchies, libvirt decided to explicitly *not* support that, and expects to use the same hierarchy with all controllers - cgroups v2 now enforces this too. So we can't do a special hierarchy just for the cpuset controller - we have to consider cgroups as a whole.

Previous experience and feedback has been that flatter cgroups hierarchies lead to better performance. In the distant past, libvirt did include its own application name in the hierarchy, but stopped doing that in order to have a shallower cgroups hierarchy. Instead we expanded the naming convention of the .scope units we use so that names libvirt creates are less likely to clash with names used by other apps. IOW, overall I don't think it is a good idea to artificially subdivide the system based on individual application, just to avoid clashes - just pick the scope unit names more carefully.

Overall my inclination would be to try to stick with the out of the box systemd cgroups hierarchy with machine.slice, user.slice, system.slice, and then just add one additional top level  'isolated.slice'. The various mgmt applications should be free to place their resources immediately below these slices without being forced to further sub-divide per application - any further sub-division should be based on resource control needs, not application groupings.

Comment 2 Juri Lelli 2021-04-09 08:52:48 UTC
Hi,

as a self refresher, I played a bit with cgroup v2 cpusets on a F33 box by taking kernel
documentation as reference:

https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/cgroup-v2.rst#L1902

tl;dr: my current understanding is that appropriately using cpuset.cpus.partition(s)
should be sufficient to replicate v1 exclusive cpusets with sched_load_balance disabled.

I didn't set up the complete solution, but what follows should give an indication of
feasibility.

# grubby --args "sched_debug" --update-kernel /boot/vmlinuz-5.11.11-200.fc33.x86_64
# reboot

# echo 8 > /proc/sys/kernel/printk
# cd /sys/fs/cgroup/
# echo "+cpuset" > cgroup.subtree_control
# echo 0-3 > system.slice/cpuset.cpus
# echo 4-7 > user.slice/cpuset.cpus
# echo 8-23 > init.scope/cpuset.cpus

# echo root > system.slice/cpuset.cpus.partition
<console output>
[ 2065.086427] CPU0 attaching NULL sched-domain.
[ 2065.109197] CPU1 attaching NULL sched-domain.
[ 2065.131693] CPU2 attaching NULL sched-domain.
[ 2065.151279] CPU3 attaching NULL sched-domain.
[ 2065.170857] CPU4 attaching NULL sched-domain.
[ 2065.190461] CPU5 attaching NULL sched-domain.
[ 2065.210070] CPU6 attaching NULL sched-domain.
[ 2065.229674] CPU7 attaching NULL sched-domain.
[ 2065.249213] CPU8 attaching NULL sched-domain.
[ 2065.268820] CPU9 attaching NULL sched-domain.
[ 2065.288448] CPU10 attaching NULL sched-domain.
[ 2065.308553] CPU11 attaching NULL sched-domain.
[ 2065.328480] CPU12 attaching NULL sched-domain.
[ 2065.348007] CPU13 attaching NULL sched-domain.
[ 2065.367881] CPU14 attaching NULL sched-domain.
[ 2065.387806] CPU15 attaching NULL sched-domain.
[ 2065.408696] CPU16 attaching NULL sched-domain.
[ 2065.429661] CPU17 attaching NULL sched-domain.
[ 2065.450137] CPU18 attaching NULL sched-domain.
[ 2065.470181] CPU19 attaching NULL sched-domain.
[ 2065.490247] CPU20 attaching NULL sched-domain.
[ 2065.510268] CPU21 attaching NULL sched-domain.
[ 2065.530271] CPU22 attaching NULL sched-domain.
[ 2065.549990] CPU23 attaching NULL sched-domain.
[ 2065.570491] CPU4 attaching sched-domain(s):
[ 2065.589234]  domain-0: span=4,16 level=SMT
[ 2065.608129]   groups: 4:{ span=4 }, 16:{ span=16 }
[ 2065.632738]   domain-1: span=4-5,12-17 level=MC
[ 2065.653583]    groups: 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 5:{ span=5,17 cap=2048 }
[ 2065.712797]    domain-2: span=4-23 level=NUMA
[ 2065.732450]     groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2065.768069] CPU5 attaching sched-domain(s):
[ 2065.786815]  domain-0: span=5,17 level=SMT
[ 2065.805171]   groups: 5:{ span=5 }, 17:{ span=17 }
[ 2065.826666]   domain-1: span=4-5,12-17 level=MC
[ 2065.847598]    groups: 5:{ span=5,17 cap=2048 }, 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }
[ 2065.908928]    domain-2: span=4-23 level=NUMA
[ 2065.928541]     groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2065.964048] CPU6 attaching sched-domain(s):
[ 2065.982847]  domain-0: span=6,18 level=SMT
[ 2066.001195]   groups: 6:{ span=6 }, 18:{ span=18 }
[ 2066.022662]   domain-1: span=6-11,18-23 level=MC
[ 2066.043363]    groups: 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }
[ 2066.117136]    domain-2: span=4-23 level=NUMA
[ 2066.138911]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2066.176251] CPU7 attaching sched-domain(s):
[ 2066.194956]  domain-0: span=7,19 level=SMT
[ 2066.213268]   groups: 7:{ span=7 }, 19:{ span=19 }
[ 2066.234758]   domain-1: span=6-11,18-23 level=MC
[ 2066.255506]    groups: 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }
[ 2066.333206]    domain-2: span=4-23 level=NUMA
[ 2066.354844]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2066.391943] CPU8 attaching sched-domain(s):
[ 2066.411588]  domain-0: span=8,20 level=SMT
[ 2066.430725]   groups: 8:{ span=8 }, 20:{ span=20 }
[ 2066.453314]   domain-1: span=6-11,18-23 level=MC
[ 2066.474790]    groups: 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }
[ 2066.550703]    domain-2: span=4-23 level=NUMA
[ 2066.571121]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2066.608140] CPU9 attaching sched-domain(s):
[ 2066.628276]  domain-0: span=9,21 level=SMT
[ 2066.648919]   groups: 9:{ span=9 }, 21:{ span=21 }
[ 2066.671714]   domain-1: span=6-11,18-23 level=MC
[ 2066.693318]    groups: 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }
[ 2066.769741]    domain-2: span=4-23 level=NUMA
[ 2066.790103]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2066.827133] CPU10 attaching sched-domain(s):
[ 2066.846733]  domain-0: span=10,22 level=SMT
[ 2066.866596]   groups: 10:{ span=10 }, 22:{ span=22 }
[ 2066.890038]   domain-1: span=6-11,18-23 level=MC
[ 2066.911798]    groups: 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }
[ 2066.987077]    domain-2: span=4-23 level=NUMA
[ 2067.007529]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2067.044426] CPU11 attaching sched-domain(s):
[ 2067.064679]  domain-0: span=11,23 level=SMT
[ 2067.084379]   groups: 11:{ span=11 }, 23:{ span=23 }
[ 2067.107717]   domain-1: span=6-11,18-23 level=MC
[ 2067.129682]    groups: 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }
[ 2067.208701]    domain-2: span=4-23 level=NUMA
[ 2067.229349]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2067.266568] CPU12 attaching sched-domain(s):
[ 2067.286574]  domain-0: span=4-5,12-17 level=MC
[ 2067.307429]   groups: 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 5:{ span=5,17 cap=2048 }, 4:{ span=4,16 cap=2048 }
[ 2067.369514]   domain-1: span=4-23 level=NUMA
[ 2067.389672]    groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2067.426697] CPU13 attaching sched-domain(s):
[ 2067.446774]  domain-0: span=4-5,12-17 level=MC
[ 2067.467733]   groups: 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 5:{ span=5,17 cap=2048 }, 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }
[ 2067.529509]   domain-1: span=4-23 level=NUMA
[ 2067.549058]    groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2067.584511] CPU14 attaching sched-domain(s):
[ 2067.604587]  domain-0: span=4-5,12-17 level=MC
[ 2067.625414]   groups: 14:{ span=14 }, 15:{ span=15 }, 5:{ span=5,17 cap=2048 }, 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }
[ 2067.690597]   domain-1: span=4-23 level=NUMA
[ 2067.711008]    groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2067.747968] CPU15 attaching sched-domain(s):
[ 2067.768180]  domain-0: span=4-5,12-17 level=MC
[ 2067.789272]   groups: 15:{ span=15 }, 5:{ span=5,17 cap=2048 }, 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }
[ 2067.850446]   domain-1: span=4-23 level=NUMA
[ 2067.870364]    groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2067.906890] CPU16 attaching sched-domain(s):
[ 2067.926599]  domain-0: span=4,16 level=SMT
[ 2067.945785]   groups: 16:{ span=16 }, 4:{ span=4 }
[ 2067.967915]   domain-1: span=4-5,12-17 level=MC
[ 2067.988738]    groups: 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 5:{ span=5,17 cap=2048 }
[ 2068.051953]    domain-2: span=4-23 level=NUMA
[ 2068.073232]     groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2068.110619] CPU17 attaching sched-domain(s):
[ 2068.130562]  domain-0: span=5,17 level=SMT
[ 2068.151575]   groups: 17:{ span=17 }, 5:{ span=5 }
[ 2068.175826]   domain-1: span=4-5,12-17 level=MC
[ 2068.198735]    groups: 5:{ span=5,17 cap=2048 }, 4:{ span=4,16 cap=2048 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }
[ 2068.261146]    domain-2: span=4-23 level=NUMA
[ 2068.281962]     groups: 4:{ span=4-5,12-17 cap=8191 }, 6:{ span=6-11,18-23 cap=12288 }
[ 2068.319322] CPU18 attaching sched-domain(s):
[ 2068.339146]  domain-0: span=6,18 level=SMT
[ 2068.358347]   groups: 18:{ span=18 }, 6:{ span=6 }
[ 2068.380633]   domain-1: span=6-11,18-23 level=MC
[ 2068.402806]    groups: 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }
[ 2068.478160]    domain-2: span=4-23 level=NUMA
[ 2068.498578]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2068.535247] CPU19 attaching sched-domain(s):
[ 2068.555382]  domain-0: span=7,19 level=SMT
[ 2068.574604]   groups: 19:{ span=19 }, 7:{ span=7 }
[ 2068.597025]   domain-1: span=6-11,18-23 level=MC
[ 2068.618788]    groups: 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }
[ 2068.696622]    domain-2: span=4-23 level=NUMA
[ 2068.719269]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8191 }
[ 2068.756469] CPU20 attaching sched-domain(s):
[ 2068.776560]  domain-0: span=8,20 level=SMT
[ 2068.795615]   groups: 20:{ span=20 }, 8:{ span=8 }
[ 2068.818176]   domain-1: span=6-11,18-23 level=MC
[ 2068.839639]    groups: 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }
[ 2068.916211]    domain-2: span=4-23 level=NUMA
[ 2068.936660]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8190 }
[ 2068.974067] CPU21 attaching sched-domain(s):
[ 2068.994193]  domain-0: span=9,21 level=SMT
[ 2069.013239]   groups: 21:{ span=21 }, 9:{ span=9 }
[ 2069.035623]   domain-1: span=6-11,18-23 level=MC
[ 2069.057614]    groups: 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }
[ 2069.133291]    domain-2: span=4-23 level=NUMA
[ 2069.153505]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8190 }
[ 2069.191250] CPU22 attaching sched-domain(s):
[ 2069.211923]  domain-0: span=10,22 level=SMT
[ 2069.233571]   groups: 22:{ span=22 }, 10:{ span=10 }
[ 2069.257181]   domain-1: span=6-11,18-23 level=MC
[ 2069.278721]    groups: 10:{ span=10,22 cap=2048 }, 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }
[ 2069.353692]    domain-2: span=4-23 level=NUMA
[ 2069.374146]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8190 }
[ 2069.409813] CPU23 attaching sched-domain(s):
[ 2069.428983]  domain-0: span=11,23 level=SMT
[ 2069.448784]   groups: 23:{ span=23 }, 11:{ span=11 }
[ 2069.472171]   domain-1: span=6-11,18-23 level=MC
[ 2069.493741]    groups: 11:{ span=11,23 cap=2048 }, 6:{ span=6,18 cap=2048 }, 7:{ span=7,19 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }
[ 2069.569591]    domain-2: span=4-23 level=NUMA
[ 2069.589904]     groups: 6:{ span=6-11,18-23 cap=12288 }, 4:{ span=4-5,12-17 cap=8190 }
[ 2069.627511] root domain span: 4-23 (max cpu_capacity = 1024)
[ 2069.653886] CPU0 attaching sched-domain(s):
[ 2069.673181]  domain-0: span=0-3 level=MC
[ 2069.690983]   groups: 0:{ span=0 cap=1021 }, 1:{ span=1 cap=1021 }, 2:{ span=2 cap=1021 }, 3:{ span=3 cap=1021 }
[ 2069.740809] CPU1 attaching sched-domain(s):
[ 2069.761390]  domain-0: span=0-3 level=MC
[ 2069.779651]   groups: 1:{ span=1 cap=1021 }, 2:{ span=2 cap=1021 }, 3:{ span=3 cap=1021 }, 0:{ span=0 cap=1021 }
[ 2069.827249] CPU2 attaching sched-domain(s):
[ 2069.846702]  domain-0: span=0-3 level=MC
[ 2069.865039]   groups: 2:{ span=2 cap=1021 }, 3:{ span=3 cap=1021 }, 0:{ span=0 cap=1021 }, 1:{ span=1 cap=1021 }
[ 2069.912910] CPU3 attaching sched-domain(s):
[ 2069.932127]  domain-0: span=0-3 level=MC
[ 2069.950157]   groups: 3:{ span=3 cap=1021 }, 0:{ span=0 cap=1021 }, 1:{ span=1 cap=1021 }, 2:{ span=2 cap=1021 }
[ 2069.997807] root domain span: 0-3 (max cpu_capacity = 1024)
[ 2070.024024] rd 4-23: CPUs do not have asymmetric capacities
[ 2070.050039] rd 0-3: CPUs do not have asymmetric capacities
</console output>

# echo root > user.slice/cpuset.cpus.partition
<console output>
[ 2081.053069] CPU4 attaching NULL sched-domain.
[ 2081.075599] CPU5 attaching NULL sched-domain.
[ 2081.099010] CPU6 attaching NULL sched-domain.
[ 2081.118656] CPU7 attaching NULL sched-domain.
[ 2081.138228] CPU8 attaching NULL sched-domain.
[ 2081.157873] CPU9 attaching NULL sched-domain.
[ 2081.177473] CPU10 attaching NULL sched-domain.
[ 2081.197530] CPU11 attaching NULL sched-domain.
[ 2081.217446] CPU12 attaching NULL sched-domain.
[ 2081.237440] CPU13 attaching NULL sched-domain.
[ 2081.257429] CPU14 attaching NULL sched-domain.
[ 2081.277316] CPU15 attaching NULL sched-domain.
[ 2081.297507] CPU16 attaching NULL sched-domain.
[ 2081.318336] CPU17 attaching NULL sched-domain.
[ 2081.338452] CPU18 attaching NULL sched-domain.
[ 2081.358368] CPU19 attaching NULL sched-domain.
[ 2081.378312] CPU20 attaching NULL sched-domain.
[ 2081.398201] CPU21 attaching NULL sched-domain.
[ 2081.418069] CPU22 attaching NULL sched-domain.
[ 2081.438854] CPU23 attaching NULL sched-domain.
[ 2081.460875] CPU8 attaching sched-domain(s):
[ 2081.480898]  domain-0: span=8,20 level=SMT
[ 2081.501416]   groups: 8:{ span=8 }, 20:{ span=20 }
[ 2081.523910]   domain-1: span=8-11,18-23 level=MC
[ 2081.545515]    groups: 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }
[ 2081.615300]    domain-2: span=8-23 level=NUMA
[ 2081.636878]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2081.672788] CPU9 attaching sched-domain(s):
[ 2081.692310]  domain-0: span=9,21 level=SMT
[ 2081.711629]   groups: 9:{ span=9 }, 21:{ span=21 }
[ 2081.734382]   domain-1: span=8-11,18-23 level=MC
[ 2081.755557]    groups: 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }
[ 2081.823112]    domain-2: span=8-23 level=NUMA
[ 2081.843605]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2081.879395] CPU10 attaching sched-domain(s):
[ 2081.899202]  domain-0: span=10,22 level=SMT
[ 2081.918957]   groups: 10:{ span=10 }, 22:{ span=22 }
[ 2081.942458]   domain-1: span=8-11,18-23 level=MC
[ 2081.963996]    groups: 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }
[ 2082.032156]    domain-2: span=8-23 level=NUMA
[ 2082.052593]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2082.088927] CPU11 attaching sched-domain(s):
[ 2082.111092]  domain-0: span=11,23 level=SMT
[ 2082.132082]   groups: 11:{ span=11 }, 23:{ span=23 }
[ 2082.155729]   domain-1: span=8-11,18-23 level=MC
[ 2082.177219]    groups: 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }
[ 2082.243382]    domain-2: span=8-23 level=NUMA
[ 2082.263769]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2082.299343] CPU12 attaching sched-domain(s):
[ 2082.319263]  domain-0: span=12-17 level=MC
[ 2082.338481]   groups: 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }
[ 2082.391369]   domain-1: span=8-23 level=NUMA
[ 2082.411322]    groups: 12:{ span=12-17 cap=6143 }, 8:{ span=8-11,18-23 cap=10240 }
[ 2082.446530] CPU13 attaching sched-domain(s):
[ 2082.466384]  domain-0: span=12-17 level=MC
[ 2082.485899]   groups: 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 12:{ span=12 cap=1023 }
[ 2082.539177]   domain-1: span=8-23 level=NUMA
[ 2082.559124]    groups: 12:{ span=12-17 cap=6143 }, 8:{ span=8-11,18-23 cap=10240 }
[ 2082.594698] CPU14 attaching sched-domain(s):
[ 2082.615391]  domain-0: span=12-17 level=MC
[ 2082.635329]   groups: 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }
[ 2082.691369]   domain-1: span=8-23 level=NUMA
[ 2082.711422]    groups: 12:{ span=12-17 cap=6143 }, 8:{ span=8-11,18-23 cap=10240 }
[ 2082.747206] CPU15 attaching sched-domain(s):
[ 2082.766867]  domain-0: span=12-17 level=MC
[ 2082.786033]   groups: 15:{ span=15 }, 16:{ span=16 }, 17:{ span=17 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }
[ 2082.838628]   domain-1: span=8-23 level=NUMA
[ 2082.858634]    groups: 12:{ span=12-17 cap=6143 }, 8:{ span=8-11,18-23 cap=10240 }
[ 2082.894158] CPU16 attaching sched-domain(s):
[ 2082.914163]  domain-0: span=12-17 level=MC
[ 2082.933402]   groups: 16:{ span=16 }, 17:{ span=17 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }
[ 2082.986115]   domain-1: span=8-23 level=NUMA
[ 2083.006007]    groups: 12:{ span=12-17 cap=6143 }, 8:{ span=8-11,18-23 cap=10240 }
[ 2083.041417] CPU17 attaching sched-domain(s):
[ 2083.061327]  domain-0: span=12-17 level=MC
[ 2083.080229]   groups: 17:{ span=17 }, 12:{ span=12 cap=1023 }, 13:{ span=13 }, 14:{ span=14 }, 15:{ span=15 }, 16:{ span=16 }
[ 2083.135330]   domain-1: span=8-23 level=NUMA
[ 2083.156771]    groups: 12:{ span=12-17 cap=6143 }, 8:{ span=8-11,18-23 cap=10240 }
[ 2083.191841] CPU18 attaching sched-domain(s):
[ 2083.214122]  domain-0: span=8-11,18-23 level=MC
[ 2083.235873]   groups: 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }
[ 2083.303531]   domain-1: span=8-23 level=NUMA
[ 2083.324913]    groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2083.360203] CPU19 attaching sched-domain(s):
[ 2083.380217]  domain-0: span=8-11,18-23 level=MC
[ 2083.401798]   groups: 19:{ span=19 }, 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }
[ 2083.469597]   domain-1: span=8-23 level=NUMA
[ 2083.489616]    groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2083.525221] CPU20 attaching sched-domain(s):
[ 2083.545898]  domain-0: span=8,20 level=SMT
[ 2083.565089]   groups: 20:{ span=20 }, 8:{ span=8 }
[ 2083.587608]   domain-1: span=8-11,18-23 level=MC
[ 2083.609570]    groups: 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }
[ 2083.681417]    domain-2: span=8-23 level=NUMA
[ 2083.701928]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2083.738226] CPU21 attaching sched-domain(s):
[ 2083.758316]  domain-0: span=9,21 level=SMT
[ 2083.777380]   groups: 21:{ span=21 }, 9:{ span=9 }
[ 2083.799826]   domain-1: span=8-11,18-23 level=MC
[ 2083.821710]    groups: 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }
[ 2083.889467]    domain-2: span=8-23 level=NUMA
[ 2083.909734]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2083.946186] CPU22 attaching sched-domain(s):
[ 2083.966254]  domain-0: span=10,22 level=SMT
[ 2083.985820]   groups: 22:{ span=22 }, 10:{ span=10 }
[ 2084.009046]   domain-1: span=8-11,18-23 level=MC
[ 2084.030645]    groups: 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }, 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }
[ 2084.097835]    domain-2: span=8-23 level=NUMA
[ 2084.118194]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2084.154828] CPU23 attaching sched-domain(s):
[ 2084.175535]  domain-0: span=11,23 level=SMT
[ 2084.197452]   groups: 23:{ span=23 }, 11:{ span=11 }
[ 2084.221291]   domain-1: span=8-11,18-23 level=MC
[ 2084.243302]    groups: 11:{ span=11,23 cap=2048 }, 8:{ span=8,20 cap=2048 }, 9:{ span=9,21 cap=2048 }, 10:{ span=10,22 cap=2048 }, 18:{ span=18 }, 19:{ span=19 }
[ 2084.310699]    domain-2: span=8-23 level=NUMA
[ 2084.330518]     groups: 8:{ span=8-11,18-23 cap=10240 }, 12:{ span=12-17 cap=6143 }
[ 2084.367462] root domain span: 8-23 (max cpu_capacity = 1024)
[ 2084.393985] CPU4 attaching sched-domain(s):
[ 2084.413605]  domain-0: span=4-5 level=MC
[ 2084.432036]   groups: 4:{ span=4 }, 5:{ span=5 }
[ 2084.453398]   domain-1: span=4-7 level=NUMA
[ 2084.472466]    groups: 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }
[ 2084.503541] CPU5 attaching sched-domain(s):
[ 2084.523156]  domain-0: span=4-5 level=MC
[ 2084.541497]   groups: 5:{ span=5 }, 4:{ span=4 }
[ 2084.563147]   domain-1: span=4-7 level=NUMA
[ 2084.582326]    groups: 4:{ span=4-5 cap=2048 }, 6:{ span=6-7 cap=2048 }
[ 2084.613436] CPU6 attaching sched-domain(s):
[ 2084.632344]  domain-0: span=6-7 level=MC
[ 2084.650118]   groups: 6:{ span=6 }, 7:{ span=7 }
[ 2084.671654]   domain-1: span=4-7 level=NUMA
[ 2084.693049]    groups: 6:{ span=6-7 cap=2048 }, 4:{ span=4-5 cap=2048 }
[ 2084.725452] CPU7 attaching sched-domain(s):
[ 2084.745359]  domain-0: span=6-7 level=MC
[ 2084.763677]   groups: 7:{ span=7 }, 6:{ span=6 }
[ 2084.785247]   domain-1: span=4-7 level=NUMA
[ 2084.804951]    groups: 6:{ span=6-7 cap=2048 }, 4:{ span=4-5 cap=2048 }
[ 2084.835754] root domain span: 4-7 (max cpu_capacity = 1024)
[ 2084.861802] rd 8-23: CPUs do not have asymmetric capacities
[ 2084.887977] rd 0-3: CPUs do not have asymmetric capacities
[ 2084.913545] rd 4-7: CPUs do not have asymmetric capacities
</console output>

So, with this we end up with 3 separate sched/root domains.

Load balancing will then be performed inside those domains, but not across them.
If one wants effectively isolated cpus, I think it should be a matter of
creating another level (or a different one) of hierarchy with single-cpu
root partitions.

Comment 3 Daniel Berrangé 2021-04-12 12:51:14 UTC
(In reply to Juri Lelli from comment #2)
> Hi,
> 
> as a self refresher, I played a bit with cgroup v2 cpusets on a F33 box by
> taking kernel
> documentation as reference:
> 
> https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/
> cgroup-v2.rst#L1902
> 
> tl;dr; my current understanding is that appropriately using
> cpuset.cpus.partition(s)
> should be sufficient to replicate v1 sched_load_balance disabled exclusive
> cpusets.

snip

> Load balancing will then be performed inside those domains, but not across.
> If one wants effectively isolated cpus, I think it should be a matter of
> creating another level (or a different) of hierarchy with single cpus
> root partitions.

IIUC, sched_load_balance=0 gives the ability to disable load balancing across a set of CPUs with only a single division of the cpuset hierarchy.

ie I believe it is possible to do something like this:

  cd /sys/fs/cgroups/cpuset
  mkdir isolated.slice
  echo 0-4 > {system,user,machine}.slice/cpuset.cpus
  echo 4-15 > isolated.slice/cpuset.cpus
  echo 0 > isolated.slice/cpuset.sched_load_balance
  cd isolated.slice
  mkdir vm{1,2,3}
  echo 4-7 > vm1/cpuset.cpus
  echo 7-11 > vm2/cpuset.cpus
  echo 12-15 > vm3/cpuset.cpus


to get CPUs 4-15 excluded from load balancing, and then assign 4 CPUs each to three virtual machines.

IIUC, from your description here, fully disabling load balancing requires one level per host CPU, so to get the equivalent we need to do

  cd /sys/fs/cgroups
  mkdir isolated.slice
  echo 0-4 > {system,user,machine}.slice/cpuset.cpus
  echo 4-15 > isolated.slice/cpuset.cpus
  mkdir vm{1,2,3}
  mkdir vm{1,2,3}/vcpu{0,1,2,3}

  echo 4 > vm1/vcpu0/cpuset.cpus
  echo 5 > vm1/vcpu1/cpuset.cpus
  echo 6 > vm1/vcpu2/cpuset.cpus
  echo 7 > vm1/vcpu3/cpuset.cpus

  echo 8 > vm2/vcpu0/cpuset.cpus
  echo 9 > vm2/vcpu1/cpuset.cpus
  echo 10 > vm2/vcpu2/cpuset.cpus
  echo 11 > vm2/vcpu3/cpuset.cpus

  echo 12 > vm3/vcpu0/cpuset.cpus
  echo 13 > vm3/vcpu1/cpuset.cpus
  echo 14 > vm3/vcpu2/cpuset.cpus
  echo 15 > vm3/vcpu3/cpuset.cpus

  for i in vm{1,2,3}/vcpu{0,1,2,3}
  do
     echo root > $i/cpuset.cpus.partition
  done


QEMU will be in isolated.slice/vmNNN for all controllers except cpuset. For cpuset, we'll need to put each QEMU vCPU thread into an isolated.slice/vmNNN/cpuMMM

This isn't as terrible as it sounds, since libvirt already in fact creates one sub-dir in cgroups for each guest vCPU so we can do per-vCPU accounting.

Comment 4 Pavel Hrdina 2021-04-12 16:40:09 UTC
(In reply to Daniel Berrangé from comment #3)
> (In reply to Juri Lelli from comment #2)
> > Hi,
> > 
> > as a self refresher, I played a bit with cgroup v2 cpusets on a F33 box by
> > taking kernel
> > documentation as reference:
> > 
> > https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/
> > cgroup-v2.rst#L1902
> > 
> > tl;dr; my current understanding is that appropriately using
> > cpuset.cpus.partition(s)
> > should be sufficient to replicate v1 sched_load_balance disabled exclusive
> > cpusets.
> 
> snip
> 
> > Load balancing will then be performed inside those domains, but not across.
> > If one wants effectively isolated cpus, I think it should be a matter of
> > creating another level (or a different) of hierarchy with single cpus
> > root partitions.
> 
> IIUC,  sched_load_balance=0 gives the ability to disable load balancing
> across a set of CPUs with only a single division of the cpuset hiearchy.
> 
> ie I believe it is possible to do something like this:
> 
>   cd /sys/fs/cgroups/cpuset
>   mkdir isolated.slice
>   echo 0-4 > {system,user,machine}.slice/cpuset.cpus
>   echo 4-15 > isolated.slice/cpuset.cpus
>   echo 0 > isolated.slice/cpuset.sched_load_balance
>   cd isolated.slice
>   mkdir vm{1,2,3}
>   echo 4-7 > vm1/cpuset.cpus
>   echo 7-11 > vm2/cpuset.cpus
>   echo 12-15 > vm3/cpuset.cpus

Unfortunately this will not work. Looking at the kernel documentation [1], specifically this paragraph:

"So, for example, if the top cpuset has the flag “cpuset.sched_load_balance” enabled, then the scheduler will have one sched domain covering all CPUs, and the setting of the “cpuset.sched_load_balance” flag in any other cpusets won’t matter, as we’re already fully load balancing."

The kernel will simply ignore disabling the load balancer in "isolated.slice", and it doesn't produce any error.
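
(To make the quoted rule concrete: with v1, load balancing would have to be disabled at the top cpuset first, and only then does the flag in a child like isolated.slice have any effect. A rough, untested sketch, with illustrative names and CPU ranges, and ignoring the need to move tasks into the balanced child:

  cd /sys/fs/cgroup/cpuset
  echo 0 > cpuset.sched_load_balance                    # must happen at the top cpuset first
  mkdir housekeeping isolated.slice
  cat cpuset.mems > housekeeping/cpuset.mems
  cat cpuset.mems > isolated.slice/cpuset.mems
  echo 0-3 > housekeeping/cpuset.cpus
  echo 1 > housekeeping/cpuset.sched_load_balance       # balanced domain for everything else
  echo 4-15 > isolated.slice/cpuset.cpus
  echo 0 > isolated.slice/cpuset.sched_load_balance     # now this flag actually takes effect

Whether disabling balancing at the top cpuset is acceptable for the rest of the system is a separate question.)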

> to get CPUs 4-15 excluded from load balancing, and then assign 4 CPUs each
> to three virtual machines.
> 
> IIUC, from your description here, fully disabling load balancing requires
> one level per host CPU, so to get the equivalent we need todo
> 
>   cd /sys/fs/cgroups
>   mkdir isolated.slice
>   echo 0-4 > {system,user,machine}.slice/cpuset.cpus
>   echo 4-15 > isolated.slice/cpuset.cpus
>   mkdir vm{1,2,3}
>   mkdir vm{1,2,3}/vcpu{0,1,2,3}
> 
>   echo 4 > vm1/vcpu0/cpuset.cpus
>   echo 5 > vm1/vcpu1/cpuset.cpus
>   echo 6 > vm1/vcpu2/cpuset.cpus
>   echo 7 > vm1/vcpu3/cpuset.cpus
> 
>   echo 8 > vm2/vcpu0/cpuset.cpus
>   echo 9 > vm2/vcpu1/cpuset.cpus
>   echo 10 > vm2/vcpu2/cpuset.cpus
>   echo 11 > vm2/vcpu3/cpuset.cpus
>
>   echo 12 > vm3/vcpu0/cpuset.cpus
>   echo 13 > vm3/vcpu1/cpuset.cpus
>   echo 14 > vm3/vcpu2/cpuset.cpus
>   echo 15 > vm3/vcpu3/cpuset.cpus
> 
>   for i in vm{1,2,3}/vcpu{0,1,2,3}
>   do
>      echo root > $i/cpuset.cpus.partition
>   done

This will not work either. Reading the kernel documentation [2], specifically this paragraph:

"A parent partition cannot distribute all its CPUs to its child partitions. There must be at least one cpu left in the parent partition."

This would translate to this topology:

/sys/fs/cgroup
    machine.slice/
        cpuset.cpus 0-2
    system.slice/
        cpuset.cpus 0-2
    user.slice/
        cpuset.cpus 0-2
    isolated.slice/
        cpuset.cpus 3-15
        cpuset.cpus.partition root
        vm1/
            cpuset.cpus 4-7
            cpuset.cpus.partition root
            emulator/
            vcpu0/
                cpuset.cpus 5
            vcpu1/
                cpuset.cpus 6
            vcpu2/
                cpuset.cpus 7
        vm2/
            cpuset.cpus 8-11
            cpuset.cpus.partition root
            emulator/
            vcpu0/
                cpuset.cpus 9
            vcpu1/
                cpuset.cpus 10
            vcpu2/
                cpuset.cpus 11
        vm3/
            cpuset.cpus 12-15
            cpuset.cpus.partition root
            emulator/
            vcpu0/
                cpuset.cpus 13
            vcpu1/
                cpuset.cpus 14
            vcpu2/
                cpuset.cpus 15


Looking at the example, the limitation is obvious: every child cgroup with cpuset.cpus.partition set to root consumes 1 host CPU, and this creates number_of_vms * (number_of_vcpus + 1) + 1 scheduling domains.

The question now is how effective this is and how much overhead so many scheduling domains will create.

In addition, if users would like/need to have a more nested topology, that would consume some additional host CPUs, which would limit the number of VMs that can run on that host compared to using "isolcpus=".

With the topology above, it would consume only 1 additional host CPU to cover the "isolated.slice" cgroup.


> QEMU will be in isolated.slice/vmNNN for all controllers, except cpuset. For
> cpuset, we'll need to put each QEMU vCPU thread into a
> isolated.slice/vmNNN/cpuMMM
> 
> This isn't as terrible as it sounds, since libvirt already in fact creates
> one sub-dir in cgroups for each guest vCPU so we can do per-vCPU accounting.

Would it be possible to have the logic inverted, as in cgroups v1, so it would behave similarly to "isolcpus=" on the kernel cmdline?
That way the solution Daniel suggested for cgroups v1 would work and would be fairly simple to implement in libvirt,
and having the same option in cgroups v2 would be ideal as well.

[1] <https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v1/cpusets.html#what-is-sched-load-balance>
[2] <https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#cpuset-interface-files>

Comment 5 Juri Lelli 2021-04-12 16:54:55 UTC
(In reply to Pavel Hrdina from comment #4)
> (In reply to Daniel Berrangé from comment #3)
> > (In reply to Juri Lelli from comment #2)
> > > Hi,
> > > 
> > > as a self refresher, I played a bit with cgroup v2 cpusets on a F33 box by
> > > taking kernel
> > > documentation as reference:
> > > 
> > > https://elixir.bootlin.com/linux/latest/source/Documentation/admin-guide/
> > > cgroup-v2.rst#L1902
> > > 
> > > tl;dr; my current understanding is that appropriately using
> > > cpuset.cpus.partition(s)
> > > should be sufficient to replicate v1 sched_load_balance disabled exclusive
> > > cpusets.
> > 
> > snip
> > 
> > > Load balancing will then be performed inside those domains, but not across.
> > > If one wants effectively isolated cpus, I think it should be a matter of
> > > creating another level (or a different) of hierarchy with single cpus
> > > root partitions.
> > 
> > IIUC,  sched_load_balance=0 gives the ability to disable load balancing
> > across a set of CPUs with only a single division of the cpuset hiearchy.
> > 
> > ie I believe it is possible to do something like this:
> > 
> >   cd /sys/fs/cgroups/cpuset
> >   mkdir isolated.slice
> >   echo 0-4 > {system,user,machine}.slice/cpuset.cpus
> >   echo 4-15 > isolated.slice/cpuset.cpus
> >   echo 0 > isolated.slice/cpuset.sched_load_balance
> >   cd isolated.slice
> >   mkdir vm{1,2,3}
> >   echo 4-7 > vm1/cpuset.cpus
> >   echo 7-11 > vm2/cpuset.cpus
> >   echo 12-15 > vm3/cpuset.cpus
> 
> Unfortunately this will not work. Looking at kernel documentation [1]
> specifically this paragraph:
> 
> "So, for example, if the top cpuset has the flag “cpuset.sched_load_balance”
> enabled, then the scheduler will have one sched domain covering all CPUs,
> and the setting of the “cpuset.sched_load_balance” flag in any other cpusets
> won’t matter, as we’re already fully load balancing."
> 
> kernel will simply ignore disabling the load balancer in "isolated.slice"
> and it doesn't produce any error.
> 
> > to get CPUs 4-15 excluded from load balancing, and then assign 4 CPUs each
> > to three virtual machines.
> > 
> > IIUC, from your description here, fully disabling load balancing requires
> > one level per host CPU, so to get the equivalent we need todo
> > 
> >   cd /sys/fs/cgroups
> >   mkdir isolated.slice
> >   echo 0-4 > {system,user,machine}.slice/cpuset.cpus
> >   echo 4-15 > isolated.slice/cpuset.cpus
> >   mkdir vm{1,2,3}
> >   mkdir vm{1,2,3}/vcpu{0,1,2,3}
> > 
> >   echo 4 > vm1/vcpu0/cpuset.cpus
> >   echo 5 > vm1/vcpu1/cpuset.cpus
> >   echo 6 > vm1/vcpu2/cpuset.cpus
> >   echo 7 > vm1/vcpu3/cpuset.cpus
> > 
> >   echo 8 > vm2/vcpu0/cpuset.cpus
> >   echo 9 > vm2/vcpu1/cpuset.cpus
> >   echo 10 > vm2/vcpu2/cpuset.cpus
> >   echo 11 > vm2/vcpu3/cpuset.cpus
> >
> >   echo 12 > vm3/vcpu0/cpuset.cpus
> >   echo 13 > vm3/vcpu1/cpuset.cpus
> >   echo 14 > vm3/vcpu2/cpuset.cpus
> >   echo 15 > vm3/vcpu3/cpuset.cpus
> > 
> >   for i in vm{1,2,3}/vcpu{0,1,2,3}
> >   do
> >      echo root > $i/cpuset.cpus.partition
> >   done
> 
> This will not work as well. Reading kernel documentation [2] specifically
> this paragraph:
> 
> "A parent partition cannot distribute all its CPUs to its child partitions.
> There must be at least one cpu left in the parent partition."
> 
> This would transfer to this topology:
> 
> /sys/fs/cgroup
>     machine.slice/
>         cpuset.cpus 0-2
>     system.slice/
>         cpuset.cpus 0-2
>     user.slice/
>         cpuset.cpus 0-2
>     isolated.slice/
>         cpuset.cpus 3-15
>         cpuset.cpus.partition root
>         vm1/
>             cpuset.cpus 4-7
>             cpuset.cpus.partition root
>             emulator/
>             vcpu0/
>                 cpuset.cpus 5
>             vcpu1/
>                 cpuset.cpus 6
>             vcpu2/
>                 cpuset.cpus 7
>         vm2/
>             cpuset.cpus 8-11
>             cpuset.cpus.partition root
>             emulator/
>             vcpu0/
>                 cpuset.cpus 9
>             vcpu1/
>                 cpuset.cpus 10
>             vcpu2/
>                 cpuset.cpus 11
>         vm3/
>             cpuset.cpus 12-15
>             cpuset.cpus.partition root
>             emulator/
>             vcpu0/
>                 cpuset.cpus 13
>             vcpu1/
>                 cpuset.cpus 14
>             vcpu2/
>                 cpuset.cpus 15
> 
> 
> Looking at the example the limitation is obvious, every child-cgroup with
> cpuset.cpus.partition set to root consumes 1 host CPU and creates
> number_of_vms * (number_of_vcpus + 1) + 1 scheduling domains.

An alternative would be to flatten the hierarchy, I'm thinking, e.g.

 isolated.slice/
   vm1-vcpu{0,1,2}
      cpuset.cpus.partition root 
      cpuset.cpus {4,5,6}
   vm2-vcpu{0,1,2}
      cpuset.cpus.partition root 
      cpuset.cpus {7,8,9}
   vm3-vcpu{0,1,2}
      cpuset.cpus.partition root 
      cpuset.cpus {10,11,12}
   etc. (same for emulators cpusets)

It should be "saving" cpus and, in general, deep cgroups hierarchies might
introduce overhead. Not sure it is flexible enough, though.

Comment 6 Daniel Berrangé 2021-04-12 17:01:10 UTC
(In reply to Juri Lelli from comment #5)
> An alternative would be to flatten the hierarchy, I'm thinking, e.g.
> 
>  isolated.slice/
>    vm1-vcpu{0,1,2}
>       cpuset.cpus.partition root 
>       cpuset.cpus {4,5,6}
>    vm2-vcpu{0,1,2}
>       cpuset.cpus.partition root 
>       cpuset.cpus {7,8,9}
>    vm3-vcpu{0,1,2}
>       cpuset.cpus.partition root 
>       cpuset.cpus {10,11,12}
>    etc. (same for emulators cpusets)
> 
> It should be "saving" cpus and, in general, deep cgroups hierarchies might
> introduce overhead. Not sure it is flexible enough, though.

That's not viable for libvirt - we need to have 1 top-level cgroup per VM, because we use cgroups for more than just the "cpuset" controller and need to have the same hierarchy for all controllers, and we already need one 2nd-level cgroup per vCPU for the existing placement and accounting functionality we use.

Comment 7 Pavel Hrdina 2021-04-12 17:39:36 UTC
(In reply to Daniel Berrangé from comment #6)
> (In reply to Juri Lelli from comment #5)
> > An alternative would be to flatten the hierarchy, I'm thinking, e.g.
> > 
> >  isolated.slice/
> >    vm1-vcpu{0,1,2}
> >       cpuset.cpus.partition root 
> >       cpuset.cpus {4,5,6}
> >    vm2-vcpu{0,1,2}
> >       cpuset.cpus.partition root 
> >       cpuset.cpus {7,8,9}
> >    vm3-vcpu{0,1,2}
> >       cpuset.cpus.partition root 
> >       cpuset.cpus {10,11,12}
> >    etc. (same for emulators cpusets)
> > 
> > It should be "saving" cpus and, in general, deep cgroups hierarchies might
> > introduce overhead. Not sure it is flexible enough, though.
> 
> That's not viable for libvirt - we need to have 1 top level cgroup per-VM,
> because we use cgroups for more than just the "cpuset" controller, and need
> to have the same hierarchy for all controllers, and we already need to have
> one 2nd level cgroup per vCPU for existing placement and accounting
> functionality we use.

To make things a bit more complicated, on systemd hosts we need an additional level for each VM. The main reasons are that we use systemd-machined to create the VM root cgroup for us, and the way cgroup delegation works. See [1] for more details.
The other solution was to implement everything that libvirt needs in systemd and use only systemd DBus APIs to configure cgroups.

[1] <https://gitlab.com/libvirt/libvirt/-/commit/184245f53b94fc84f727eb6e8a2aa52df02d69c0>

Comment 8 Marcelo Tosatti 2021-04-13 13:52:27 UTC
For VMs and the KVM-RT use-case, nohz_full= is used (and nohz_full should be activated
when poll-type applications execute).  Therefore there should be no scheduler tick
(so load balancing overhead is not an issue, at least for the isolated cpus).

Comment 9 Phil Auld 2021-04-14 17:08:38 UTC
This is a ticket that deals with making cgroupv2 the default in rhel-9.  I don't think it's fully decided.

  https://issues.redhat.com/browse/RHELBU-197

I did a little experimenting with upstream kernels. From what I can tell the "isolcpus=domain,<cpulist>" is not going away and works as expected. No sched domains are created
that include those cpus and so there is no load balancing. 

The real change here is the tuned cpu-partitioning profile which uses the domain flags (as a separately configured list of CPUs). I guess that was there for CPUs that are not otherwise isolated since there is no flags field for isolated CPUs.

Layered products should still be able to use the static boot time isolation setup for rhel-9.  Making that dynamic is a future feature but beyond the scope of this bz.

Comment 10 Phil Auld 2021-04-20 14:33:37 UTC
I was mistaken about what tuned does. So yes, it would likely need to change to use isolcpus=domain for the no load balance cores setting. 

There is a bz for tuned about this, which I thought I had linked here already but didn't see in a quick scan: Bz1874596

Comment 11 Waiman Long 2021-04-22 19:13:54 UTC
I would like to chime in on this topic of cpuset v2. I am the one who submitted the cpuset v2 patch upstream. I originally wanted to keep the load_balancing flag as noted in comment #1. However, it was rejected by Peter, as people who want to use isolated cpus can use the "isolcpus" command line option, which works by removing those cpus from the ones available in the cpuset of the root cgroup.

The partition feature of cpuset was added to allow people to set up separate sched-domains. This is somewhat similar to what you can do with v1's load_balance and exclusive flags. In v1, however, you have to disable load balancing at the root cgroup before you can create separate sched-domains at the child sub-directory level. Disabling load balancing at the root cgroup level can be problematic. The partition feature allows users to create sched-domains in a cleaner way.

The reason why the parent of a partition has to have at least one cpu is that the parent itself will also be a separate sched-domain, and we cannot have a sched domain without a cpu associated with it. It is certainly not possible for the root cgroup. However, it can be argued that we may allow that in a non-root parent partition cgroup as long as there is no task associated with that cgroup. That does require more checking in the cpuset code to make sure that this is really the case. Seeing no such need at the time, this was not done. So if there is a good use case for a parent partition with no cpu, we may be able to relax this limitation.

Finally, the only way to disable load balancing with cpuset v2 is to create a number of 1-cpu partitions.
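
For instance, a rough (untested) sketch of that pattern on the v2 unified hierarchy; the names and CPU numbers are illustrative only, and it assumes no sibling cgroup at the same level claims the same CPUs. Note that the parent partition keeps one cpu (3) for itself, per the constraint described above:

  cd /sys/fs/cgroup
  echo "+cpuset" > cgroup.subtree_control
  mkdir isolated.slice
  echo 3-7 > isolated.slice/cpuset.cpus
  echo root > isolated.slice/cpuset.cpus.partition
  echo "+cpuset" > isolated.slice/cgroup.subtree_control
  for cpu in 4 5 6 7; do
      mkdir isolated.slice/cpu$cpu
      echo $cpu > isolated.slice/cpu$cpu/cpuset.cpus
      echo root > isolated.slice/cpu$cpu/cpuset.cpus.partition   # 1-cpu partition: no balancing
  done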

-Longman

Comment 12 Phil Auld 2021-05-03 14:38:30 UTC
I went ahead and marked this triaged. I'm not sure it's going to end up as a scheduler bug though. I suspect it should get moved, but I'm not sure exactly to where...

Comment 13 Waiman Long 2021-05-03 14:55:47 UTC
(In reply to Phil Auld from comment #12)
> I went ahead and marked this triaged. I'm not sure it's going to end up as a
> scheduler bug though. I suspect it should get moved, but I'm not sure
> exactly to where...

You can move this BZ under control group. It is not really a scheduler problem.

However, I would like to emphasize that it is not clear to me what exactly the submitter wants from the kernel, so I am not sure what we can do.

-Longman

Comment 14 Waiman Long 2021-05-20 14:37:42 UTC
So what exactly is the requirement for cpuset v2 in RHEL9? Just adding a load_balance flag like that in cgroup v1 is probably not going to fly.

-Longman

Comment 15 Martin Sivák 2021-05-20 14:49:05 UTC
(In reply to Waiman Long from comment #11)
> as people who wanted to use isolated cpus could use the "isolcpus"
> command line option which work removing cpus that are available in the
> cpuset of the root cgroup.


(In reply to Phil Auld from comment #9)
> Layered products should still be able to use the static boot time isolation
> setup for rhel-9.  Making that dynamic is a future feature but beyond the
> scope of this bz.


Sadly, this is not an option for OpenShift. The kubernetes cpu manager logic is dynamic. It allocates cpus for containers as they are started and reclaims them when they end. Static partitioning does not really work for this use case.

Comment 16 Waiman Long 2021-05-20 14:57:45 UTC
(In reply to Martin Sivák from comment #15)
> (In reply to Waiman Long from comment #11)
> > as people who wanted to use isolated cpus could use the "isolcpus"
> > command line option which work removing cpus that are available in the
> > cpuset of the root cgroup.
> 
> 
> (In reply to Phil Auld from comment #9)
> > Layered products should still be able to use the static boot time isolation
> > setup for rhel-9.  Making that dynamic is a future feature but beyond the
> > scope of this bz.
> 
> 
> Sadly, this not an option for OpenShift. The kubernetes cpu manager logic is
> dynamic. It allocates cpus for containers as they are started and reclaims
> them when then end. Static partitioning does not really work for this use
> case.

So OpenShift wants the ability to dynamically disable load balancing within a partition, right? If that is the case, we need a good use case to convince upstream that it is a feature worth having.

Cheers,
Longman

Comment 17 Martin Sivák 2021-05-21 13:38:57 UTC
I just double checked what we do and OCP disables load balancing per-cpu using /proc/sys/kernel/sched_domain/cpuX/.../flags (bit 0). Will that keep working?

Comment 18 Juri Lelli 2021-05-21 13:42:29 UTC
(In reply to Martin Sivák from comment #17)
> I just double checked what we do and OCP disables load balancing per-cpu
> using /proc/sys/kernel/sched_domain/cpuX/.../flags (bit 0). Will that keep
> working?

Nope. sched_domain flags are read-only upstream now (so they'll be read-only in RHEL9
as well).

Comment 19 Phil Auld 2021-05-21 13:44:15 UTC
And the load balance flag and the code referencing it have been removed.

Comment 20 Marcelo Tosatti 2021-05-21 14:59:48 UTC
(In reply to Waiman Long from comment #16)
> (In reply to Martin Sivák from comment #15)
> > (In reply to Waiman Long from comment #11)
> > > as people who wanted to use isolated cpus could use the "isolcpus"
> > > command line option which work removing cpus that are available in the
> > > cpuset of the root cgroup.
> > 
> > 
> > (In reply to Phil Auld from comment #9)
> > > Layered products should still be able to use the static boot time isolation
> > > setup for rhel-9.  Making that dynamic is a future feature but beyond the
> > > scope of this bz.
> > 
> > 
> > Sadly, this not an option for OpenShift. The kubernetes cpu manager logic is
> > dynamic. It allocates cpus for containers as they are started and reclaims
> > them when then end. Static partitioning does not really work for this use
> > case.
> 
> So OpenShift wants the ability to dynamically disable load balancing within
> a partition. Right? If it is the case, we need a good use case to convince
> upstream that it is a feature that is worth to have.
> 
> Cheers,
> LOngman

Two points:

A) 

"Description of problem:

Attempts at using either isolcpus=domain,<cpus> and removal of SCHED_DOMAIN_BALANCE from sched-domains are no possible on RHEL9 due to more recent kernel which does not have these features.

These features are 100% necessary to Telco customers who require very low latency.  Not having this capability in RHEL9 risks low adoption."

With nohz_full=, given that the conditions to enter NOHZ_FULL mode on the isolated CPUs are met:

    1) A single SCHED_OTHER process is on the runqueue.
    2) A SCHED_FIFO process is running.

(and the fixes to keep the scheduler tick disabled on the CPU are backported: https://bugzilla.redhat.com/show_bug.cgi?id=1962632),
what is the problem with load balancing being enabled?

B)

Even if for some unknown reason the scheduler tick is enabled, with the cpumask of every thread in the system, relative
to isolated CPU 1 (for example), being either:

    1) task pinned to isolated CPU 1: cpumask=0x2
    2) any other task in the system:  cpumask has bit 1 _clear_.

it should be possible for the kernel to detect that moving tasks to/from CPU 1
is not possible and therefore load balancing can be automatically disabled. 

Or am i missing something?

Comment 21 Joe Mario 2021-05-21 15:10:01 UTC
Hi Marcelo:
 > Attempts at using either isolcpus=domain,<cpus> and removal of SCHED_DOMAIN_BALANCE from sched-domains are no possible on RHEL9 due to more recent kernel which does not have these features.

Yes, the removal of SCHED_DOMAIN_BALANCE is a big problem needing to be solved.

But I thought that isolcpus=domain,<cpus> will be there for RHEL-9, even though it's been listed as deprecated.
Are there new plans to remove it?

Joe

Comment 22 Phil Auld 2021-05-21 15:19:53 UTC
Joe, no, isolcpus=domain is not going away.  But that does not work for the dynamic OCP cases.

Marcelo, yes, I agree that if B is true then tasks should not end up on the isolated cpus. Is B true in these use cases? 

I think the issue is that, even without tasks actually being placed wrongly, the kernel at times would run the load-balancing code on the isolated cpus (when going idle, for example), and that by itself would cause latency, because it can be a heavyweight operation as it tries to look for a task it can pull over (even though affinities would cause it to fail in the end).  This may be better in current kernels than it has been in the past, but I think it is still possible.

At least that's my understanding.

Comment 23 Marcelo Tosatti 2021-05-21 15:37:18 UTC
(In reply to Phil Auld from comment #22)
> Joe, no isolcpus=domain is not going away.  But that does not work for the
> dynamic OCP cases.
> 
> Marcelo, yes, I agree that if B is true then tasks should not end up on the
> isolated cpus. Is B true in these use cases? 

Phil,

For the FlexRAN/DPDK workload types, yes.

Consider isolated CPU X. For all tasks that have cpumask with bit X set,
only one is going to switch to runnable STATE (and that is the task supposed
to be isolated on that CPU).

> I think that the issue is that even without tasks actually being placed
> wrong the kernel at times would run the load balancing code (maybe when
> going idle for example) on the isolated cpus and that by itself would cause
> latency because it could be a heavy weight operation as it tries to look for
> a task it can pull over (even though affinities would cause it to fail in
> the end).  This may be better in current kernels that it has been in the
> past, but I think it is still possible. 

Yes.

> 
> At least that's my understanding.

Mine as well.

Comment 27 Andrew Theurer 2021-05-25 01:11:14 UTC
(In reply to Marcelo Tosatti from comment #20)

> Two points:
> 
> A) 
> 
> "Description of problem:
> 
> Attempts at using either isolcpus=domain,<cpus> and removal of
> SCHED_DOMAIN_BALANCE from sched-domains are no possible on RHEL9 due to more
> recent kernel which does not have these features.
> 
> These features are 100% necessary to Telco customers who require very low
> latency.  Not having this capability in RHEL9 risks low adoption."
> 
> With nohz_full=, given that the conditions to enter NOHZ_FULL mode on the
> isolated CPUs are met:
> 
>     1) A single SCHED_OTHER process is on the runqueue.
>     2) A SCHED_FIFO process is running.
> 
> (and the fixes to maintain CPU with scheduler tick disabled are backported:
> https://bugzilla.redhat.com/show_bug.cgi?id=1962632), 
> what is the problem of load balancing being enabled?

If there is truly no timer at all (not even 1 Hz), then you are correct, there is not a problem here.

> 
> B)
> 
> Even if for some unknown reason the scheduler tick is enabled, with the
> cpumask of all threads in the system, relative
> to isolated CPU 1 (for example) being either:
> 
>     1) task pinned to isolated CPU 1: cpumask=0x2
>     2) any other task in the system:  cpumask has bit 1 _clear_.
> 
> it should be possible for the kernel to detect that moving tasks to/from CPU
> 1
> is not possible and therefore load balancing can be automatically disabled.

Yes, I would hope that could be the case.  However, it is not the act of a task migration that is the problem - that will never happen, for the reason you mention.  It is the time it takes to discover that there are no other tasks with that bit set in their cpus_allowed mask.

This is sometimes a very short period of time because the scheduler will only look at a subset of run-queues.  However, less frequently it may scan all runqueues in the host, and that's when the search for a task to migrate takes a long time.

Comment 28 Andrew Theurer 2021-05-25 01:30:37 UTC
I feel I need to clarify what this BZ was originally for: it was to reduce the maximum time spent in a timer interrupt, due to a sometimes long *search* for a candidate task to pull over to a CPU that was otherwise running just 1 sched_fifo task, with that task's requirement to not be interrupted for more than ~20 usec.

The requirement about restricting the initial placement of a container's first process is not a scheduler problem, and disabling load-balancing to achieve that is IMO not a good idea.  To achieve this [new, different] customer requirement, we need (a) a set of dedicated CPUs that anything inside the container can taskset to, and (b) when the user requests it: ensure the initial process is taskset to a subset of (a)'s CPUs (the container-housekeeping cpus), preferably the number of CPUs chosen by the user.

Providing "a" is already done with CPU-manager via CPUsets.  Providing "b" should be left up to whatever executes the first process for the container (crio?).  As the initial process runs, whatever it forks/execs/threads will stay on the same CPUs as the parent.  Only when a child task calls sched_set_affinity can it be migrated to a CPU (from a cpumask it provides, which must be same or less than the CPUset's cpulist). The cpumask the child uses just needs to be outside of the container-houskeeping cpus and contain 1 or more cpus from the isolated-container cpus.

The only other thing I see as necessary is: whatever is running in the container needs to know the difference between its current cpus_allowed mask and all of the cpus it can really run on (all the cpus for this CPUset).  Just looking at cpus_allowed to know where you can run is not enough, because it has shrunk to only the container-housekeeping cpus.  The task needs to query the CPUset to know what other cpus are available.
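
For what it's worth, a rough sketch of how a task inside the container could look up both pieces of information, assuming the cgroup v2 unified hierarchy is visible at /sys/fs/cgroup (how much of the tree a container actually sees depends on the runtime's cgroup/namespace setup):

  grep Cpus_allowed_list /proc/self/status               # cpus the task may run on right now
  cgpath=$(grep '^0::' /proc/self/cgroup | cut -d: -f3-)
  cat "/sys/fs/cgroup${cgpath}/cpuset.cpus.effective"    # all cpus of the CPUset it lives in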

Comment 61 errata-xmlrpc 2023-05-09 07:55:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2023:2458