* Description of problem:

Starting <cpu ... <numa> guest domains on systems with unavailable CPU resources causes libvirt to get into an unworkable state, reporting:

error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d3\x2dvnuma.scope/vcpu0/cpuset.cpus': Permission denied

It seems it is not possible to regain functionality by adding the missing CPU resources back and restarting libvirt.

* Version-Release number of selected component (if applicable):

Latest Fedora (from F26, F27, F28, ... CentOS, RHT?)

* How reproducible:

100%

* Steps to Reproduce:

1. Administrate a KVM guest domain based on NUMA h/w with the pinning and (specific) NUMA topology below. This is a 4-node NUMA setup with 120 CPUs. See the vcpupin detail below to understand its topology. We simply set up a mini NUMA architecture for 5 vcpus where vcpus 0 and 4 are set to hold affinity with NODE0, 1 with NODE1, 2 with NODE2 and 3 with NODE3.

<domain type='kvm'>
  <name>vnuma</name>
  ...
  <vcpu placement='static' current='5'>8</vcpu>
  <cputune>
    <vcpupin vcpu='0' cpuset='0-14,60-74'/>
    <vcpupin vcpu='1' cpuset='15-29,75-89'/>
    <vcpupin vcpu='2' cpuset='30-44,90-104'/>
    <vcpupin vcpu='3' cpuset='45-59,105-119'/>
    <vcpupin vcpu='4' cpuset='0-14,60-74'/>
  </cputune>
  ...
  <cpu mode='host-model' check='none'>
    <model fallback='allow'/>
    <topology sockets='4' cores='1' threads='2'/>
    <numa>
      <cell id='0' cpus='0,4' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='10'/>
          <sibling id='1' value='21'/>
          <sibling id='2' value='31'/>
          <sibling id='3' value='21'/>
        </distances>
      </cell>
      <cell id='1' cpus='1' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='10'/>
          <sibling id='2' value='21'/>
          <sibling id='3' value='31'/>
        </distances>
      </cell>
      <cell id='2' cpus='2' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='31'/>
          <sibling id='1' value='21'/>
          <sibling id='2' value='10'/>
          <sibling id='3' value='21'/>
        </distances>
      </cell>
      <cell id='3' cpus='3' memory='2097152' unit='KiB'>
        <distances>
          <sibling id='0' value='21'/>
          <sibling id='1' value='31'/>
          <sibling id='2' value='21'/>
          <sibling id='3' value='10'/>
        </distances>
      </cell>
    </numa>
  </cpu>
  ..
</domain>

Before going to step 2, test that the above works. It should: you should now be able to create a guest domain based on the 4 administrated NUMA cells. Simply said, the following should work before continuing:

sudo virsh start vnuma
sudo virsh shutdown vnuma

2. Have the 'vnuma' guest domain shut down. Take all physical NUMA node CPUs offline apart from those under NODE0.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                120
On-line CPU(s) list:   0-119
Thread(s) per core:    2
Core(s) per socket:    15
Socket(s):             4
NUMA node(s):          4
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E7-8895 v2 @ 2.80GHz
Stepping:              7
CPU MHz:               3529.736
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              5586.77
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              38400K
NUMA node0 CPU(s):     0-14,60-74
NUMA node1 CPU(s):     15-29,75-89
NUMA node2 CPU(s):     30-44,90-104
NUMA node3 CPU(s):     45-59,105-119

# chcpu -d 15-119
CPU 15 disabled
CPU 16 disabled
CPU 17 disabled
CPU 18 disabled
...
CPU 119 disabled

3. Try to start the 'vnuma' guest domain again (see 1.)

virsh # start --console vnuma
error: Failed to start domain vnuma
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d3\x2dvnuma.scope/vcpu0/cpuset.cpus': Permission denied

4. Try to reonline the physical CPUs.
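Presumably by re-enabling the same range that was disabled in step 2, e.g.:

# chcpu -e 15-119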
Try to restart libvirt ... and find that your last resort is to restart the entire machine (BAD) :-(

* Actual results:

Broken ... not highly available. The machine needs a reboot to recover.

* Expected results:

Expected not to see this bug. Failing that, expected a reasonable error message, and a way to recover without rebooting the machine.

* Additional info:

none.
This is a known, serious design flaw in the cgroups v1 cpuset controller. It uses a single bitmap to record both the group's configured CPU mask and the set of currently available CPUs. As a result, when you offline CPUs it destroys the configured CPU mask, and it does not repopulate the mask when you online the CPUs again. You should be able to fix this by manually re-adding the onlined CPUs to the cgroup CPU mask, but you must do this recursively across the whole hierarchy, starting at the top level. Apparently this is fixed in cgroups v2, but the kernel devs won't backport it to cgroups v1 because it is a semantic change in behaviour, despite it being something that all users really want.
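To illustrate that manual recovery (this is not something libvirt does itself): the widening has to happen top-down, because in cgroups v1 a child's cpuset.cpus must remain a subset of its parent's. A rough sketch, assuming the cpuset controller is mounted at the usual /sys/fs/cgroup/cpuset and using the CPU range and machine.slice path from the original report (paths and ranges will differ per system):

# after re-onlining the CPUs, e.g. chcpu -e 15-119, widen the
# persistent parent group first, then any deeper groups
echo 0-119 > /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus
# repeat for any existing machine-*.scope directories and their
# emulator/vcpu* subgroups, then start the guest again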
Hit the issue with: RHEL-8.2.0-20191219.0-x86_64.qcow2

Hardware information:
Guest OS:       RHEL-8.2.0-20191219.0-ppc64le.qcow2
Guest Kernel:   4.18.0-167.el8.x86_64
Host Kernel:    4.18.0-167.el8.x86_64
virsh -version: 5.10.0
qemu:           qemu-kvm-4.2.0-5.module+el8.2.0+5389+367d9739
Host OS:        RHEL-8.2.0-20191219.0-ppc64le.qcow2
Host Name:      hp-dl388pg8-01.lab.eng.pek2.redhat.com

1. Administrate a KVM guest domain based on NUMA h/w with the pinning and (specific) NUMA topology below. This is a 2-node NUMA setup with 32 CPUs. See the vcpupin detail below to understand its topology. We simply set up a mini NUMA architecture for 5 vcpus where vcpu 0 is set to hold affinity with NODE0 and vcpu 1 with NODE1.

<vcpu placement='static' current='5'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0-2,20-22'/>
  <vcpupin vcpu='1' cpuset='3-5,23-25'/>
  <vcpupin vcpu='2' cpuset='6-8,26-28'/>
  <vcpupin vcpu='3' cpuset='9-11,29-31'/>
  <vcpupin vcpu='4' cpuset='0-2,20-22'/>
</cputune>
<os>
  <type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
  <boot dev='hd'/>
</os>
<cpu mode='host-model' check='none'>
  <model fallback='allow'/>
  <topology sockets='4' cores='1' threads='2'/>
  <numa>
    <cell id='0' cpus='0,4' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='21'/>
      </distances>
    </cell>
    <cell id='1' cpus='1' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='10'/>
      </distances>
    </cell>
    <cell id='2' cpus='2' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='31'/>
        <sibling id='1' value='21'/>
      </distances>
    </cell>
    <cell id='3' cpus='3' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='31'/>
      </distances>
    </cell>
  </numa>
</cpu>

sudo virsh start avocado-vt-vm1
sudo virsh shutdown avocado-vt-vm1

2. Have the guest domain shut down. Take all physical NUMA node CPUs offline apart from those under NODE0.

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                32
On-line CPU(s) list:   0-31
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
Stepping:              4
CPU MHz:               1197.685
CPU max MHz:           2500.0000
CPU min MHz:           1200.0000
BogoMIPS:              3990.76
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              20480K
NUMA node0 CPU(s):     0-7,16-23
NUMA node1 CPU(s):     8-15,24-31

# chcpu -d 8-31
CPU 8 disabled
CPU 9 disabled
CPU 10 disabled
CPU 11 disabled
CPU 12 disabled
CPU 13 disabled

3. Try to start the guest domain again (see 1.)

# virsh start --console avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Invalid value '0-2,20-22' for 'cpuset.cpus': Invalid argument

4. Try to reonline the physical CPUs.

# chcpu -e 8-31

# virsh start --console avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d1\x2davocado\x2dvt\x2dvm1.scope/vcpu0/cpuset.cpus': Permission denied
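FWIW, the trimmed mask described in comment 1 can be seen directly in the persistent parent cgroup; a quick check, assuming the default cgroup v1 mount point and that machine.slice still carries the old mask:

# cat /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus

On an affected host this will still list only the CPUs that stayed online during the chcpu -d step, even after chcpu -e has brought the others back, which is why the start then fails writing the vcpu pin.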
Is it possible to document this behaviour, as explained in comment 1, in the knowledge base?
Closing because there's nothing practical libvirt can do to fix this in cgroups v1, but cgroups v2 should already work correctly. Since cgroups v2 is now the recommended platform, this should increasingly become a non-issue for users.
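For anyone checking which setup a given host is on: the unified (v2) hierarchy is mounted as a cgroup2 filesystem directly at /sys/fs/cgroup, so a quick test is:

# stat -fc %T /sys/fs/cgroup/

This prints "cgroup2fs" on a cgroups v2 host; "tmpfs" indicates the legacy v1 layout with per-controller mounts underneath.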
Clearing old needinfo request. Please restore if you would like the question in comment 3 to be revisited.