Bug 1609785
Summary: | error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d3\x2dvnuma.scope/vcpu0/cpuset.cpus': Permission denied | |
---|---|---|---|
Product: | [Community] Virtualization Tools | Reporter: | Wim ten Have <whaveten>
Component: | libvirt | Assignee: | Libvirt Maintainers <libvirt-maint>
Status: | CLOSED CANTFIX | Severity: | high
Priority: | unspecified | Version: | unspecified
Hardware: | x86_64 | OS: | Linux
Type: | Bug | Last Closed: | 2020-11-03 12:55:29 UTC
CC: | berrange, jaak, jen, libvirt-maint, muriloo, pmarciniak, wim.ten.have, yuhuang, zhenyzha | |
Description
Wim ten Have
2018-07-30 12:40:49 UTC
This is a known serious design flaw in the cgroups v1 cpuset controller. It uses a single bitmap to record both the group's configured CPU mask and the set of currently available CPUs. As a result, offlining a CPU destroys the configured CPU mask, and the kernel does not repopulate the mask when the CPU is onlined again. It should be possible to work around this by manually re-adding the onlined CPUs to the cgroup CPU mask, but this must be done recursively across the whole hierarchy, starting at the top level (a sketch of such a recursive re-add follows at the end of this report). This is fixed in cgroups v2, but the kernel developers will not backport the fix to cgroups v1 because it is a semantic change in behaviour, despite being behaviour that all users really want.

Hit the issue with: RHEL-8.2.0-20191219.0-x86_64.qcow2

Hardware information:

```
Guest OS:      RHEL-8.2.0-20191219.0-ppc64le.qcow2
Guest kernel:  4.18.0-167.el8.x86_64
Host kernel:   4.18.0-167.el8.x86_64
virsh version: 5.10.0
qemu:          qemu-kvm-4.2.0-5.module+el8.2.0+5389+367d9739
Host OS:       RHEL-8.2.0-20191219.0-ppc64le.qcow2
Host name:     hp-dl388pg8-01.lab.eng.pek2.redhat.com
```

1. Define a KVM guest domain on NUMA hardware with the pinning and (specific) NUMA topology below. The host is a 2-node NUMA setup with 32 CPUs; see the vcpupin detail below to understand its topology. We simply set up a mini NUMA architecture for 5 vcpus, where vcpus 0 and 4 hold affinity with guest NODE0, vcpu 1 with NODE1, vcpu 2 with NODE2, and vcpu 3 with NODE3.

```xml
<vcpu placement='static' current='5'>8</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='0-2,20-22'/>
  <vcpupin vcpu='1' cpuset='3-5,23-25'/>
  <vcpupin vcpu='2' cpuset='6-8,26-28'/>
  <vcpupin vcpu='3' cpuset='9-11,29-31'/>
  <vcpupin vcpu='4' cpuset='0-2,20-22'/>
</cputune>
<os>
  <type arch='x86_64' machine='pc-q35-rhel8.1.0'>hvm</type>
  <boot dev='hd'/>
</os>
<cpu mode='host-model' check='none'>
  <model fallback='allow'/>
  <topology sockets='4' cores='1' threads='2'/>
  <numa>
    <cell id='0' cpus='0,4' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='10'/>
        <sibling id='1' value='21'/>
      </distances>
    </cell>
    <cell id='1' cpus='1' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='10'/>
      </distances>
    </cell>
    <cell id='2' cpus='2' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='31'/>
        <sibling id='1' value='21'/>
      </distances>
    </cell>
    <cell id='3' cpus='3' memory='2097152' unit='KiB'>
      <distances>
        <sibling id='0' value='21'/>
        <sibling id='1' value='31'/>
      </distances>
    </cell>
  </numa>
</cpu>
```

```
sudo virsh start avocado-vt-vm1
sudo virsh shutdown avocado-vt-vm1
```

2. With the 'vnuma' guest domain shut down, take all physical NUMA-node CPUs offline apart from those running under NODE0.

```
# lscpu
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              32
On-line CPU(s) list: 0-31
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               62
Model name:          Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz
Stepping:            4
CPU MHz:             1197.685
CPU max MHz:         2500.0000
CPU min MHz:         1200.0000
BogoMIPS:            3990.76
Virtualization:      VT-x
L1d cache:           32K
L1i cache:           32K
L2 cache:            256K
L3 cache:            20480K
NUMA node0 CPU(s):   0-7,16-23
NUMA node1 CPU(s):   8-15,24-31

# chcpu -d 8-31
CPU 8 disabled
CPU 9 disabled
CPU 10 disabled
CPU 11 disabled
CPU 12 disabled
CPU 13 disabled
```

3. Try to start the 'vnuma' guest domain again (see step 1).

```
# virsh start --console avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Invalid value '0-2,20-22' for 'cpuset.cpus': Invalid argument
```

4. Try to re-online the physical CPUs.
```
# chcpu -e 8-31
# virsh start --console avocado-vt-vm1
error: Failed to start domain avocado-vt-vm1
error: Unable to write to '/sys/fs/cgroup/cpuset/machine.slice/machine-qemu\x2d1\x2davocado\x2dvt\x2dvm1.scope/vcpu0/cpuset.cpus': Permission denied
```

Is it possible to document this behaviour, as explained in comment 1, in the knowledge base?

Closing because there is nothing practical libvirt can do to fix this in cgroups v1; cgroups v2 should already work correctly, and since cgroups v2 is now the recommended platform, this should increasingly become a non-issue for users. (Sketches of the failure mode, the manual workaround, and a cgroups v2 check follow below.)
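To make the failure mode concrete: the destroyed mask can be observed directly in the v1 cpuset hierarchy. A minimal sketch assuming the default machine.slice layout and the 32-CPU host from this report; the values shown are illustrative, not captured from the reporter's machine:

```sh
# Before offlining, the parent cpuset covers every host CPU.
cat /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus   # e.g. 0-31

# Offline CPUs: the kernel silently strips them from every
# cpuset in the hierarchy.
chcpu -d 8-31
cat /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus   # now 0-7

# Re-online them: the configured mask is NOT restored.
chcpu -e 8-31
cat /sys/fs/cgroup/cpuset/machine.slice/cpuset.cpus   # still 0-7

# So libvirt's attempt to write '0-2,20-22' into a child vcpu
# cpuset fails, because 20-22 are missing from the parent mask.
```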
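A sketch of the manual workaround mentioned in the explanation above: re-add the onlined CPUs to every cpuset, walking the hierarchy top-down, because a child cpuset may never contain a CPU that its parent lacks. The script name and arguments are hypothetical; note that writing one mask everywhere also resets any per-group pinning, which then has to be reapplied (e.g. with virsh vcpupin):

```sh
#!/bin/sh
# restore-cpuset.sh (hypothetical): widen cpuset.cpus across a
# cgroup v1 cpuset hierarchy, parents first.
# Usage: restore-cpuset.sh /sys/fs/cgroup/cpuset/machine.slice 0-31
root=$1
cpus=$2

# Plain 'find' visits a directory before its children, which is
# exactly the top-down order the kernel requires here. The mask
# must already be present in the parent of $root itself.
find "$root" -type d | while read -r dir; do
    echo "$cpus" > "$dir/cpuset.cpus" || echo "failed: $dir" >&2
done
```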
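Since the resolution points at cgroups v2, a quick way to check which hierarchy a host is running, plus the standard systemd switch for enabling the unified hierarchy (the grubby invocation assumes a RHEL-style host):

```sh
# cgroup2fs means the unified (v2) hierarchy is mounted;
# tmpfs means a legacy (v1) split-controller setup.
stat -fc %T /sys/fs/cgroup

# Enable the unified hierarchy on a systemd host by adding the
# switch to the kernel command line, e.g. with grubby on RHEL:
grubby --update-kernel=ALL --args=systemd.unified_cgroup_hierarchy=1
```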