Bug 1813395

Summary: Unable to start guest with a reduced number of active CPUs and multiple dies
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Daniel Berrangé <berrange>
Component: libvirt
Assignee: Daniel Berrangé <berrange>
Status: CLOSED ERRATA
QA Contact: jiyan <jiyan>
Severity: medium
Priority: medium
Version: 8.2
CC: dyuan, jdenemar, jiyan, jsuchane, lhuang, lmen, mtessun, pkrempa, toneata, virt-maint, xuzhang
Target Milestone: rc
Keywords: Triaged, ZStream
Target Release: 8.3
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-6.0.0-19.el8
Clones: 1821592
Last Closed: 2020-07-28 07:12:15 UTC
Type: Bug
Bug Blocks: 1785207, 1819060, 1821592    

Description Daniel Berrangé 2020-03-13 16:52:14 UTC
Description of problem:
When configuring a guest with multiple 'dies' in the CPU topology

   <cpu...>
    <topology sockets='8' dies='2' cores='4' threads='2'/>
   </cpu>
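As a rough illustration (not libvirt or QEMU code), this topology describes 8 x 2 x 4 x 2 = 128 vCPU slots, which matches the maximum of 128 used below. The hypothetical `topo_ids()` helper maps a linear vCPU index to (socket, die, core, thread), assuming thread IDs vary fastest, then cores, then dies, then sockets, which is the ordering assumed here for QEMU's x86 topology code:

```python
# Hypothetical helper, not libvirt or QEMU code: split a linear vCPU index
# into (socket, die, core, thread). Assumes threads vary fastest, then cores,
# then dies, then sockets.
def topo_ids(index, dies, cores, threads):
    thread = index % threads
    core = (index // threads) % cores
    die = (index // (threads * cores)) % dies
    socket = index // (threads * cores * dies)
    return socket, die, core, thread


SOCKETS, DIES, CORES, THREADS = 8, 2, 4, 2
MAX_VCPUS = SOCKETS * DIES * CORES * THREADS  # 128 vCPU slots in total

print(MAX_VCPUS)                           # 128
print(topo_ids(48, DIES, CORES, THREADS))  # (3, 0, 0, 0): first vCPU of socket 3
print(topo_ids(56, DIES, CORES, THREADS))  # (3, 1, 0, 0): same socket/core/thread, die 1
```

Under this assumed ordering, vCPU 48 (the one named in the error message in this report) and vCPU 56 differ only in their die.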

When the guest is started with a reduced number of active CPUs (to allow for later hotplug), it sometimes fails to start.

For example, this works:

  <vcpu placement='static' current='48'>128</vcpu>

but this fails:

  <vcpu placement='static' current='52'>128</vcpu>

  # virsh start test82 
  error: Failed to start domain test82
  error: internal error: qemu didn't report thread id for vcpu '48'

Likewise, these fail:

  <vcpu placement='static' current='49'>128</vcpu>

and

  <vcpu placement='static' current='47'>128</vcpu>

as do certain other 'current' values.

The root cause is that libvirt does not take die_id into account when matching up the vCPUs reported by QEMU.
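The effect of ignoring die_id can be illustrated with a toy model. This is not libvirt's actual matching code: the hypothetical `build_tid_map()` pretends QEMU reports one fake thread ID (1000 + index) per active vCPU and indexes them by topology key. Without the die in the key, two vCPUs in different dies of the same socket collapse onto one entry, so a lookup can return another vCPU's thread ID. The model only shows the ambiguity; it does not predict which 'current' values make the start fail.

```python
# Toy model, not libvirt code: index fake per-vCPU thread IDs by topology key.
def topo_ids(index, dies, cores, threads):
    # Assumed ordering: threads fastest, then cores, dies, sockets.
    thread = index % threads
    core = (index // threads) % cores
    die = (index // (threads * cores)) % dies
    socket = index // (threads * cores * dies)
    return socket, die, core, thread


def build_tid_map(active, dies, cores, threads, use_die):
    tids = {}
    for index in range(active):
        socket, die, core, thread = topo_ids(index, dies, cores, threads)
        key = (socket, die, core, thread) if use_die else (socket, core, thread)
        # Without die_id, a later vCPU with the same key overwrites an earlier one.
        tids[key] = 1000 + index
    return tids


buggy = build_tid_map(52, dies=2, cores=4, threads=2, use_die=False)
fixed = build_tid_map(52, dies=2, cores=4, threads=2, use_die=True)

print(len(buggy), len(fixed))  # 28 52
print(buggy[(0, 0, 0)])        # 1008: vCPU 0's key now points at vCPU 8
print(fixed[(0, 0, 0, 0)])     # 1000: vCPU 0 keeps its own thread ID
```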


Version-Release number of selected component (if applicable):
qemu-kvm-4.2.0-13.module+el8.2.0+5898+fb4bceae.x86_64
kernel-4.18.0-187.el8.x86_64
libvirt-6.0.0-10.module+el8.2.0+5984+dce93708.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure a guest with

  <cpu...>
    <topology sockets='8' dies='2' cores='4' threads='2'/>
  </cpu>
  <vcpu placement='static' current='47'>128</vcpu>
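The arithmetic of this configuration can be sanity-checked before starting the guest: the topology product (8 x 2 x 4 x 2 = 128) should equal the maximum vCPU count, and 'current' must not exceed it. A minimal sketch using Python's standard `xml.etree.ElementTree`; the `<domain>` wrapper and the `check_vcpu_config()` helper are illustrative assumptions, and this is not a substitute for libvirt's own validation:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment based on the reproducer; the <domain> wrapper only
# exists so that the snippet parses as one XML document.
DOMAIN_XML = """
<domain>
  <cpu>
    <topology sockets='8' dies='2' cores='4' threads='2'/>
  </cpu>
  <vcpu placement='static' current='47'>128</vcpu>
</domain>
""".strip()


def check_vcpu_config(xml_text):
    """Return (maximum, current, topology product) for a domain fragment."""
    root = ET.fromstring(xml_text)
    vcpu = root.find("vcpu")
    maximum = int(vcpu.text)
    # A missing 'current' attribute means all vCPUs start online.
    current = int(vcpu.get("current", maximum))
    topology = root.find("cpu/topology")
    product = 1
    for attr in ("sockets", "dies", "cores", "threads"):
        product *= int(topology.get(attr, "1"))  # treat a missing attribute as 1
    return maximum, current, product


maximum, current, product = check_vcpu_config(DOMAIN_XML)
assert product == maximum, "topology does not cover every vCPU slot"
assert 1 <= current <= maximum
print(maximum, current, product)  # 128 47 128
```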

2. Start the guest

Actual results:
It fails to start

Expected results:
It starts successfully

Additional info:

Comment 1 Daniel Berrangé 2020-03-13 16:53:14 UTC
Fix posted at

https://www.redhat.com/archives/libvir-list/2020-March/msg00509.html

Comment 6 Eduardo Habkost 2020-04-01 22:11:40 UTC
Upstream commit:
commit 8b789c657445 ("qemu: fix detection of vCPU pids when multiple dies are present")

Comment 12 jiyan 2020-05-15 08:39:38 UTC
Reproduced this bug with libvirt-6.0.0-18.module+el8.2.1+6456+a6d62e4e.x86_64, and verified this bug with libvirt-6.0.0-19.module+el8.2.1+6538+c148631f.x86_64.

Version:
libvirt-6.0.0-18.module+el8.2.1+6456+a6d62e4e.x86_64
qemu-kvm-4.2.0-21.module+el8.2.1+6586+8b7713b9.x86_64
kernel-4.18.0-193.2.1.el8_2.x86_64

Steps:
1. Prepare a shut-off VM with the following configuration:
# virsh domstate test82
shut off

# virsh dumpxml test82 --inactive
...
  <vcpu placement='static' current='96'>144</vcpu>
  <iothreads>2</iothreads>
  <iothreadids>
    <iothread id='2'/>
    <iothread id='1'/>
  </iothreadids>
  <cputune>
    <shares>2048</shares>
    <vcpupin vcpu='2' cpuset='0-7'/>
    <vcpupin vcpu='19' cpuset='7,170,191'/>
    <emulatorpin cpuset='1-3'/>
    <iothreadpin iothread='2' cpuset='7-8'/>
    <iothreadpin iothread='1' cpuset='5-6'/>
  </cputune>
  <numatune>
    <memory mode='strict' nodeset='0-2'/>
    <memnode cellid='0' mode='strict' nodeset='1'/>
    <memnode cellid='2' mode='preferred' nodeset='2'/>
  </numatune>
...
  <cpu mode='host-model' check='partial'>
    <topology sockets='4' dies='3' cores='4' threads='3'/>
    <numa>
      <cell id='0' cpus='0,4-7' memory='512000' unit='KiB'/>
      <cell id='1' cpus='1,8-10,12-15' memory='512000' unit='KiB' memAccess='shared'>
        <distances>
          <sibling id='1' value='10'/>
        </distances>
      </cell>
      <cell id='2' cpus='2,11' memory='512000' unit='KiB' memAccess='shared'>
        <distances>
          <sibling id='2' value='10'/>
        </distances>
      </cell>
      <cell id='3' cpus='3' memory='512000' unit='KiB'/>
    </numa>
  </cpu>

2. Start the VM
# virsh start test82
error: Failed to start domain test82
error: internal error: qemu didn't report thread id for vcpu '72'
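With the topology used in this verification (sockets='4' dies='3' cores='4' threads='3', i.e. 144 slots) and current='96', the same assumed index ordering places vCPU 72 at socket 2, die 0, core 0, thread 0. The hypothetical helper below lists the active vCPUs that share vCPU 72's socket/core/thread key once die_id is dropped; more than one match means that key alone cannot identify the vCPU. It is an illustration, not libvirt's code.

```python
def topo_ids(index, dies, cores, threads):
    # Assumed ordering: threads fastest, then cores, dies, sockets.
    thread = index % threads
    core = (index // threads) % cores
    die = (index // (threads * cores)) % dies
    socket = index // (threads * cores * dies)
    return socket, die, core, thread


def without_die(ids):
    socket, _die, core, thread = ids
    return socket, core, thread


def vcpus_sharing_key(target, active, dies, cores, threads):
    """Active vCPUs whose (socket, core, thread) equals the target's."""
    key = without_die(topo_ids(target, dies, cores, threads))
    return [i for i in range(active)
            if without_die(topo_ids(i, dies, cores, threads)) == key]


print(topo_ids(72, dies=3, cores=4, threads=3))                      # (2, 0, 0, 0)
print(vcpus_sharing_key(72, active=96, dies=3, cores=4, threads=3))  # [72, 84]
```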

3. Upgrade libvirt and restart libvirtd
# yum upgrade libvirt* -y

# systemctl restart libvirtd

# rpm -qa libvirt
libvirt-6.0.0-19.module+el8.2.1+6538+c148631f.x86_64

4. Repeat step 2
# virsh start test82
Domain test82 started

# virsh vcpucount test82 
maximum      config       144
maximum      live         144
current      config        96
current      live          96

# ps -ef | grep test82
-smp 96,maxcpus=144,sockets=4,dies=3,cores=4,threads=3
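The generated -smp argument follows the same arithmetic: 4 x 3 x 4 x 3 = 144 equals maxcpus, and 96 vCPUs are online at boot. A small, hypothetical parser for this one argument (it does not handle every form QEMU accepts):

```python
def parse_smp(arg):
    """Parse an -smp value such as '96,maxcpus=144,sockets=4,...' into a dict."""
    opts = {}
    for field in arg.split(","):
        if "=" in field:
            name, value = field.split("=", 1)
        else:
            name, value = "cpus", field  # a bare leading number is the CPU count
        opts[name] = int(value)
    return opts


smp = parse_smp("96,maxcpus=144,sockets=4,dies=3,cores=4,threads=3")
slots = smp["sockets"] * smp["dies"] * smp["cores"] * smp["threads"]
print(smp["cpus"], smp["maxcpus"], slots)  # 96 144 144
```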

# virsh console test82
Connected to domain test82
Escape character is ^]

Red Hat Enterprise Linux 8.2 (Ootpa)
Kernel 4.18.0-193.el8.x86_64 on an x86_64

localhost login: root
Password: 
[root@localhost ~]# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  12
Socket(s):           3
NUMA node(s):        4
Vendor ID:           GenuineIntel

5. Test hot-plugging/unplugging vCPUs
# virsh setvcpus test82 137

# virsh setvcpu test82 109 --disable

# virsh setvcpu test82 123 --disable

# virsh setvcpu test82 143 --enable

# virsh vcpucount test82 
maximum      config       144
maximum      live         144
current      config        96
current      live         136

# virsh console test82
Connected to domain test82
Escape character is ^]

[root@localhost ~]# lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              136
On-line CPU(s) list: 0-108,110-122,124-137
Thread(s) per core:  2
Core(s) per socket:  13
Socket(s):           4
NUMA node(s):        4
Vendor ID:           GenuineIntel
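The expected live count can be worked out from the commands in step 5: setvcpus brings 137 vCPUs online (assumed here to be libvirt IDs 0-136), two are disabled and vCPU 143 is enabled, giving 137 - 2 + 1 = 136, which matches "current live 136". The sketch below tracks libvirt vCPU IDs only; the guest's lscpu list differs in its last entry (137 rather than 143), presumably because the guest kernel assigns its own logical number to the hot-plugged CPU. `to_ranges()` is a hypothetical helper.

```python
def to_ranges(ids):
    """Format a set of integers as a compact list such as '0-3,5,7-9'."""
    ordered = sorted(ids)
    parts = []
    start = prev = ordered[0]
    for value in ordered[1:] + [None]:  # None flushes the final run
        if value is not None and value == prev + 1:
            prev = value
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        if value is not None:
            start = prev = value
    return ",".join(parts)


online = set(range(137))  # virsh setvcpus test82 137
online.discard(109)       # virsh setvcpu test82 109 --disable
online.discard(123)       # virsh setvcpu test82 123 --disable
online.add(143)           # virsh setvcpu test82 143 --enable

print(len(online))        # 136
print(to_ranges(online))  # 0-108,110-122,124-136,143
```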

All tests behave as expected; moving this bug to VERIFIED.

Comment 14 errata-xmlrpc 2020-07-28 07:12:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3172