Bug 1427801 - CPU hot-unplug throws "error: operation failed: vcpu unplug request timed out" and leaves the hotplug operation non-functional
Summary: CPU Hotunplug throws "error: operation failed: vcpu unplug request timed out"...
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: Virtualization Tools
Classification: Community
Component: libvirt
Version: unspecified
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Peter Krempa
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1428893
 
Reported: 2017-03-01 09:22 UTC by Satheesh Rajendran
Modified: 2017-06-21 16:46 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
: 1428893
Environment:
Last Closed: 2017-03-10 08:13:21 UTC
Embargoed:


Attachments (Terms of Use)
libvirtdlog (315.57 KB, text/plain)
2017-03-02 12:05 UTC, Satheesh Rajendran
libvirt daemon log (764.77 KB, text/plain)
2017-03-02 13:41 UTC, Satheesh Rajendran
libvirt daemon log (593.11 KB, application/x-gzip)
2017-03-03 04:06 UTC, Satheesh Rajendran

Description Satheesh Rajendran 2017-03-01 09:22:35 UTC
Description of problem:
CPU hot-unplug throws "error: operation failed: vcpu unplug request timed out" during the first iteration of the hotplug/unplug cycle and leaves CPU hotplug/unplug non-functional.

Version-Release number of selected component (if applicable):
Upstream libvirt compiled at
Repo: https://github.com/libvirt/libvirt.git 
Branch: master 
Commit: ca1f38545750d597c75c9773723c716483b03e5c
#libvirtd --version
libvirtd (libvirt) 3.1.0


Upstream qemu compiled at
Repo:  git://git.qemu.org/qemu.git 
Branch: master 
Commit: e7c83a885f865128ae3cf1946f8cb538b63cbfba
# qemu-kvm -version
QEMU emulator version 2.8.50 (v2.8.0-1684-ge7c83a8-dirty)
Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers


How reproducible:
Always

Steps to Reproduce:
1. Define a guest (Fedora 25) with 8 current vCPUs and a maximum of 240 vCPUs (2 sockets × 15 cores × 8 threads = 240), using the following topology:
  ...
  <vcpu placement='static' current='8'>240</vcpu>
  ...
  <cpu>
    <topology sockets='2' cores='15' threads='8'/>
  </cpu>
  ...
2. virsh setvcpus virt-tests-vm1 16 --live

3. Wait some time, then: virsh vcpucount virt-tests-vm1 --guest
16 -----OK

4. virsh setvcpus virt-tests-vm1 8 --live
error: operation failed: vcpu unplug request timed out --------NOK

5. Wait some time, then: virsh vcpucount virt-tests-vm1 --guest

8

6. virsh setvcpus virt-tests-vm1 16 --live 

7. Wait some time, then: virsh vcpucount virt-tests-vm1 --guest
16 -----NOK

8. virsh setvcpus virt-tests-vm1 8 --live
error: internal error: unable to execute QEMU command 'device_del': Device 'vcpu8' not found--------NOK

9. Wait some time, then: virsh vcpucount virt-tests-vm1 --guest

8
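
For context, the "device_del" failure in step 8 corresponds to the device-removal request libvirt issues to QEMU over QMP. A roughly equivalent manual invocation is sketched below for illustration only (the device id 'vcpu8' comes from the error message above; issuing this behind libvirt's back is not recommended):

# virsh qemu-monitor-command virt-tests-vm1 --pretty '{"execute":"device_del","arguments":{"id":"vcpu8"}}'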

Actual results:
CPU unplug errors out, leaving the hotplug/unplug operation non-functional.


Expected results:
CPU hotplug/unplug should work correctly.

Additional info:
The same behavior is also observed when using the setvcpu API.
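
For reference, the same unplug path can also be exercised per vCPU from the shell via the corresponding virsh command; a minimal sketch, assuming vCPU id 8 on the running guest:

# virsh setvcpu virt-tests-vm1 8 --disable --live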

Guest XML:
<domain type='kvm' id='5'>
  <name>virt-tests-vm1</name>
  <uuid>496d2299-4a36-4403-a583-6e07bec87070</uuid>
  <memory unit='KiB'>33554432</memory>
  <currentMemory unit='KiB'>33554432</currentMemory>
  <vcpu placement='static' current='8'>240</vcpu>
  <resource>
    <partition>/machine</partition>
  </resource>
  <os>
    <type arch='ppc64le' machine='pseries-2.9'>hvm</type>
    <boot dev='hd'/>
  </os>
  <cpu>
    <topology sockets='2' cores='15' threads='8'/>
  </cpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>restart</on_crash>
  <devices>
    <emulator>/usr/bin/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/home/sath/runAvocadoFVTTest/avocado-fvt-wrapper/data/avocado-vt/images/f25-ppc64le.qcow2'/>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
    <controller type='scsi' index='0' model='virtio-scsi'>
      <alias name='scsi0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
    </controller>
    <controller type='usb' index='0' model='pci-ohci'>
      <alias name='usb'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </controller>
    <controller type='pci' index='0' model='pci-root'>
      <alias name='pci.0'/>
    </controller>
    <controller type='virtio-serial' index='0'>
      <alias name='virtio-serial0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/>
    </controller>
    <interface type='bridge'>
      <mac address='52:54:00:61:62:63'/>
      <source bridge='virbr0'/>
      <target dev='vnet0'/>
      <model type='virtio'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/>
    </interface>
    <serial type='pty'>
      <source path='/dev/pts/0'/>
      <target port='0'/>
      <alias name='serial0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </serial>
    <console type='pty' tty='/dev/pts/0'>
      <source path='/dev/pts/0'/>
      <target type='serial' port='0'/>
      <alias name='serial0'/>
      <address type='spapr-vio' reg='0x30000000'/>
    </console>
    <channel type='unix'>
      <source mode='bind' path='/var/lib/libvirt/qemu/f25ppc64le.agent'/>
      <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/>
      <alias name='channel0'/>
      <address type='virtio-serial' controller='0' bus='0' port='1'/>
    </channel>
    <memballoon model='virtio'>
      <alias name='balloon0'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </memballoon>
    <panic model='pseries'/>
  </devices>
  <seclabel type='dynamic' model='selinux' relabel='yes'>
    <label>system_u:system_r:svirt_t:s0:c62,c845</label>
    <imagelabel>system_u:object_r:svirt_image_t:s0:c62,c845</imagelabel>
  </seclabel>
  <seclabel type='dynamic' model='dac' relabel='yes'>
    <label>+107:+107</label>
    <imagelabel>+107:+107</imagelabel>
  </seclabel>
</domain>
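
For completeness, XML like the above can be defined and booted along these lines (the file name is illustrative):

# virsh define virt-tests-vm1.xml
# virsh start virt-tests-vm1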

Comment 1 Satheesh Rajendran 2017-03-02 03:16:06 UTC
It is always reproducible:

# for i in {1..10}; do \
    virsh destroy virt-tests-vm1; virsh start virt-tests-vm1; sleep 120; \
    virsh setvcpus virt-tests-vm1 16 --live; sleep 120; virsh vcpucount virt-tests-vm1 --guest; \
    virsh setvcpus virt-tests-vm1 8 --live; sleep 120; virsh vcpucount virt-tests-vm1 --guest; \
  done
Domain virt-tests-vm1 destroyed

Domain virt-tests-vm1 started


16

error: operation failed: vcpu unplug request timed out

8

[the same output repeats for all 10 iterations]

Comment 4 Peter Krempa 2017-03-02 07:57:09 UTC
Please attach the libvirtd debug log captured while reproducing this bug.
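
For reference, daemon-side debug logging can be enabled in /etc/libvirt/libvirtd.conf along these lines (the filter and output values are illustrative; restart libvirtd after changing them):

log_filters="1:qemu 1:libvirt 4:object 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"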

Comment 5 Satheesh Rajendran 2017-03-02 12:05:49 UTC
Created attachment 1259113 [details]
libvirtdlog

Attached the libvirtd debug log for the following commands:
# virsh setvcpus virt-tests-vm1 16 --live
# virsh vcpucount virt-tests-vm1 --guest
# virsh setvcpus virt-tests-vm1 8 --live

Comment 6 Peter Krempa 2017-03-02 12:14:55 UTC
The logs you've provided are from the libvirt client library, not from libvirtd, the daemon.

I specifically asked for the daemon logs since the client logs are pretty useless here. Please attach the proper log from the daemon.

Comment 7 Satheesh Rajendran 2017-03-02 13:41:09 UTC
Created attachment 1259169 [details]
libvirt daemon log

The logs were captured during the following command run:
#virsh setvcpus virt-tests-vm1 8 --live
error: operation failed: vcpu unplug request timed out

Comment 8 Peter Krempa 2017-03-02 14:45:58 UTC
That's unfortunately not enough. The code behaves exactly as it should during the first failure.

The guest OS did not allow the vCPUs to be unplugged, and thus libvirt reported "vcpu unplug request timed out". This is expected.

The issue with hotplug fully breaking after that is not captured in this log. Please reproduce the whole sequence and post logs covering everything.

Comment 9 Satheesh Rajendran 2017-03-03 04:06:27 UTC
Created attachment 1259407 [details]
libvirt daemon log

Attached the libvirtd daemon log for the following series of events:
# virsh vcpucount virt-tests-vm1 --guest
8
# virsh setvcpus virt-tests-vm1 16 --live

# virsh vcpucount virt-tests-vm1 --guest
16

# virsh setvcpus virt-tests-vm1 8 --live
error: operation failed: vcpu unplug request timed out

# virsh vcpucount virt-tests-vm1 --guest
8

# virsh setvcpus virt-tests-vm1 16 --live

# virsh vcpucount virt-tests-vm1 --guest
8

# virsh vcpucount virt-tests-vm1 --guest
8

# virsh setvcpus virt-tests-vm1 8 --live
error: internal error: unable to execute QEMU command 'device_del': Device 'vcpu8' not found

# virsh vcpucount virt-tests-vm1 --guest
8

# virsh setvcpus virt-tests-vm1 16 --live

# echo $?
0
# virsh vcpucount virt-tests-vm1 --guest
8

# virsh setvcpus virt-tests-vm1 8 --live
error: internal error: unable to execute QEMU command 'device_del': Device 'vcpu8' not found

Comment 10 Peter Krempa 2017-03-10 08:13:21 UTC
Fixed upstream:

commit 8af68ea47830b8d32907dc50c6ca4869d14bb862
Author: Peter Krempa <pkrempa>
Date:   Fri Mar 3 16:04:57 2017 +0100

    qemu: hotplug: Reset device removal waiting code after vCPU unplug
    
    If the delivery of the DEVICE_DELETED event for the vCPU being deleted
    would time out, the code would not call 'qemuDomainResetDeviceRemoval'.
    
    Since the waiting thread did not unregister itself prior to stopping the
    waiting the monitor code would try to wake it up instead of dispatching
    it to the event worker. As a result the unplug process would not be
    completed and the definition would not be updated.
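
To check whether a given checkout already contains this fix, one way is to test commit ancestry inside the libvirt source tree (a sketch using the commit hash above):

# git merge-base --is-ancestor 8af68ea47830b8d32907dc50c6ca4869d14bb862 HEAD && echo "fix present"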

Comment 11 Satheesh Rajendran 2017-03-21 06:14:54 UTC
I am able to reproduce the issue with this patched libvirt as well; am I missing something else?

libvirt compiled at the following commit:
commit a6d681485ff85e27859583a5c20e1630c5cf8352
Author: John Ferlan <jferlan>
Date:   Tue Mar 7 16:10:38 2017 -0500



Using an F25 guest with an upstream kernel.
# uname -a
Linux atest-guest 4.11.0-rc3 #1 SMP Mon Mar 20 08:59:24 EDT 2017 ppc64le ppc64le ppc64le GNU/Linux



# virsh setvcpus vm1 255 --live; echo $?; virsh vcpucount vm1 --guest

0
154

# virsh vcpucount vm1 --guest
255


# virsh setvcpus vm1 1 --live; echo $?; virsh vcpucount vm1 --guest
error: operation failed: vcpu unplug request timed out

1
error: Guest agent is not responding: Guest agent not available for now

# virsh vcpucount vm1 --guest
254


Error log from the libvirtd log:
2017-03-21 06:06:44.451+0000: 163682: error : qemuDomainHotplugDelVcpu:5383 : operation failed: vcpu unplug request timed out
2017-03-21 06:06:49.481+0000: 163684: error : qemuAgentSend:915 : Guest agent is not responding: Guest agent not available for now
2017-03-21 06:08:14.691+0000: 163686: error : qemuDomainHotplugDelVcpu:5383 : operation failed: vcpu unplug request timed out

Let me know in case you need a log with debug enabled.

Comment 12 Peter Krempa 2017-03-21 07:28:22 UTC
The operation may still time out. My patches fixed the problem so that once the operation actually finishes in the guest (after the timeout was reported) and the hypervisor is able to detach the vCPUs, the libvirt data structures are updated properly. Previously they were not, and thus any further CPU-unplug attempt failed.

Since the guest may stall for a nondeterministic amount of time before it actually unplugs the vCPUs, the timeout code still needs to stay in place.
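
Given that, callers can treat the timeout as advisory and simply re-check the live vCPU count once the guest has had time to finish; a minimal sketch, assuming an arbitrary 60-second grace period:

# virsh setvcpus virt-tests-vm1 8 --live || { sleep 60; virsh vcpucount virt-tests-vm1 --live; }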

