Description of problem: CPU Hot unplug throws "error: operation failed: vcpu unplug request timed out" during the first iteration of hotplug/unplug operation and leaves cpu hotplug/unplug not functional. Version-Release number of selected component (if applicable): Upstream libvirt compiled at Repo: https://github.com/libvirt/libvirt.git Branch: master Commit: ca1f38545750d597c75c9773723c716483b03e5c #libvirtd --version libvirtd (libvirt) 3.1.0 Upstream qemu compiled at Repo: git://git.qemu.org/qemu.git Branch: master Commit: e7c83a885f865128ae3cf1946f8cb538b63cbfba # qemu-kvm -version QEMU emulator version 2.8.50 (v2.8.0-1684-ge7c83a8-dirty) Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers How reproducible: Always Steps to Reproduce: 1. Define a guest(Fedora25) with current 8 vcpus and 240 max vcpus with following topology ... <vcpu placement='static' current='8'>240</vcpu> ... <cpu> <topology sockets='2' cores='15' threads='8'/> </cpu> ... 2. virsh setvcpus virt-tests-vm1 16 --live 3. wait some time then, virsh vcpucount virt-tests-vm1 --guest 16 -----OK 4. virsh setvcpus virt-tests-vm1 8 --live error: operation failed: vcpu unplug request timed out --------NOK 5. wait some time then, virsh vcpucount virt-tests-vm1 --guest 8 6. virsh setvcpus virt-tests-vm1 16 --live 7. wait some time then, virsh vcpucount virt-tests-vm1 --guest 16 -----NOK 8. virsh setvcpus virt-tests-vm1 8 --live error: internal error: unable to execute QEMU command 'device_del': Device 'vcpu8' not found--------NOK 9. wait some time then, virsh vcpucount virt-tests-vm1 --guest 8 Actual results: Cpu unplug errors out leaving hotplug/unplug operation not functional. Expected results: Cpu hotplug/unplug should work fine Additional info: Same behavior is observed while using setvcpu API also!!. Guest XML: <domain type='kvm' id='5'> <name>virt-tests-vm1</name> <uuid>496d2299-4a36-4403-a583-6e07bec87070</uuid> <memory unit='KiB'>33554432</memory> <currentMemory unit='KiB'>33554432</currentMemory> <vcpu placement='static' current='8'>240</vcpu> <resource> <partition>/machine</partition> </resource> <os> <type arch='ppc64le' machine='pseries-2.9'>hvm</type> <boot dev='hd'/> </os> <cpu> <topology sockets='2' cores='15' threads='8'/> </cpu> <clock offset='utc'/> <on_poweroff>destroy</on_poweroff> <on_reboot>restart</on_reboot> <on_crash>restart</on_crash> <devices> <emulator>/usr/bin/qemu-kvm</emulator> <disk type='file' device='disk'> <driver name='qemu' type='qcow2'/> <source file='/home/sath/runAvocadoFVTTest/avocado-fvt-wrapper/data/avocado-vt/images/f25-ppc64le.qcow2'/> <backingStore/> <target dev='sda' bus='scsi'/> <alias name='scsi0-0-0-0'/> <address type='drive' controller='0' bus='0' target='0' unit='0'/> </disk> <controller type='scsi' index='0' model='virtio-scsi'> <alias name='scsi0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/> </controller> <controller type='usb' index='0' model='pci-ohci'> <alias name='usb'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/> </controller> <controller type='pci' index='0' model='pci-root'> <alias name='pci.0'/> </controller> <controller type='virtio-serial' index='0'> <alias name='virtio-serial0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x05' function='0x0'/> </controller> <interface type='bridge'> <mac address='52:54:00:61:62:63'/> <source bridge='virbr0'/> <target dev='vnet0'/> <model type='virtio'/> <alias name='net0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x0'/> </interface> <serial type='pty'> <source path='/dev/pts/0'/> <target port='0'/> <alias name='serial0'/> <address type='spapr-vio' reg='0x30000000'/> </serial> <console type='pty' tty='/dev/pts/0'> <source path='/dev/pts/0'/> <target type='serial' port='0'/> <alias name='serial0'/> <address type='spapr-vio' reg='0x30000000'/> </console> <channel type='unix'> <source mode='bind' path='/var/lib/libvirt/qemu/f25ppc64le.agent'/> <target type='virtio' name='org.qemu.guest_agent.0' state='connected'/> <alias name='channel0'/> <address type='virtio-serial' controller='0' bus='0' port='1'/> </channel> <memballoon model='virtio'> <alias name='balloon0'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/> </memballoon> <panic model='pseries'/> </devices> <seclabel type='dynamic' model='selinux' relabel='yes'> <label>system_u:system_r:svirt_t:s0:c62,c845</label> <imagelabel>system_u:object_r:svirt_image_t:s0:c62,c845</imagelabel> </seclabel> <seclabel type='dynamic' model='dac' relabel='yes'> <label>+107:+107</label> <imagelabel>+107:+107</imagelabel> </seclabel> </domain>
It is reproducible always #for i in {1..10};do virsh destroy virt-tests-vm1;virsh start virt-tests-vm1;sleep 120;virsh setvcpus virt-tests-vm1 16 --live;sleep 120;virsh vcpucount virt-tests-vm1 --guest;virsh setvcpus virt-tests-vm1 8 --live;sleep 120;virsh vcpucount virt-tests-vm1 --guest;done Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8 Domain virt-tests-vm1 destroyed Domain virt-tests-vm1 started 16 error: operation failed: vcpu unplug request timed out 8
Got it pasted twice, ignore one of the previous comment
Please attach the debug log of libvirtd while this bug reproduces.
Created attachment 1259113 [details] libvirtdlog Attached libvirtd debug log for the below commands # virsh setvcpus virt-tests-vm1 16 --live # virsh vcpucount virt-tests-vm1 --guest # virsh setvcpus virt-tests-vm1 8 --live
The logs you've provided are from the libvirt client library, not libvirtd - the daemon. I specifically asked for the daemon logs since the client logs are pretty useless. Please attach the proper log from the daemon.
Created attachment 1259169 [details] libvirt daemon log logs were during the below command run #virsh setvcpus virt-tests-vm1 8 --live error: operation failed: vcpu unplug request timed out
That's unfortunately not enough. That code behaves exactly as it should during the first failure. The guest OS did not allow to unplug the vcpus and thus libvirt reported "vcpu unplug request timed out". This is expected. The issue with hotplug fully breaking after that is not captured in this log. Please fully reproduce it and post logs with everything.
Created attachment 1259407 [details] libvirt daemon log Attached libvirtd daemon log for the following series of events # virsh vcpucount virt-tests-vm1 --guest 8 # virsh setvcpus virt-tests-vm1 16 --live # virsh vcpucount virt-tests-vm1 --guest 16 # virsh setvcpus virt-tests-vm1 8 --live error: operation failed: vcpu unplug request timed out # virsh vcpucount virt-tests-vm1 --guest 8 # virsh setvcpus virt-tests-vm1 16 --live # virsh vcpucount virt-tests-vm1 --guest 8 # virsh vcpucount virt-tests-vm1 --guest 8 # virsh setvcpus virt-tests-vm1 8 --live error: internal error: unable to execute QEMU command 'device_del': Device 'vcpu8' not found # virsh vcpucount virt-tests-vm1 --guest 8 # virsh setvcpus virt-tests-vm1 16 --live # echo $? 0 # virsh vcpucount virt-tests-vm1 --guest 8 # virsh setvcpus virt-tests-vm1 8 --live error: internal error: unable to execute QEMU command 'device_del': Device 'vcpu8' not found
Fixed upstream: commit 8af68ea47830b8d32907dc50c6ca4869d14bb862 Author: Peter Krempa <pkrempa> Date: Fri Mar 3 16:04:57 2017 +0100 qemu: hotplug: Reset device removal waiting code after vCPU unplug If the delivery of the DEVICE_DELETED event for the vCPU being deleted would time out, the code would not call 'qemuDomainResetDeviceRemoval'. Since the waiting thread did not unregister itself prior to stopping the waiting the monitor code would try to wake it up instead of dispatching it to the event worker. As a result the unplug process would not be completed and the definition would not be updated.
Am able to reproduce the issue with this patched libvirt aswell, am I missing something else? libvirt compiled at below commit commit a6d681485ff85e27859583a5c20e1630c5cf8352 Author: John Ferlan <jferlan> Date: Tue Mar 7 16:10:38 2017 -0500 Using F25 guest with upstream kernel. # uname -a Linux atest-guest 4.11.0-rc3 #1 SMP Mon Mar 20 08:59:24 EDT 2017 ppc64le ppc64le ppc64le GNU/Linux #virsh setvcpus vm1 255 --live;echo $?;virsh vcpucount vm1 --guest 0 154 # virsh vcpucount vm1 --guest 255 #virsh setvcpus vm1 1 --live;echo $?;virsh vcpucount vm1 --guest error: operation failed: vcpu unplug request timed out 1 error: Guest agent is not responding: Guest agent not available for now # virsh vcpucount vm1 --guest 254 error log from libvirtd log 2017-03-21 06:06:44.451+0000: 163682: error : qemuDomainHotplugDelVcpu:5383 : operation failed: vcpu unplug request timed out 2017-03-21 06:06:49.481+0000: 163684: error : qemuAgentSend:915 : Guest agent is not responding: Guest agent not available for now 2017-03-21 06:08:14.691+0000: 163686: error : qemuDomainHotplugDelVcpu:5383 : operation failed: vcpu unplug request timed out Let me know incase you need a log with debug enabled?
The operation may still time out. My patches fixed the problem that once the operation actually finishes in the guest (after the timeout was reported) and the hypervisor is able to detach the vCPUS the libvirt data structures are updated properly which they did not previously and thus any further cpu-unplug attempt failed. Since the guest may stall for an undeterministic amount of time until it actually unplugs the vcpus the timeout code still needs to stay in place.