Bug 1453167
Summary: [PPC] [Hot unplug CPU] Failed to hot unplug after migration

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.3
Hardware: ppc64le
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Israel Pinto <ipinto>
Assignee: Laurent Vivier <lvivier>
QA Contact: Xujun Ma <xuma>
CC: bugproxy, dgibson, hannsj_uhl, imammedo, ipinto, knoel, michen, mrezanin, pkrempa, qzhang, rbalakri, thuth, virt-maint, xuma
Target Milestone: rc
Target Release: 7.5
Keywords: Patch
Fixed In Version: qemu-kvm-rhev-2.10.0-1.el7
Doc Type: If docs needed, set a value
Last Closed: 2018-04-11 00:19:31 UTC
Type: Bug
Bug Depends On: 1493086
Bug Blocks: 1444027, 1473046
Description
Israel Pinto
2017-05-22 09:54:44 UTC
Created attachment 1280979 [details]
vdsm_log
Created attachment 1280980 [details]
vm_qemu
Created attachment 1280982 [details]
libvirt
Created attachment 1280994 [details]
libvirt_2
This is an expected error message: if the guest OS does not unplug the vCPU within a grace period, the API call considers the unplug failed. The guest OS may still unplug the vCPU later; libvirt may add an event to notify the management app that the vCPU unplug completed at a later time. Other than that, the behavior is expected from libvirt's point of view: if the unplug does not succeed in a given time, we give up, since there is no way to know whether the OS will ever unplug the vCPU. Reassigning to qemu, to investigate why this happens deterministically after migration.

(In reply to Peter Krempa from comment #7)
> Reassigning to qemu, to investigate why that happens to qemu after migration
> deterministically.

Could you provide the QEMU command line for the source and the destination QEMU, and the commands used to plug and unplug the CPU? Thanks.

BTW, it really looks like: https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1677552

And there is a similar BZ open on hotplugged memory migration: BZ 1448810

I've just merged a handful of fixes upstream for hotplug/migration interactions. There are still pieces missing, but what we have already should help some cases, and might be enough to deal with the case here. The patches haven't reached mainline yet, but I've made a preliminary build at: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13272438 Could you see if those address the problem reported here? If so, I'll do a clean backport once they're fully merged upstream.

Hi David, I took the rpm from https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=13272443 (qemu-kvm-tools-rhev-2.9.0-6.el7.dwg201705251515.ppc64le) and installed it. I reran the test and it failed. Adding logs.

Sorry for taking so long to extract the info. When attaching logs, please make sure you attach only the relevant portion. Going through a 4 GiB log file spanning 3 days only to look for data from a span of 1 minute is definitely not fun.
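The grace-period behavior described above (give up on the unplug after a timeout, since the guest may never acknowledge it) can be sketched as a wait-for-event loop. This is an illustrative model, not libvirt's actual implementation; the function name, the polling interface, and the 5-second default are assumptions for the example.

```python
import time

def wait_for_device_deleted(poll_event, device_id, timeout=5.0):
    """Poll an event source until a DEVICE_DELETED event arrives for
    device_id, giving up after `timeout` seconds (libvirt applies a
    similar grace period to vCPU unplug; 5 s here is illustrative)."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        event = poll_event()  # returns a QMP-style event dict, or None
        if (event and event.get("event") == "DEVICE_DELETED"
                and event.get("data", {}).get("device") == device_id):
            return True
        time.sleep(0.01)
    return False  # timed out; the guest may still unplug the vCPU later

# Simulated event stream: the guest "acknowledges" the unplug on the 3rd poll.
events = iter([None, None,
               {"event": "DEVICE_DELETED", "data": {"device": "vcpu1"}}])
ok = wait_for_device_deleted(lambda: next(events, None), "vcpu1", timeout=1.0)
```

The key point the model captures is that a `False` return is ambiguous: it cannot distinguish "the guest will never unplug" from "the DEVICE_DELETED event is merely late", which is exactly the uncertainty noted in the log analysis below.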
Anyway, the relevant portion of the VM log file (including the command line) is:

2017-05-19 21:13:54.599+0000: starting up libvirt version: 2.0.0, package: 10.el7_3.9 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-05-04-06:49:32, ppc-027.build.eng.bos.redhat.com), qemu version: 2.6.0 (qemu-kvm-rhev-2.6.0-28.el7_3.9), hostname: ibm-p8-rhevm-04.rhts.eng.bos.redhat.com

LC_ALL=C PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin QEMU_AUDIO_DRV=none /usr/libexec/qemu-kvm -name guest=cpu_hotplug_vm,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-17-cpu_hotplug_vm/master-key.aes -machine pseries-rhel7.3.0,accel=kvm,usb=off -cpu POWER8 -m size=1048576k,slots=16,maxmem=4194304k -realtime mlock=off -smp 1,maxcpus=16,sockets=16,cores=1,threads=1 -numa node,nodeid=0,cpus=0,mem=1024 -uuid b7e2277f-73f7-48fa-bf0d-1b50fb9abf2e -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-17-cpu_hotplug_vm/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=2017-05-19T21:13:53,driftfix=slew -no-shutdown -boot strict=on -device nec-usb-xhci,id=usb,bus=pci.0,addr=0x3 -device spapr-vscsi,id=scsi0,reg=0x2000 -device virtio-scsi-pci,id=scsi1,bus=pci.0,addr=0x2 -device virtio-serial-pci,id=virtio-serial0,max_ports=16,bus=pci.0,addr=0x4 -drive if=none,id=drive-scsi0-0-0-0,readonly=on -device scsi-cd,bus=scsi0.0,channel=0,scsi-id=0,lun=0,drive=drive-scsi0-0-0-0,id=scsi0-0-0-0 -drive file=/rhev/data-center/c546b026-4632-4450-8e9b-17dc879b62a4/eccfa045-b125-4cec-a0a5-e40988a24cd7/images/e7adb2f8-ecc1-45e5-8a67-b442e521c397/1170b54d-74c7-4b00-a814-0c91875c8c7b,format=qcow2,if=none,id=drive-virtio-disk0,serial=e7adb2f8-ecc1-45e5-8a67-b442e521c397,cache=none,werror=stop,rerror=stop,aio=threads -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=30,id=hostnet0,vhost=on,vhostfd=32 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:1a:4a:16:01:58,bus=pci.0,addr=0x1 -chardev socket,id=charchannel0,path=/var/lib/libvirt/qemu/channels/b7e2277f-73f7-48fa-bf0d-1b50fb9abf2e.com.redhat.rhevm.vdsm,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=com.redhat.rhevm.vdsm -chardev socket,id=charchannel1,path=/var/lib/libvirt/qemu/channels/b7e2277f-73f7-48fa-bf0d-1b50fb9abf2e.org.qemu.guest_agent.0,server,nowait -device virtserialport,bus=virtio-serial0.0,nr=2,chardev=charchannel1,id=channel1,name=org.qemu.guest_agent.0 -device usb-tablet,id=input0,bus=usb.0,port=1 -device usb-kbd,id=input1,bus=usb.0,port=2 -device usb-mouse,id=input2,bus=usb.0,port=3 -vnc 10.16.69.168:0,password -k en-us -device VGA,id=video0,vgamem_mb=16,bus=pci.0,addr=0x8 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -object rng-random,id=objrng0,filename=/dev/urandom -device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.0,addr=0x7 -msg timestamp=on

2017-05-19 21:13:54.604+0000: 92874: info : libvirt version: 2.0.0, package: 10.el7_3.9 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2017-05-04-06:49:32, ppc-027.build.eng.bos.redhat.com)
2017-05-19 21:13:54.604+0000: 92874: info : hostname: ibm-p8-rhevm-04.rhts.eng.bos.redhat.com
2017-05-19 21:13:54.604+0000: 92874: info : virObjectUnref:259 : OBJECT_UNREF: obj=0x3fff941b2d00
2017-05-19T21:13:54.702921Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
2017-05-19T21:13:54.703125Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config
2017-05-19T21:14:31.063848Z qemu-kvm: terminating on signal 15 from pid 6385

According to the log, the control flow was as usual:

1) libvirtd started the qemu process for incoming migration (stopped vcpus) with only a subset of vcpus active.

2) libvirtd issued query-hotpluggable-cpus and query-cpus to figure out what to pass to device_add to hotplug all relevant cpus:

2017-05-19 21:13:54.748+0000: 6385: info : qemuMonitorJSONIOProcessLine:211 : QEMU_MONITOR_RECV_REPLY: mon=0x3fff74027000 reply={"return": [{"props": {"core-id": 15}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 14}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 13}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 12}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 11}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 10}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 9}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 8}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 7}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 6}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 5}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 4}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 3}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 2}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 1}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 0}, "vcpus-count": 1, "qom-path": "/machine/unattached/device[1]", "type": "POWER8-spapr-cpu-core"}], "id": "libvirt-4"}
2017-05-19 21:13:54.749+0000: 6385: info : qemuMonitorJSONIOProcessLine:211 : QEMU_MONITOR_RECV_REPLY: mon=0x3fff74027000 reply={"return": [{"arch": "ppc", "current": true, "CPU": 0, "nip": 256, "qom_path": "/machine/unattached/device[1]/thread[0]", "halted": false, "thread_id": 92883}], "id": "libvirt-5"}

3) libvirtd then hotplugged vcpus:

2017-05-19 21:13:54.749+0000: 6385: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x3fff74027000 buf={"execute":"device_add","arguments":{"driver":"POWER8-spapr-cpu-core","id":"vcpu1","core-id":1},"id":"libvirt-6"}

4) and queried again:

2017-05-19 21:13:54.751+0000: 6385: info : qemuMonitorJSONIOProcessLine:211 : QEMU_MONITOR_RECV_REPLY: mon=0x3fff74027000 reply={"return": [{"props": {"core-id": 15}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 14}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 13}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 12}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 11}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 10}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 9}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 8}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 7}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 6}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 5}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 4}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 3}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 2}, "vcpus-count": 1, "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 1}, "vcpus-count": 1, "qom-path": "/machine/peripheral/vcpu1", "type": "POWER8-spapr-cpu-core"}, {"props": {"core-id": 0}, "vcpus-count": 1, "qom-path": "/machine/unattached/device[1]", "type": "POWER8-spapr-cpu-core"}], "id": "libvirt-7"}
2017-05-19 21:13:54.752+0000: 6385: info : qemuMonitorJSONIOProcessLine:211 : QEMU_MONITOR_RECV_REPLY: mon=0x3fff74027000 reply={"return": [{"arch": "ppc", "current": true, "CPU": 0, "nip": 256, "qom_path": "/machine/unattached/device[1]/thread[0]", "halted": false, "thread_id": 92883}, {"arch": "ppc", "current": false, "CPU": 1, "nip": 0, "qom_path": "/machine/peripheral/vcpu1/thread[0]", "halted": true, "thread_id": 92885}], "id": "libvirt-8"}

5) VM startup (for migration) then continued and migration was invoked:

2017-05-19 21:13:54.767+0000: 6385: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x3fff74027000 buf={"execute":"migrate-incoming","arguments":{"uri":"tcp:0.0.0.0:49152"},"id":"libvirt-25"}

6) The VM was resumed after migration:

2017-05-19 21:13:58.447+0000: 6385: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x3fff74027000 buf={"execute":"cont","id":"libvirt-27"}
2017-05-19 21:13:58.457+0000: 6385: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x3fff74027000 buf={"timestamp": {"seconds": 1495228438, "microseconds": 457423}, "event": "RESUME"}

7) Later, a request to delete 'vcpu1' was issued:

2017-05-19 21:14:24.795+0000: 6554: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x3fff74027000 msg={"execute":"device_del","arguments":{"id":"vcpu1"},"id":"libvirt-35"}

8) Waiting for the vcpu to be deleted timed out:

2017-05-19 21:14:29.797+0000: 6554: error : qemuDomainHotplugDelVcpu:4790 : operation failed: vcpu unplug request timed out

9) The VM received a shutdown event:

2017-05-19 21:14:31.064+0000: 6385: info : qemuMonitorJSONIOProcessLine:206 : QEMU_MONITOR_RECV_EVENT: mon=0x3fff74027000 event={"timestamp": {"seconds": 1495228471, "microseconds": 64071}, "event": "SHUTDOWN"}

The shutdown event was dispatched as libvirtd killed the qemu process. This means the destroy API was probably invoked, but the log does not contain that information.

The above data conform to what libvirtd is designed to do in this case: it performs explicit vcpu hotplug through the monitor, since it is impossible to query the qemu process for the data necessary to plug in the vcpus prior to starting it. Since the VM was destroyed, I can't conclude whether the DEVICE_DELETED event was just delayed long enough, or the guest simply never unplugged the vcpu at all.

Reproduce the issue:

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.6.0-28.el7_3.9.ppc64le
host: kernel-3.10.0-663.el7.ppc64le
guest: kernel-3.10.0-663.el7.ppc64le

Steps to Reproduce:
1. Boot up the guest with the command:
/usr/libexec/qemu-kvm \
-m 8192 \
-smp 1,maxcpus=2,cores=2,threads=1,sockets=1 \
-vnc :0 \
-vga std \
-monitor stdio \
-device virtio-scsi-pci,bus=pci.0,addr=0x2,id=scs \
-device scsi-hd,id=scsi-hd0,drive=scsi-hd-dr0,bootindex=1 \
-drive file=rhel74-ppc64le-virtio-scsi.qcow2,if=none,id=scsi-hd-dr0,format=qcow2,cache=none \
2. Hot plug a cpu core:
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
3. Boot up the destination guest with the additional option, then hotplug a cpu core:
-incoming tcp:0:7777
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
4. Migrate the guest from the source end:
(qemu) migrate -d tcp:0:7777
5. Hot unplug the cpu core from the destination hmp after migration:
(qemu) device_del core1
6. Check the cpu core in the destination hmp and in the guest:
(qemu) info hotpluggable-cpus
(qemu) info cpus
(guest)# cat /proc/cpuinfo

Actual results:
The unplugged cpu core1 still exists; hot unplugging failed.

I also tried it with the latest qemu-kvm-rhev-2.9.0-6.el7.ppc64le; it has this issue too.
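The monitor exchange analyzed above (query, device_add, device_del) follows the standard QMP wire format. As a sketch, the commands libvirt sent can be rebuilt client-side like this; the helper function is illustrative, not libvirt source code, though the command names, arguments, and ids mirror the log.

```python
import json

def qmp_cmd(execute, arguments=None, cmd_id=None):
    """Serialize a QMP command dict to the JSON wire form seen in the log."""
    cmd = {"execute": execute}
    if arguments is not None:
        cmd["arguments"] = arguments
    if cmd_id is not None:
        cmd["id"] = cmd_id
    return json.dumps(cmd)

# Step 2: discover which cores are pluggable and what to pass to device_add.
query = qmp_cmd("query-hotpluggable-cpus", cmd_id="libvirt-4")

# Step 3: hotplug the vcpu core (as in the libvirt-6 write in the log).
plug = qmp_cmd("device_add",
               {"driver": "POWER8-spapr-cpu-core", "id": "vcpu1", "core-id": 1},
               "libvirt-6")

# Step 7: request the unplug (as in the libvirt-35 message in the log).
unplug = qmp_cmd("device_del", {"id": "vcpu1"}, "libvirt-35")
```

In a real session these strings would be written to the QMP monitor socket, and success of the unplug would be signaled asynchronously by a DEVICE_DELETED event rather than by the device_del reply itself, which is why libvirt needs the timeout discussed earlier.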
Laurent, can you test and see if this is reproducible with upstream ppc-for-2.10?

(In reply to David Gibson from comment #17)
> Laurent, can you test and see if this is reproducible with upstream
> ppc-for-2.10?

I've tested ppc-for-2.10-20170525, and the problem is fixed. But I think it doesn't fix the origin of the problem, only the consequence: the origin of the problem is that the hotplugged device is not in a correct state when it is hotplugged into a guest in the "incoming" state; ppc-for-2.10 fixes the consequence by migrating the internal state from the source.

We have this because of a semantic divergence between QEMU and SPAPR:
- for QEMU, a hotplugged device is a device plugged using HMP/QMP (not from the CLI), even if the VM is not started (inmigrate state);
- for SPAPR, a hotplugged device is a device plugged while the machine is running, and in that case QEMU doesn't update the internal state but relies on the guest OS for that part.

So a device hotplugged in the inmigrate state should be managed like a coldplugged one by the SPAPR part of QEMU. I'm going to propose a patch upstream to see if we can fix that too anyway.

I've tested with qemu-kvm-rhev-2.6.0-28.el7_3.10 and it doesn't work: it's not a regression.
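The semantic divergence described above can be captured in a small toy model. This is purely illustrative Python (not QEMU code, and the class and attribute names are invented): the point is that a device added via the monitor while the machine is still in the inmigrate state must take the coldplug path, because no guest OS is running yet to complete the hotplug handshake.

```python
# Toy model of the QEMU/SPAPR hotplug semantics discussed above.
# "inmigrate" is the run state of an incoming-migration QEMU started
# with -incoming before 'cont' is issued.
class Machine:
    def __init__(self):
        self.state = "inmigrate"  # destination side starts stopped
        self.devices = {}

    def device_add(self, dev_id):
        # SPAPR view: only a device plugged while the guest actually runs
        # is a true hotplug (the guest OS completes the handshake);
        # anything plugged earlier must behave like coldplug.
        treat_as = "hotplug" if self.state == "running" else "coldplug"
        self.devices[dev_id] = treat_as

m = Machine()
m.device_add("vcpu1")   # plugged before 'cont' -> coldplug semantics
m.state = "running"
m.device_add("vcpu2")   # plugged while running -> real hotplug
```

The bug arose because the QEMU definition ("plugged via HMP/QMP") classified vcpu1 as hotplugged, leaving it in a state the guest never finished initializing, so the later device_del could not complete.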
According to comment #18, the fix is part of qemu-2.10 [1].

[1] the series has been applied to master: https://lists.gnu.org/archive/html/qemu-devel/2017-05/msg06569.html

Test the issue on the old version:

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.9.0-16.el7_4.12.ppc64le

Steps to Reproduce:
1. Boot up the guest with the command:
/usr/libexec/qemu-kvm \
-device virtio-scsi-pci,id=virtio_scsi_pci0,bus=pci.0,addr=0x7 \
-drive id=drive_image1,if=none,snapshot=off,aio=native,cache=none,format=qcow2,file=ALT-Server-7.5-ppc64le-virtio-scsi.qcow2 \
-device scsi-hd,id=image1,drive=drive_image1 \
-m 8192,slots=2,maxmem=16G \
-chardev stdio,mux=on,id=serial_id_serial0,server,nowait,signal=off \
-device spapr-vty,id=serial111,chardev=serial_id_serial0 \
-nodefaults \
-device virtio-net-pci,mac=9a:79:7a:7b:7c:7d,id=idnW3nZi,vectors=4,netdev=idXOO0Sj,bus=pci.0,addr=0x8 \
-netdev tap,id=idXOO0Sj,vhost=on \
-vga std \
-vnc :4 \
-smp cpus=1,maxcpus=4,sockets=1,cores=1,threads=1 \
-mon chardev=serial_id_serial0,mode=readline \
2. Hot plug a cpu core:
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
3. Boot up the destination guest with the additional option, then hotplug a cpu core:
-incoming tcp:0:7777
(qemu) device_add host-spapr-cpu-core,core-id=1,id=core1
4. Migrate the guest from the source end:
(qemu) migrate -d tcp:0:7777
5. Hot unplug the cpu core from the destination hmp after migration:
(qemu) device_del core1
6. Check the cpu core in the destination hmp and in the guest:
(qemu) info hotpluggable-cpus
(qemu) info cpus
(guest)# cat /proc/cpuinfo

Actual results:
Failed to hot unplug the cpu core on the destination end.

Verified the issue on the latest build:

Version-Release number of selected component (if applicable):
qemu-kvm-rhev-2.10.0-6.el7.ppc64le

Steps to Reproduce:
1. The same steps as above.

Actual results:
Hot unplugging the cpu core on the destination end succeeded. Set status to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:1104