Red Hat Bugzilla – Bug 1320447
[RFE] Report memory hotunplug failure
Last modified: 2016-11-03 14:40:07 EDT
When trying to detach a hotplugged memory device (and possibly other devices) via `virsh detach-device' (or virDomainDetachDeviceFlags), libvirt doesn't report failure of the action. It always returns success, possibly after a timeout, and emits a corresponding event only if the device removal succeeds. While this is a documented feature, it has some important drawbacks:

- If device detach fails quickly, libvirt unnecessarily waits for 5 seconds before it times out and returns from the virDomainDetachDeviceFlags call.
- After the virDomainDetachDeviceFlags call finishes, the caller can check for success (either by watching for the VIR_DOMAIN_EVENT_ID_DEVICE_REMOVED event or by checking the domain XML) but can't distinguish between a failure and a pending request. If the device removal event has not been received and the device is still present in the domain XML, the operation may either still be in progress or have already failed. So the caller is uncertain about the result.

QEMU already emits an event on memory hotunplug failure, for example:

  event ACPI_DEVICE_OST at 1458308024.447310 for domain centos: {"info":{"device":"dimm0","source":3,"status":132,"slot":"0","slot-type":"DIMM"}}
    ... action start
  event ACPI_DEVICE_OST at 1458308024.461017 for domain centos: {"info":{"device":"dimm0","source":3,"status":1,"slot":"0","slot-type":"DIMM"}}
    ... action failure

A suggested improvement is to watch for the failure event from QEMU and react accordingly in libvirt, that is:

- Returning from virDomainDetachDeviceFlags immediately not only after receiving a QEMU success event but also after receiving a QEMU failure event.
- Returning failure (instead of success) if device removal fails before the call times out.
- Emitting a newly introduced libvirt event on device removal failure.
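For illustration, the two ACPI_DEVICE_OST payloads quoted above can be decoded as follows. This is only a sketch: the status meanings are taken from the "action start" and "action failure" annotations in this report, `describe_ost_event` is a hypothetical helper name, and any other status value is reported as "unknown" rather than guessed at.

```python
import json

# Status values as annotated in this report; other values are not interpreted.
STATUS_MEANING = {
    132: "in progress",  # annotated "action start" above
    1: "failure",        # annotated "action failure" above
}


def describe_ost_event(payload):
    """Return (device alias, meaning) for one ACPI_DEVICE_OST payload string."""
    info = json.loads(payload)["info"]
    return info["device"], STATUS_MEANING.get(info["status"], "unknown")


# The two payloads quoted in the description.
events = [
    '{"info":{"device":"dimm0","source":3,"status":132,"slot":"0","slot-type":"DIMM"}}',
    '{"info":{"device":"dimm0","source":3,"status":1,"slot":"0","slot-type":"DIMM"}}',
]

results = [describe_ost_event(e) for e in events]
for device, meaning in results:
    print(device, meaning)
```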
The suggested change would solve both drawbacks described above, and the result of the operation would be clear (immediately, in common cases) after returning from virDomainDetachDeviceFlags:

- If failure is returned, hotunplug failed.
- If success is returned and the device is no longer present in the domain XML, or the device removal success event has been received, hotunplug was successful.
- Otherwise the operation is still pending and the caller should watch for the corresponding events to learn the final result.
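From the caller's side, the three outcomes listed above amount to a small decision function. The sketch below is only an illustration of that logic; `classify_detach` and its boolean parameters are hypothetical names, and how the caller obtains each input (API return value, domain XML check, event callback) is left out.

```python
def classify_detach(api_failed, device_in_xml, removed_event_seen):
    """Map the caller's observations to the outcome described in this report.

    api_failed         -- virDomainDetachDeviceFlags returned failure
    device_in_xml      -- the device is still present in the domain XML
    removed_event_seen -- the device removal success event was received
    """
    if api_failed:
        return "failed"
    if removed_event_seen or not device_in_xml:
        return "succeeded"
    # Success returned, device still present, no event yet: keep watching events.
    return "pending"


print(classify_detach(api_failed=False, device_in_xml=True, removed_event_seen=False))
```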
The functionality was added upstream by:

commit 0ad64e20d8f8ce49645b3147ab3bcbf2ae5de41a
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Fri Apr 1 17:48:20 2016 +0200

    qemu: process: Wire up ACPI OST events to notify users of failed memory unplug

    Since qemu is now able to notify us that the guest rejected the memory
    unplug operation we can relay this to the user and make the API fail
    right away.

    Additionally document the possible values from the ACPI docs for future
    reference.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1320447

commit 650e8d2c590260dabc1f7426565792da4ccb74ab
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Fri Apr 1 16:41:08 2016 +0200

    qemu: monitor: Add support for ACPI_DEVICE_OST event handling

    The event is emitted on ACPI OSPM Status Indication events.

    ACPI standard documentation describes the method as:

    This object is an optional control method that is invoked by OSPM to
    indicate processing status to the platform. During device ejection,
    device hot add, or other event processing, OSPM may need to perform
    specific handshaking with the platform. OSPM may also need to indicate
    to the platform its inability to complete a requested operation; for
    example, when a user presses an ejection button for a device that is
    currently in use or is otherwise currently incapable of being ejected.
    In this case, the processing of the ACPI Eject Request notification by
    OSPM fails. OSPM may indicate this failure to the platform through the
    invocation of the _OST control method. As a result of the status
    notification indicating ejection failure, the platform may take certain
    action including reissuing the notification or perhaps turning on an
    appropriate indicator light to signal the failure to the user.

commit 5be120710e7865b1ee198d398176b75253fb0b3f
Author: Peter Krempa <pkrempa@redhat.com>
Date:   Wed Mar 30 18:09:45 2016 +0200

    Add VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED event

    Since we didn't opt to use one single event for device lifecycle for a
    VM we are missing one last event if the device removal failed. This
    event will be emitted once we asked to eject the device but for some
    reason it is not possible.
Hi Peter,

I can not find the error info when hot unplug the memory device from the guest without OS and still find the device info in the xml of guest

[root@ibm-x3850x5-05 jishao]# rpm -q libvirt
libvirt-1.3.4-1.el7.x86_64

(1) start a guest without OS
[root@ibm-x3850x5-05 images]# virsh start r7.2
Domain r7.2 started

(2) [root@ibm-x3850x5-05 jishao]# virsh dumpxml r7.2 | grep dim -A4
[root@ibm-x3850x5-05 jishao]#
[root@ibm-x3850x5-05 jishao]#

(3) [root@ibm-x3850x5-05 jishao]# cat memdevice.xml
<memory model='dimm'>
  <target>
    <size unit='MiB'>500</size>
    <node>0</node>
  </target>
</memory>

(4) [root@ibm-x3850x5-05 jishao]# virsh attach-device r7.2 memdevice.xml
Device attached successfully

(5) [root@ibm-x3850x5-05 jishao]# virsh dumpxml r7.2 | grep dim -A4
<memory model='dimm'>
  <target>
    <size unit='KiB'>512000</size>
    <node>0</node>
  </target>
  <alias name='dimm0'/>
  <address type='dimm' slot='0' base='0x100000000'/>
</memory>

(6) [root@ibm-x3850x5-05 jishao]# virsh detach-device r7.2 memdevice.xml
Device detached successfully

(7) [root@ibm-x3850x5-05 jishao]# echo $?
0

(8) [root@ibm-x3850x5-05 jishao]# virsh dumpxml r7.2 | grep dim -A4
<memory model='dimm'>
  <target>
    <size unit='KiB'>512000</size>
    <node>0</node>
  </target>
  <alias name='dimm0'/>
  <address type='dimm' slot='0' base='0x100000000'/>
</memory>
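The `virsh dumpxml | grep dim` check in steps (5) and (8) can also be done programmatically. The sketch below uses only the Python standard library and looks for a dimm memory device by alias; `has_memory_device` is a hypothetical helper, and the sample XML is a cut-down domain built around the device element shown above (the surrounding `<domain>` skeleton is an assumption for the example).

```python
import xml.etree.ElementTree as ET


def has_memory_device(domain_xml, alias):
    """Return True if a dimm memory device with the given alias is in the XML."""
    root = ET.fromstring(domain_xml)
    for mem in root.iter("memory"):
        # The top-level <memory unit='KiB'> element has no model attribute.
        if mem.get("model") != "dimm":
            continue
        alias_el = mem.find("alias")
        if alias_el is not None and alias_el.get("name") == alias:
            return True
    return False


# Cut-down domain XML; the device element matches the dumpxml output above.
SAMPLE_XML = """
<domain type='kvm'>
  <memory unit='KiB'>4194304</memory>
  <devices>
    <memory model='dimm'>
      <target>
        <size unit='KiB'>512000</size>
        <node>0</node>
      </target>
      <alias name='dimm0'/>
      <address type='dimm' slot='0' base='0x100000000'/>
    </memory>
  </devices>
</domain>
"""

print(has_memory_device(SAMPLE_XML, "dimm0"))
```

In practice the XML string would come from `virsh dumpxml` or the domain's XMLDesc call rather than a literal.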
(In reply to Jingjing Shao from comment #6)
> Hi Peter,
>
> I can not find the error info when hot unplug the memory device from the
> guest without OS and still find the device info in the xml of guest

The notification that hot-unplug failed is delivered only when there is an OS that rejects the memory unplug request. Without an OS it behaves as it did before: no event is delivered and the device stays in the XML.
Verified this bug with libvirt-1.3.5-1.el7.x86_64:

0. Use stap to watch the qemu monitor, and virsh event to watch libvirt events, in two windows:

# stap qemu-monitor.stp
  0.000 begin

# virsh event rhel7.0-rhel --all --loop

1. Prepare a guest with an OS and enable memory hotplug:

# virsh dumpxml rhel7.0-rhel
<domain type='kvm' id='5'>
  <name>rhel7.0-rhel</name>
  <uuid>67c7a123-5415-4136-af62-a2ee098ba6cd</uuid>
  <maxMemory slots='16' unit='KiB'>15243264</maxMemory>
  <memory unit='KiB'>4194304</memory>
  <currentMemory unit='KiB'>4194304</currentMemory>
  ...
  <cpu>
    <numa>
      <cell id='0' cpus='0,2' memory='2097152' unit='KiB'/>
      <cell id='1' cpus='1,3' memory='2097152' unit='KiB'/>
    </numa>
  </cpu>

2. Hotplug a 1G memory device:

# cat mem2.xml
<memory model='dimm'>
  <target>
    <size unit='G'>1</size>
    <node>0</node>
  </target>
</memory>

# virsh attach-device rhel7.0-rhel mem2.xml
Device attached successfully

3. The commands are visible in the stap window:

329.486 > 0x7ffa84231d30 {"execute":"object-add","arguments":{"qom-type":"memory-backend-ram","id":"memdimm0","props":{"size":1073741824}},"id":"libvirt-50"}
329.493 < 0x7ffa84231d30 {"return": {}, "id": "libvirt-50"}
329.493 > 0x7ffa84231d30 {"execute":"device_add","arguments":{"driver":"pc-dimm","node":"0","memdev":"memdimm0","id":"dimm0"},"id":"libvirt-51"}
329.518 < 0x7ffa84231d30 {"return": {}, "id": "libvirt-51"}

4. Hot-unplug the 1G memory device:

# virsh detach-device rhel7.0-rhel mem2.xml
error: Failed to detach device from mem2.xml
error: operation failed: unplug of device was rejected by the guest

5. Check the stap window:

426.483 > 0x7ffa84231d30 {"execute":"device_del","arguments":{"id":"dimm0"},"id":"libvirt-55"}
426.488 < 0x7ffa84231d30 {"return": {}, "id": "libvirt-55"}
426.493 ! 0x7ffa84231d30 {"timestamp": {"seconds": 1465973894, "microseconds": 201574}, "event": "ACPI_DEVICE_OST", "data": {"info": {"device": "dimm0", "source": 3, "status": 132, "slot": "0", "slot-type": "DIMM"}}}
426.669 ! 0x7ffa84231d30 {"timestamp": {"seconds": 1465973894, "microseconds": 377542}, "event": "ACPI_DEVICE_OST", "data": {"info": {"device": "dimm0", "source": 3, "status": 1, "slot": "0", "slot-type": "DIMM"}}}

6. Check the virsh event window:

# virsh event rhel7.0-rhel --all --loop
...
event 'device-removal-failed' for domain rhel7.0-rhel: dimm0
...

7. Log in to the guest and check dmesg:

# dmesg
...
[  751.307067] ACPI: \_SB_.MP00: ACPI_NOTIFY_EJECT_REQUEST event
[  751.329258] Offlined Pages 32768
[  751.338795] Offlined Pages 32768
[  751.352214] Offlined Pages 32768
[  751.359408] Offlined Pages 32768
[  751.372588] Offlined Pages 32768
[  751.376824] memory memory41: Offline failed.

8. Retest with a guest without an OS:

# virsh start rhel7.0-rhel-noos
Domain rhel7.0-rhel-noos started

# virsh attach-device rhel7.0-rhel-noos mem2.xml
Device attached successfully

# virsh detach-device rhel7.0-rhel-noos mem2.xml
Device detached successfully

There is no ACPI_DEVICE_OST event in the stap window.

9. Test libvirt-python with libvirt-python-1.3.5-1.el7.x86_64:

# python
Python 2.7.5 (default, Oct 11 2015, 17:47:16)
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import libvirt
>>> conn = libvirt.open()
>>> lista = dir(libvirt)
>>> "VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED" in lista
True

10. Check the virsh event --list output:

# virsh event --list
lifecycle
reboot
rtc-change
watchdog
io-error
graphics
io-error-reason
control-error
block-job
disk-change
tray-change
pm-wakeup
pm-suspend
balloon-change
pm-suspend-disk
device-removed
block-job-2
tunable
agent-lifecycle
device-added
migration-iteration
job-completed
device-removal-failed

11. Test a successful unplug to make sure the existing behavior is not broken:

# cat mem1.xml
<memory model='dimm'>
  <target>
    <size unit='KiB'>131072</size>
    <node>0</node>
  </target>
</memory>

# virsh attach-device rhel7.0-rhel mem1.xml
Device attached successfully

# virsh detach-device rhel7.0-rhel mem1.xml
Device detached successfully

There is no device-removal-failed event in the event window.
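Step 9 only checks that the constant exists in libvirt-python. A client could subscribe to the new event roughly as sketched below. This is an untested outline, not part of the verification: the `qemu:///system` URI, the function names and the use of a plain list as the opaque value are assumptions for the example, and `libvirt` is imported inside `watch()` only so that the callback can be exercised on a machine without libvirt-python installed.

```python
failed_aliases = []


def on_removal_failed(conn, dom, dev_alias, opaque):
    """Callback for VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED.

    dev_alias is the device alias (for example "dimm0"); opaque is the
    list passed at registration time.
    """
    opaque.append(dev_alias)


def watch(uri="qemu:///system"):
    """Register the callback and run the default event loop (never returns)."""
    import libvirt  # imported lazily; see the note above

    # The event loop implementation must be registered before connecting.
    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open(uri)
    conn.domainEventRegisterAny(
        None,  # None: receive the event for all domains
        libvirt.VIR_DOMAIN_EVENT_ID_DEVICE_REMOVAL_FAILED,
        on_removal_failed,
        failed_aliases,
    )
    while True:
        libvirt.virEventRunDefaultImpl()


# Exercise the callback directly, without a libvirt connection.
on_removal_failed(None, None, "dimm0", failed_aliases)
print(failed_aliases)
```

Calling `watch()` on a host with a running libvirtd should then collect the alias of any device whose removal the guest rejects, matching the 'device-removal-failed' line seen in step 6.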
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-2577.html