Bug 1612943
Summary: | Wrong RESUME event after I/O error | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Milan Zamazal <mzamazal> | ||||||||
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||||||
Status: | CLOSED ERRATA | QA Contact: | Yanqiu Zhang <yanqzhan> | ||||||||
Severity: | high | Docs Contact: | |||||||||
Priority: | high | ||||||||||
Version: | 7.5 | CC: | dyuan, fjin, jdenemar, jherrman, jsuchane, lmen, mtessun, rbarry, tburke, xuzhang, yafu, yanqzhan | ||||||||
Target Milestone: | rc | Keywords: | Reopened, ZStream | ||||||||
Target Release: | --- | ||||||||||
Hardware: | Unspecified | ||||||||||
OS: | Unspecified | ||||||||||
Whiteboard: | |||||||||||
Fixed In Version: | libvirt-4.5.0-11.el7 | Doc Type: | Bug Fix | ||||||||
Doc Text: |
Previously, when a virtual machine (VM) was paused due to an I/O error and the error was not fixed, resuming the VM caused it to become paused again, and only a generic unpause event was logged. This update ensures that suspend events generated by the libvirt service properly indicate I/O errors. As a result, attempting to resume a VM with an unfixed I/O error now logs the correct event type.
|
Story Points: | --- | ||||||||
Clone Of: | |||||||||||
: | 1634758 1634759 (view as bug list) | Environment: | |||||||||
Last Closed: | 2019-08-06 13:13:56 UTC | Type: | Bug | ||||||||
Regression: | --- | Mount Type: | --- | ||||||||
Documentation: | --- | CRM: | |||||||||
Verified Versions: | Category: | --- | |||||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||||
Embargoed: | |||||||||||
Bug Depends On: | |||||||||||
Bug Blocks: | 1526025, 1634758, 1634759, 1651787 | ||||||||||
Attachments: |
|
Description
Milan Zamazal
2018-08-06 15:07:56 UTC
Created attachment 1473648 [details]
Vdsm log
Reproduce with: libvirt-3.9.0-14.el7_5.7.x86_64 qemu-kvm-rhev-2.10.0-21.el7_5.5.x86_64 Steps: 1. Run a VM with a second disk attached. <disk type='file' device='disk'> <driver name='qemu' type='raw'/> <source file='/var/lib/libvirt/images/mypool/disk1.img'/> <target dev='vdb' bus='virtio'/> <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/> </disk> # virsh start V-7.5 Domain V-7.5 started 2. Cause an I/O error by overwriting on the disk by https://bugzilla.redhat.com/show_bug.cgi?id=1566153#c7 3. Wait until the VM gets paused: Guest os can not be interacted, can see status by: # virsh -r domstate V-7.5 --reason paused (I/O error) 4. Try to manually resume the VM. # virsh resume V-7.5 Domain V-7.5 resumed # virsh -r domstate V-7.5 --reason running (unpaused) 5. Check event and libvirtd.log # cat virsh-event.log |grep lifecycle event 'lifecycle' for domain V-7.5: Suspended I/O Error event 'lifecycle' for domain V-7.5: Resumed Unpaused event 'lifecycle' for domain V-7.5: Suspended I/O Error event 'lifecycle' for domain V-7.5: Resumed Unpaused The lifecycle events in order are "3 2", "4 0", "3 2", "4 0" in libvirtd.log: '3' and '4' are virDomainEventType: VIR_DOMAIN_EVENT_SUSPENDED = 3, VIR_DOMAIN_EVENT_RESUMED = 4, '2' and '0' are details on the cause of a lifecycle event: VIR_DOMAIN_EVENT_SUSPENDED_IOERROR = 2, /* Suspended due to a disk I/O error */ VIR_DOMAIN_EVENT_RESUMED_UNPAUSED = 0, /* Normal resume due to admin unpause */ # cat libvirtd.log-20180808 |grep "Relaying domain lifecycle event"|uniq 2018-08-07 12:54:01.349+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 3 2, callback 0 legacy 0 ... 2018-08-07 12:55:33.883+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 4 0, callback 0 legacy 0 ... 2018-08-07 12:55:33.888+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 3 2, callback 0 legacy 0 ... 2018-08-07 12:55:34.046+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 4 0, callback 0 legacy 0 ... Actual results: 1. In step4, guest is resumed to running status even though the ioerror still exists, and guest os is still inaccessible. 2. In step5, extra 'resume' occurred in lifecycle events, and reason is "admin unpause". Additional info: The qemu events in order are as expected: BLOCK_IO_ERROR"->"STOP"->"RESUME"->"BLOCK_IO_ERROR"->"STOP": # cat /var/log/libvirt/libvirtd.log |grep QEMU_MONITOR_RECV_EVENT 2018-08-07 12:54:01.345+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646441, "microseconds": 345352}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "node-name": "#block329", "reason": "No space left on device", "operation": "write", "action": "stop"}} ... 2018-08-07 12:54:01.481+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646441, "microseconds": 356659}, "event": "STOP"} 2018-08-07 12:55:33.882+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646533, "microseconds": 882140}, "event": "RESUME"} 2018-08-07 12:55:33.884+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646533, "microseconds": 883390}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "node-name": "#block329", "reason": "No space left on device", "operation": "write", "action": "stop"}} ... 2018-08-07 12:55:34.033+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646533, "microseconds": 906884}, "event": "STOP"} Created attachment 1474307 [details]
logs_for_comment3
*** This bug has been marked as a duplicate of bug 1582122 *** Oops. Totally ok with me Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-September/msg00548.html This is now fixed upstream by commit 55af06187c48a01192764d8638b85739b0178fe0 Refs: v4.7.0-298-g55af06187c Author: Jiri Denemark <jdenemar> AuthorDate: Mon Sep 10 15:10:54 2018 +0200 Commit: Jiri Denemark <jdenemar> CommitDate: Tue Sep 25 20:42:25 2018 +0200 qemu: Properly report VIR_DOMAIN_EVENT_RESUMED_FROM_SNAPSHOT VIR_DOMAIN_EVENT_RESUMED_FROM_SNAPSHOT was defined but not used anywhere in our event generation code. This fixes qemuDomainRevertToSnapshot to properly report why the domain was resumed. Signed-off-by: Jiri Denemark <jdenemar> Reviewed-by: John Ferlan <jferlan> commit 54b5b0ac3945ad5417b67bec8443cf6e7b3d482b Refs: v4.7.0-299-g54b5b0ac39 Author: Jiri Denemark <jdenemar> AuthorDate: Tue Sep 11 19:26:07 2018 +0200 Commit: Jiri Denemark <jdenemar> CommitDate: Tue Sep 25 20:42:25 2018 +0200 qemu: Report more appropriate running reasons This patch replaces some rather generic VIR_DOMAIN_RUNNING_UNPAUSED reasons when changing domain state to running with more specific ones. All of them are done when libvirtd reconnects to an existing domain after being restarted and sees an unfinished migration or save. Signed-off-by: Jiri Denemark <jdenemar> Reviewed-by: John Ferlan <jferlan> commit 5dab984ed0cd0332e59d719420ab2f9d009b952f Refs: v4.7.0-300-g5dab984ed0 Author: Jiri Denemark <jdenemar> AuthorDate: Mon Sep 10 19:41:53 2018 +0200 Commit: Jiri Denemark <jdenemar> CommitDate: Tue Sep 25 20:42:25 2018 +0200 qemu: Pass running reason to RESUME event handler Whenever we get the RESUME event from QEMU, we change the state of the affected domain to VIR_DOMAIN_RUNNING with VIR_DOMAIN_RUNNING_UNPAUSED reason. This is fine if the domain is resumed unexpectedly, but when we sent "cont" to QEMU we usually have a better reason for the state change. The better reason is used in qemuProcessStartCPUs which also sets the domain state to running if qemuMonitorStartCPUs reports success. Thus we may end up with two state updates in a row, but the final reason is correct. This patch is a preparation for dropping the state change done in qemuMonitorStartCPUs for which we need to pass the actual running reason to the RESUME event handler and use it there instead of VIR_DOMAIN_RUNNING_UNPAUSED. Signed-off-by: Jiri Denemark <jdenemar> Reviewed-by: John Ferlan <jferlan> commit 8ae9b49f5a4a02f57a1dfa20d4fe04c3d40a4665 Refs: v4.7.0-301-g8ae9b49f5a Author: Jiri Denemark <jdenemar> AuthorDate: Tue Sep 11 15:13:08 2018 +0200 Commit: Jiri Denemark <jdenemar> CommitDate: Tue Sep 25 20:42:25 2018 +0200 qemu: Map running reason to resume event detail Thanks to the previous commit the RESUME event handler knows what reason should be used when changing the domain state to VIR_DOMAIN_RUNNING, but the emitted VIR_DOMAIN_EVENT_RESUMED event still uses a generic VIR_DOMAIN_EVENT_RESUMED_UNPAUSED detail. Luckily, the event detail can be easily deduced from the running reason, which saves us from having to pass one more value to the handler. Signed-off-by: Jiri Denemark <jdenemar> Reviewed-by: John Ferlan <jferlan> commit e6d77a75c4bf0c017d62b717b75e4bb6aa7a456b Refs: v4.7.0-302-ge6d77a75c4 Author: Jiri Denemark <jdenemar> AuthorDate: Wed Sep 12 14:34:33 2018 +0200 Commit: Jiri Denemark <jdenemar> CommitDate: Tue Sep 25 20:42:25 2018 +0200 qemu: Avoid duplicate resume events and state changes The only place where VIR_DOMAIN_EVENT_RESUMED should be generated is the RESUME event handler to make sure we don't generate duplicate events or state changes. In the worse case the duplicity can revert or cover changes done by other event handlers. For example, after QEMU sent RESUME, BLOCK_IO_ERROR, and STOP events we could happily mark the domain as running and report VIR_DOMAIN_EVENT_RESUMED to registered clients. https://bugzilla.redhat.com/show_bug.cgi?id=1612943 Signed-off-by: Jiri Denemark <jdenemar> Reviewed-by: John Ferlan <jferlan> The "qemu: Avoid duplicate resume events and state changes" patch causes a regression with fake reboot (virsh reboot --mode acpi) and automatic restart of the domain when it crashes (<on_crash>restart</on_crash>). The domain is not switched back to running once it's restarted and the resume event is not emitted by libvirt. The patch fixing regression from comment 16 was sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00250.html The regression is now fixed upstream by commit e47949357ba268e7e8c3adea7c262b84fa002302 Refs: v4.9.0-21-ge47949357b Author: Jiri Denemark <jdenemar> AuthorDate: Wed Nov 7 14:34:52 2018 +0100 Commit: Jiri Denemark <jdenemar> CommitDate: Thu Nov 8 09:08:58 2018 +0100 qemu: Don't ignore resume events Since commit v4.7.0-302-ge6d77a75c4 processing RESUME event is mandatory for updating domain state. But the event handler explicitly ignored this event in some cases. Thus the state would be wrong after a fake reboot or when a domain was rebooted after it crashed. BTW, the code to ignore RESUME event after SHUTDOWN didn't make sense even before making RESUME event mandatory. Most likely it was there as a result of careless copy&paste from qemuProcessHandleStop. The corresponding debug message was clarified since the original state does not have to be "paused" only and while we have a "resumed" event, the state is called "running". https://bugzilla.redhat.com/show_bug.cgi?id=1612943 Signed-off-by: Jiri Denemark <jdenemar> Verified this bug on rhel7.7 with: libvirt-4.5.0-12.el7.x86_64 qemu-kvm-rhev-2.12.0-26.el7.x86_64 kernel-3.10.0-1034/-1036.el7.x86_64 Steps and results are same with Bug 1634758. Mark as verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2294 |