Bug 1612943
| Summary: | Wrong RESUME event after I/O error | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Milan Zamazal <mzamazal> | ||||||||
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | Yanqiu Zhang <yanqzhan> | ||||||||
| Severity: | high | Docs Contact: | |||||||||
| Priority: | high | ||||||||||
| Version: | 7.5 | CC: | dyuan, fjin, jdenemar, jherrman, jsuchane, lmen, mtessun, rbarry, tburke, xuzhang, yafu, yanqzhan | ||||||||
| Target Milestone: | rc | Keywords: | Reopened, ZStream | ||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | libvirt-4.5.0-11.el7 | Doc Type: | Bug Fix | ||||||||
| Doc Text: |
Previously, when a virtual machine (VM) was paused due to an I/O error and the error was not fixed, resuming the VM caused it to become paused again, and only a generic unpause event was logged. This update ensures that suspend events generated by the libvirt service properly indicate I/O errors. As a result, attempting to resume a VM with an unfixed I/O error now logs the correct event type.
|
Story Points: | --- | ||||||||
| Clone Of: | |||||||||||
| : | 1634758 1634759 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2019-08-06 13:13:56 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | |||||||||||
| Bug Blocks: | 1526025, 1634758, 1634759, 1651787 | ||||||||||
| Attachments: |
|
||||||||||
|
Description
Milan Zamazal
2018-08-06 15:07:56 UTC
Created attachment 1473648 [details]
Vdsm log
Reproduce with:
libvirt-3.9.0-14.el7_5.7.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.5.x86_64
Steps:
1. Run a VM with a second disk attached.
<disk type='file' device='disk'>
<driver name='qemu' type='raw'/>
<source file='/var/lib/libvirt/images/mypool/disk1.img'/>
<target dev='vdb' bus='virtio'/>
<address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
</disk>
# virsh start V-7.5
Domain V-7.5 started
2. Cause an I/O error by overwriting on the disk by https://bugzilla.redhat.com/show_bug.cgi?id=1566153#c7
3. Wait until the VM gets paused:
Guest os can not be interacted, can see status by:
# virsh -r domstate V-7.5 --reason
paused (I/O error)
4. Try to manually resume the VM.
# virsh resume V-7.5
Domain V-7.5 resumed
# virsh -r domstate V-7.5 --reason
running (unpaused)
5. Check event and libvirtd.log
# cat virsh-event.log |grep lifecycle
event 'lifecycle' for domain V-7.5: Suspended I/O Error
event 'lifecycle' for domain V-7.5: Resumed Unpaused
event 'lifecycle' for domain V-7.5: Suspended I/O Error
event 'lifecycle' for domain V-7.5: Resumed Unpaused
The lifecycle events in order are "3 2", "4 0", "3 2", "4 0" in libvirtd.log:
'3' and '4' are virDomainEventType:
VIR_DOMAIN_EVENT_SUSPENDED = 3,
VIR_DOMAIN_EVENT_RESUMED = 4,
'2' and '0' are details on the cause of a lifecycle event:
VIR_DOMAIN_EVENT_SUSPENDED_IOERROR = 2, /* Suspended due to a disk I/O error */
VIR_DOMAIN_EVENT_RESUMED_UNPAUSED = 0, /* Normal resume due to admin unpause */
# cat libvirtd.log-20180808 |grep "Relaying domain lifecycle event"|uniq
2018-08-07 12:54:01.349+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 3 2, callback 0 legacy 0
...
2018-08-07 12:55:33.883+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 4 0, callback 0 legacy 0
...
2018-08-07 12:55:33.888+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 3 2, callback 0 legacy 0
...
2018-08-07 12:55:34.046+0000: 1753: debug : remoteRelayDomainEventLifecycle:312 : Relaying domain lifecycle event 4 0, callback 0 legacy 0
...
Actual results:
1. In step4, guest is resumed to running status even though the ioerror still exists, and guest os is still inaccessible.
2. In step5, extra 'resume' occurred in lifecycle events, and reason is "admin unpause".
Additional info:
The qemu events in order are as expected: BLOCK_IO_ERROR"->"STOP"->"RESUME"->"BLOCK_IO_ERROR"->"STOP":
# cat /var/log/libvirt/libvirtd.log |grep QEMU_MONITOR_RECV_EVENT
2018-08-07 12:54:01.345+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646441, "microseconds": 345352}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "node-name": "#block329", "reason": "No space left on device", "operation": "write", "action": "stop"}}
...
2018-08-07 12:54:01.481+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646441, "microseconds": 356659}, "event": "STOP"}
2018-08-07 12:55:33.882+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646533, "microseconds": 882140}, "event": "RESUME"}
2018-08-07 12:55:33.884+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646533, "microseconds": 883390}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk1", "nospace": true, "__com.redhat_reason": "enospc", "node-name": "#block329", "reason": "No space left on device", "operation": "write", "action": "stop"}}
...
2018-08-07 12:55:34.033+0000: 1753: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f1b74027d50 event={"timestamp": {"seconds": 1533646533, "microseconds": 906884}, "event": "STOP"}
Created attachment 1474307 [details]
logs_for_comment3
*** This bug has been marked as a duplicate of bug 1582122 *** Oops. Totally ok with me Patches sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-September/msg00548.html This is now fixed upstream by
commit 55af06187c48a01192764d8638b85739b0178fe0
Refs: v4.7.0-298-g55af06187c
Author: Jiri Denemark <jdenemar>
AuthorDate: Mon Sep 10 15:10:54 2018 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 25 20:42:25 2018 +0200
qemu: Properly report VIR_DOMAIN_EVENT_RESUMED_FROM_SNAPSHOT
VIR_DOMAIN_EVENT_RESUMED_FROM_SNAPSHOT was defined but not used anywhere
in our event generation code. This fixes qemuDomainRevertToSnapshot to
properly report why the domain was resumed.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: John Ferlan <jferlan>
commit 54b5b0ac3945ad5417b67bec8443cf6e7b3d482b
Refs: v4.7.0-299-g54b5b0ac39
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Sep 11 19:26:07 2018 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 25 20:42:25 2018 +0200
qemu: Report more appropriate running reasons
This patch replaces some rather generic VIR_DOMAIN_RUNNING_UNPAUSED
reasons when changing domain state to running with more specific ones.
All of them are done when libvirtd reconnects to an existing domain
after being restarted and sees an unfinished migration or save.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: John Ferlan <jferlan>
commit 5dab984ed0cd0332e59d719420ab2f9d009b952f
Refs: v4.7.0-300-g5dab984ed0
Author: Jiri Denemark <jdenemar>
AuthorDate: Mon Sep 10 19:41:53 2018 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 25 20:42:25 2018 +0200
qemu: Pass running reason to RESUME event handler
Whenever we get the RESUME event from QEMU, we change the state of the
affected domain to VIR_DOMAIN_RUNNING with VIR_DOMAIN_RUNNING_UNPAUSED
reason. This is fine if the domain is resumed unexpectedly, but when we
sent "cont" to QEMU we usually have a better reason for the state
change. The better reason is used in qemuProcessStartCPUs which also
sets the domain state to running if qemuMonitorStartCPUs reports
success. Thus we may end up with two state updates in a row, but the
final reason is correct.
This patch is a preparation for dropping the state change done in
qemuMonitorStartCPUs for which we need to pass the actual running reason
to the RESUME event handler and use it there instead of
VIR_DOMAIN_RUNNING_UNPAUSED.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: John Ferlan <jferlan>
commit 8ae9b49f5a4a02f57a1dfa20d4fe04c3d40a4665
Refs: v4.7.0-301-g8ae9b49f5a
Author: Jiri Denemark <jdenemar>
AuthorDate: Tue Sep 11 15:13:08 2018 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 25 20:42:25 2018 +0200
qemu: Map running reason to resume event detail
Thanks to the previous commit the RESUME event handler knows what reason
should be used when changing the domain state to VIR_DOMAIN_RUNNING, but
the emitted VIR_DOMAIN_EVENT_RESUMED event still uses a generic
VIR_DOMAIN_EVENT_RESUMED_UNPAUSED detail. Luckily, the event detail can
be easily deduced from the running reason, which saves us from having to
pass one more value to the handler.
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: John Ferlan <jferlan>
commit e6d77a75c4bf0c017d62b717b75e4bb6aa7a456b
Refs: v4.7.0-302-ge6d77a75c4
Author: Jiri Denemark <jdenemar>
AuthorDate: Wed Sep 12 14:34:33 2018 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Tue Sep 25 20:42:25 2018 +0200
qemu: Avoid duplicate resume events and state changes
The only place where VIR_DOMAIN_EVENT_RESUMED should be generated is the
RESUME event handler to make sure we don't generate duplicate events or
state changes. In the worse case the duplicity can revert or cover
changes done by other event handlers.
For example, after QEMU sent RESUME, BLOCK_IO_ERROR, and STOP events
we could happily mark the domain as running and report
VIR_DOMAIN_EVENT_RESUMED to registered clients.
https://bugzilla.redhat.com/show_bug.cgi?id=1612943
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: John Ferlan <jferlan>
The "qemu: Avoid duplicate resume events and state changes" patch causes a regression with fake reboot (virsh reboot --mode acpi) and automatic restart of the domain when it crashes (<on_crash>restart</on_crash>). The domain is not switched back to running once it's restarted and the resume event is not emitted by libvirt. The patch fixing regression from comment 16 was sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00250.html The regression is now fixed upstream by
commit e47949357ba268e7e8c3adea7c262b84fa002302
Refs: v4.9.0-21-ge47949357b
Author: Jiri Denemark <jdenemar>
AuthorDate: Wed Nov 7 14:34:52 2018 +0100
Commit: Jiri Denemark <jdenemar>
CommitDate: Thu Nov 8 09:08:58 2018 +0100
qemu: Don't ignore resume events
Since commit v4.7.0-302-ge6d77a75c4 processing RESUME event is mandatory
for updating domain state. But the event handler explicitly ignored this
event in some cases. Thus the state would be wrong after a fake reboot
or when a domain was rebooted after it crashed.
BTW, the code to ignore RESUME event after SHUTDOWN didn't make sense
even before making RESUME event mandatory. Most likely it was there as a
result of careless copy&paste from qemuProcessHandleStop.
The corresponding debug message was clarified since the original state
does not have to be "paused" only and while we have a "resumed" event,
the state is called "running".
https://bugzilla.redhat.com/show_bug.cgi?id=1612943
Signed-off-by: Jiri Denemark <jdenemar>
Verified this bug on rhel7.7 with: libvirt-4.5.0-12.el7.x86_64 qemu-kvm-rhev-2.12.0-26.el7.x86_64 kernel-3.10.0-1034/-1036.el7.x86_64 Steps and results are same with Bug 1634758. Mark as verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2294 |