Bug 1634759
| Summary: | Wrong RESUME event after I/O error [rhel-7.6.z] | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Oneata Mircea Teodor <toneata> |
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
| Status: | CLOSED ERRATA | QA Contact: | Yanqiu Zhang <yanqzhan> |
| Severity: | high | Docs Contact: | |
| Priority: | high | ||
| Version: | 7.5 | CC: | dyuan, fjin, jdenemar, jherrman, jsuchane, lizhu, lmen, mtessun, mzamazal, rbarry, tburke, xuzhang, yafu, yanqzhan |
| Target Milestone: | rc | Keywords: | Reopened, ZStream |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | libvirt-4.5.0-10.el7_6.3 | Doc Type: | Bug Fix |
| Doc Text: |
Previously, when a virtual machine (VM) was paused due to an I/O error and the error was not fixed, resuming the VM caused it to become paused again, and only a generic unpause event was logged. This update ensures that suspend events generated by the libvirt service properly indicate I/O errors. As a result, attempting to resume a VM with an unfixed I/O error now logs the correct event type.
|
Story Points: | --- |
| Clone Of: | 1612943 | Environment: | |
| Last Closed: | 2018-11-27 01:22:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | 1612943 | ||
| Bug Blocks: | |||
|
Description
Oneata Mircea Teodor
2018-10-01 14:46:32 UTC
The "qemu: Avoid duplicate resume events and state changes" patch causes a regression with fake reboot (virsh reboot --mode acpi) and automatic restart of the domain when it crashes (<on_crash>restart</on_crash>). The domain is not switched back to running once it's restarted and the resume event is not emitted by libvirt. The patch fixing regression from comment 4 was sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00250.html The regression is now fixed upstream by
commit e47949357ba268e7e8c3adea7c262b84fa002302
Refs: v4.9.0-21-ge47949357b
Author: Jiri Denemark <jdenemar>
AuthorDate: Wed Nov 7 14:34:52 2018 +0100
Commit: Jiri Denemark <jdenemar>
CommitDate: Thu Nov 8 09:08:58 2018 +0100
qemu: Don't ignore resume events
Since commit v4.7.0-302-ge6d77a75c4 processing RESUME event is mandatory
for updating domain state. But the event handler explicitly ignored this
event in some cases. Thus the state would be wrong after a fake reboot
or when a domain was rebooted after it crashed.
BTW, the code to ignore RESUME event after SHUTDOWN didn't make sense
even before making RESUME event mandatory. Most likely it was there as a
result of careless copy&paste from qemuProcessHandleStop.
The corresponding debug message was clarified since the original state
does not have to be "paused" only and while we have a "resumed" event,
the state is called "running".
https://bugzilla.redhat.com/show_bug.cgi?id=1612943
Signed-off-by: Jiri Denemark <jdenemar>
Verify this bug with:
libvirt-4.5.0-10.el7_6.3.x86_64
qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64
Steps to verify:
Scenario 1: Guest IOERROR and manual resume:
# cat event.log|grep lifecycle
event 'lifecycle' for domain rhel7.6: Suspended I/O Error
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Suspended I/O Error
# cat /var/log/libvirt/libvirtd.log|grep "Transitioned guest"
2018-11-09 08:04:32.240+0000: 19726: debug : qemuProcessHandleIOError:867 : Transitioned guest rhel7.6 to paused state due to IO error
2018-11-09 08:05:05.904+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
2018-11-09 08:05:05.908+0000: 19726: debug : qemuProcessHandleIOError:867 : Transitioned guest rhel7.6 to paused state due to IO error
Scenario 2: acpi reboot and coredump-restart
1.acpi reboot
Start guest with acpi and reboot guest:
<features>
<acpi/>
<apic/>
..
# virsh reboot rhel7.6 --mode acpi
Domain rhel7.6 is being rebooted
# virsh event --loop --all
event 'agent-lifecycle' for domain rhel7.6: state: 'disconnected' reason: 'channel event'
event 'lifecycle' for domain rhel7.6: Shutdown Finished after guest request
event 'reboot' for domain rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'reboot' for domain rhel7.6
event 'agent-lifecycle' for domain rhel7.6: state: 'connected' reason: 'channel event'
Check guest os rebooted and can login successfully.
# virsh domstate rhel7.6 --reason
running (booted)
# cat libvirtd.log |grep Transitioned
2018-11-09 08:27:56.687+0000: 19726: debug : qemuProcessHandleShutdown:573 : Transitioned guest rhel7.6 to shutdown state
2018-11-09 08:27:56.696+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0
2.Crash panic
Guest xml config:
<on_crash>coredump-restart</on_crash>
…
<panic model='isa'>
<address type='isa' iobase='0x505'/>
</panic>
...
# virsh start rhel7.6
[in guest]:
# systemctl stop kdump; systemctl disable kdump
#sync
# echo c > /proc/sysrq-trigger
# virsh domstate rhel7.6 --reason
running (booted)
...
crashed (panicked)
...
running (crashed)
# virsh event --loop --all
event 'lifecycle' for domain rhel7.6: Crashed Panicked
event 'reboot' for domain rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'agent-lifecycle' for domain rhel7.6: state: 'disconnected' reason: 'channel event'
event 'reboot' for domain rhel7.6
event 'agent-lifecycle' for domain rhel7.6: state: 'connected' reason: 'channel event'
# cat libvirtd.log |grep Transition
2018-11-09 08:34:52.539+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0
2018-11-09 08:39:41.733+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'crashed', event detail 0
Scenario 3: Other resume related lifecycle events
1. # virsh start rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Started Booted
2018-11-09 09:35:09.619+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0
2. # virsh save rhel7.6 rhel7.6.save
^Cerror: Failed to save domain rhel7.6 to rhel7.6.save
error: operation aborted: domain save job: canceled by client
event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
2018-11-09 09:39:36.101+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'save canceled', event detail 0
3.# virsh restore rhel7.6.save
event 'lifecycle' for domain rhel7.6: Started Restored
event 'lifecycle' for domain rhel7.6: Resumed Snapshot
2018-11-09 09:38:44.950+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'restored', event detail 2
4.# virsh resume rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
2018-11-09 09:43:52.766+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
5.# virsh snapshot-create-as rhel7.6 snap1
# virsh snapshot-create-as rhel7.6 snap2
# virsh snapshot-revert rhel7.6 snap1
event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'migration-iteration' for domain rhel7.6: iteration: '4'
event 'migration-iteration' for domain rhel7.6: iteration: '5'
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'lifecycle' for domain rhel7.6: Resumed Snapshot
event 'lifecycle' for domain rhel7.6: Defined Snapshot
2018-11-09 09:45:28.259+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown
2018-11-09 09:45:30.877+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
2018-11-09 09:47:59.230+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown
2018-11-09 09:48:01.723+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
2018-11-09 09:48:17.295+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown
2018-11-09 09:48:17.638+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'from snapshot', event detail 2
6.# virsh migrate rhel7.6 --live qemu+ssh://hp-dl385pg8-07.lab.eng.pek2.redhat.com/system --verbose
Migration: [100 %]
[source]event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'lifecycle' for domain rhel7.6: Suspended Migrated
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Stopped Migrated
event 'job-completed' for domain rhel7.6:
operation: 5
time_elapsed: 4484
time_elapsed_net: 4482
downtime: 284
downtime_net: 282
setup_time: 12
data_total: 1208827904
data_processed: 320227350
data_remaining: 0
memory_total: 1208827904
memory_processed: 320227350
memory_remaining: 0
memory_bps: 112695300
memory_constant: 223876
memory_normal: 77537
memory_normal_bytes: 317591552
memory_dirty_rate: 0
memory_iteration: 3
memory_page_size: 4096
disk_total: 0
disk_processed: 0
disk_remaining: 0
2018-11-10 06:29:09.917+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration
[target]
event 'lifecycle' for domain rhel7.6: Started Migrated
event 'lifecycle' for domain rhel7.6: Resumed Migrated
2018-11-10 06:29:10.189+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migrated', event detail 1
7. virsh migrate rhel7.6 --live qemu+ssh://10.66.7.215/system --verbose --unsafe --postcopy --postcopy-after-precopy
Migration: [100 %]
[source]
event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'lifecycle' for domain rhel7.6: Suspended Migrated
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Stopped Migrated
event 'job-completed' for domain rhel7.6:
operation: 5
time_elapsed: 3380
time_elapsed_net: 3377
downtime: 142
downtime_net: 139
setup_time: 22
data_total: 1208827904
data_processed: 305071970
data_remaining: 0
memory_total: 1208827904
memory_processed: 305071970
memory_remaining: 0
memory_bps: 100481212
memory_constant: 223488
memory_normal: 73845
memory_normal_bytes: 302469120
memory_dirty_rate: 0
memory_iteration: 3
memory_page_size: 4096
disk_total: 0
disk_processed: 0
disk_remaining: 0
2018-11-10 06:34:55.609+0000: 29819: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration
[target]
event 'lifecycle' for domain rhel7.6: Started Migrated
event 'lifecycle' for domain rhel7.6: Resumed Migrated
2018-11-10 06:34:55.743+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migrated', event detail 1
8.Cancel migration when source guest paused, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1634758#c6
[source]
event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'lifecycle' for domain rhel7.6: Suspended Migrated
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Resumed Migrated
2018-11-10 07:30:10.198+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration
2018-11-10 07:31:11.346+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migration canceled', event detail 1
[target]
event 'lifecycle' for domain rhel7.6: Started Migrated
event 'lifecycle' for domain rhel7.6: Resumed Post-copy
event 'lifecycle' for domain rhel7.6: Stopped Failed
2018-11-10 07:30:10.752+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'post-copy', event detail 3
Hi jiri, Pls help check latest test results as above. And I have two questions: 1.In Scenario3 step7, not see the 'Resumed Post-copy' event and "reason 'post-copy'" in log. But it appears in step8-'Cancel migration when source guest paused'. Is there sth we should pay attention to when test it? 2.Except migration, other paused state show 'reason unknown' in log. do we have plan to fix it? Thank you. A bug exists for Scenario3 step7: https://bugzilla.redhat.com/show_bug.cgi?id=1647365 Additional scenario: 1. watchdog paused and resume (1)Config guest xml with: <watchdog model='i6300esb' action='pause'/> (2)Install watchdog in guest os and set its /etc/watchdog as: interval = 130 watchdog-device = /dev/watchdog (3)In guest, start watchdog process, check event and log: [in guest]# watchdog -f event 'watchdog' for domain yan-V: pause event 'lifecycle' for domain yan-V: Suspended Watchdog 2018-11-12 07:50:54.133+0000: 29819: debug : qemuProcessHandleWatchdog:788 : Transitioned guest yan-V to paused state due to watchdog (4)virsh domstate yan-V --reason paused (watchdog) (5)# virsh resume yan-V event 'lifecycle' for domain yan-V: Resumed Unpaused 2018-11-12 07:53:56.465+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest yan-V into running state, reason 'unpaused', event detail 0 # virsh domstate yan-V --reason running (unpaused) Guest os works well. 2. watchdog reset Re-edit guest with 'reset' action and do above steps again: <watchdog model='i6300esb' action='reset'/> event 'watchdog' for domain yan-V: reset event 'reboot' for domain yan-V event 'reboot' for domain yan-V # virsh domstate yan-V --reason running (booted) Guest os rebooted and works well. (In reply to yanqzhan from comment #11) > 1.In Scenario3 step7, not see the 'Resumed Post-copy' event and "reason > 'post-copy'" in log. But it appears in step8-'Cancel migration when source > guest paused'. Is there sth we should pay attention to when test it? You didn't run virsh event with --timestamp which makes it hard to see what was happening without logs. But since I do see the 'Resumed Post-copy' on the destination, I suppose the migration was just too fast and did not switch to post-copy mode. In other words, your comment #12 is correct and this is covered by bug 1647365. > 2.Except migration, other paused state show 'reason unknown' in log. do we > have plan to fix it? Perhaps, but not within this bug. This is about fixing resume events. Please, file a new bug to track this. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3669 |