Bug 1634759
Summary: | Wrong RESUME event after I/O error [rhel-7.6.z] | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Oneata Mircea Teodor <toneata> |
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
Status: | CLOSED ERRATA | QA Contact: | Yanqiu Zhang <yanqzhan> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 7.5 | CC: | dyuan, fjin, jdenemar, jherrman, jsuchane, lizhu, lmen, mtessun, mzamazal, rbarry, tburke, xuzhang, yafu, yanqzhan |
Target Milestone: | rc | Keywords: | Reopened, ZStream |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | libvirt-4.5.0-10.el7_6.3 | Doc Type: | Bug Fix |
Doc Text: |
Previously, when a virtual machine (VM) was paused due to an I/O error and the error was not fixed, resuming the VM caused it to become paused again, and only a generic unpause event was logged. This update ensures that suspend events generated by the libvirt service properly indicate I/O errors. As a result, attempting to resume a VM with an unfixed I/O error now logs the correct event type.
|
Story Points: | --- |
Clone Of: | 1612943 | Environment: | |
Last Closed: | 2018-11-27 01:22:40 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 1612943 | ||
Bug Blocks: |
Description
Oneata Mircea Teodor
2018-10-01 14:46:32 UTC
The "qemu: Avoid duplicate resume events and state changes" patch causes a regression with fake reboot (virsh reboot --mode acpi) and automatic restart of the domain when it crashes (<on_crash>restart</on_crash>). The domain is not switched back to running once it's restarted and the resume event is not emitted by libvirt. The patch fixing regression from comment 4 was sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00250.html The regression is now fixed upstream by commit e47949357ba268e7e8c3adea7c262b84fa002302 Refs: v4.9.0-21-ge47949357b Author: Jiri Denemark <jdenemar> AuthorDate: Wed Nov 7 14:34:52 2018 +0100 Commit: Jiri Denemark <jdenemar> CommitDate: Thu Nov 8 09:08:58 2018 +0100 qemu: Don't ignore resume events Since commit v4.7.0-302-ge6d77a75c4 processing RESUME event is mandatory for updating domain state. But the event handler explicitly ignored this event in some cases. Thus the state would be wrong after a fake reboot or when a domain was rebooted after it crashed. BTW, the code to ignore RESUME event after SHUTDOWN didn't make sense even before making RESUME event mandatory. Most likely it was there as a result of careless copy&paste from qemuProcessHandleStop. The corresponding debug message was clarified since the original state does not have to be "paused" only and while we have a "resumed" event, the state is called "running". https://bugzilla.redhat.com/show_bug.cgi?id=1612943 Signed-off-by: Jiri Denemark <jdenemar> Verify this bug with: libvirt-4.5.0-10.el7_6.3.x86_64 qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64 Steps to verify: Scenario 1: Guest IOERROR and manual resume: # cat event.log|grep lifecycle event 'lifecycle' for domain rhel7.6: Suspended I/O Error event 'lifecycle' for domain rhel7.6: Resumed Unpaused event 'lifecycle' for domain rhel7.6: Suspended I/O Error # cat /var/log/libvirt/libvirtd.log|grep "Transitioned guest" 2018-11-09 08:04:32.240+0000: 19726: debug : qemuProcessHandleIOError:867 : Transitioned guest rhel7.6 to paused state due to IO error 2018-11-09 08:05:05.904+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0 2018-11-09 08:05:05.908+0000: 19726: debug : qemuProcessHandleIOError:867 : Transitioned guest rhel7.6 to paused state due to IO error Scenario 2: acpi reboot and coredump-restart 1.acpi reboot Start guest with acpi and reboot guest: <features> <acpi/> <apic/> .. # virsh reboot rhel7.6 --mode acpi Domain rhel7.6 is being rebooted # virsh event --loop --all event 'agent-lifecycle' for domain rhel7.6: state: 'disconnected' reason: 'channel event' event 'lifecycle' for domain rhel7.6: Shutdown Finished after guest request event 'reboot' for domain rhel7.6 event 'lifecycle' for domain rhel7.6: Resumed Unpaused event 'reboot' for domain rhel7.6 event 'agent-lifecycle' for domain rhel7.6: state: 'connected' reason: 'channel event' Check guest os rebooted and can login successfully. # virsh domstate rhel7.6 --reason running (booted) # cat libvirtd.log |grep Transitioned 2018-11-09 08:27:56.687+0000: 19726: debug : qemuProcessHandleShutdown:573 : Transitioned guest rhel7.6 to shutdown state 2018-11-09 08:27:56.696+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0 2.Crash panic Guest xml config: <on_crash>coredump-restart</on_crash> … <panic model='isa'> <address type='isa' iobase='0x505'/> </panic> ... # virsh start rhel7.6 [in guest]: # systemctl stop kdump; systemctl disable kdump #sync # echo c > /proc/sysrq-trigger # virsh domstate rhel7.6 --reason running (booted) ... crashed (panicked) ... running (crashed) # virsh event --loop --all event 'lifecycle' for domain rhel7.6: Crashed Panicked event 'reboot' for domain rhel7.6 event 'lifecycle' for domain rhel7.6: Resumed Unpaused event 'agent-lifecycle' for domain rhel7.6: state: 'disconnected' reason: 'channel event' event 'reboot' for domain rhel7.6 event 'agent-lifecycle' for domain rhel7.6: state: 'connected' reason: 'channel event' # cat libvirtd.log |grep Transition 2018-11-09 08:34:52.539+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0 2018-11-09 08:39:41.733+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'crashed', event detail 0 Scenario 3: Other resume related lifecycle events 1. # virsh start rhel7.6 event 'lifecycle' for domain rhel7.6: Resumed Unpaused event 'lifecycle' for domain rhel7.6: Started Booted 2018-11-09 09:35:09.619+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0 2. # virsh save rhel7.6 rhel7.6.save ^Cerror: Failed to save domain rhel7.6 to rhel7.6.save error: operation aborted: domain save job: canceled by client event 'lifecycle' for domain rhel7.6: Suspended Paused event 'migration-iteration' for domain rhel7.6: iteration: '1' event 'lifecycle' for domain rhel7.6: Resumed Unpaused 2018-11-09 09:39:36.101+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'save canceled', event detail 0 3.# virsh restore rhel7.6.save event 'lifecycle' for domain rhel7.6: Started Restored event 'lifecycle' for domain rhel7.6: Resumed Snapshot 2018-11-09 09:38:44.950+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'restored', event detail 2 4.# virsh resume rhel7.6 event 'lifecycle' for domain rhel7.6: Resumed Unpaused 2018-11-09 09:43:52.766+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0 5.# virsh snapshot-create-as rhel7.6 snap1 # virsh snapshot-create-as rhel7.6 snap2 # virsh snapshot-revert rhel7.6 snap1 event 'lifecycle' for domain rhel7.6: Suspended Paused event 'migration-iteration' for domain rhel7.6: iteration: '2' event 'migration-iteration' for domain rhel7.6: iteration: '3' event 'lifecycle' for domain rhel7.6: Resumed Unpaused event 'lifecycle' for domain rhel7.6: Suspended Paused event 'migration-iteration' for domain rhel7.6: iteration: '4' event 'migration-iteration' for domain rhel7.6: iteration: '5' event 'lifecycle' for domain rhel7.6: Resumed Unpaused event 'lifecycle' for domain rhel7.6: Suspended Paused event 'lifecycle' for domain rhel7.6: Resumed Snapshot event 'lifecycle' for domain rhel7.6: Defined Snapshot 2018-11-09 09:45:28.259+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown 2018-11-09 09:45:30.877+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0 2018-11-09 09:47:59.230+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown 2018-11-09 09:48:01.723+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0 2018-11-09 09:48:17.295+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown 2018-11-09 09:48:17.638+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'from snapshot', event detail 2 6.# virsh migrate rhel7.6 --live qemu+ssh://hp-dl385pg8-07.lab.eng.pek2.redhat.com/system --verbose Migration: [100 %] [source]event 'migration-iteration' for domain rhel7.6: iteration: '1' event 'migration-iteration' for domain rhel7.6: iteration: '2' event 'lifecycle' for domain rhel7.6: Suspended Migrated event 'migration-iteration' for domain rhel7.6: iteration: '3' event 'lifecycle' for domain rhel7.6: Stopped Migrated event 'job-completed' for domain rhel7.6: operation: 5 time_elapsed: 4484 time_elapsed_net: 4482 downtime: 284 downtime_net: 282 setup_time: 12 data_total: 1208827904 data_processed: 320227350 data_remaining: 0 memory_total: 1208827904 memory_processed: 320227350 memory_remaining: 0 memory_bps: 112695300 memory_constant: 223876 memory_normal: 77537 memory_normal_bytes: 317591552 memory_dirty_rate: 0 memory_iteration: 3 memory_page_size: 4096 disk_total: 0 disk_processed: 0 disk_remaining: 0 2018-11-10 06:29:09.917+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration [target] event 'lifecycle' for domain rhel7.6: Started Migrated event 'lifecycle' for domain rhel7.6: Resumed Migrated 2018-11-10 06:29:10.189+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migrated', event detail 1 7. virsh migrate rhel7.6 --live qemu+ssh://10.66.7.215/system --verbose --unsafe --postcopy --postcopy-after-precopy Migration: [100 %] [source] event 'migration-iteration' for domain rhel7.6: iteration: '1' event 'migration-iteration' for domain rhel7.6: iteration: '2' event 'lifecycle' for domain rhel7.6: Suspended Migrated event 'migration-iteration' for domain rhel7.6: iteration: '3' event 'lifecycle' for domain rhel7.6: Stopped Migrated event 'job-completed' for domain rhel7.6: operation: 5 time_elapsed: 3380 time_elapsed_net: 3377 downtime: 142 downtime_net: 139 setup_time: 22 data_total: 1208827904 data_processed: 305071970 data_remaining: 0 memory_total: 1208827904 memory_processed: 305071970 memory_remaining: 0 memory_bps: 100481212 memory_constant: 223488 memory_normal: 73845 memory_normal_bytes: 302469120 memory_dirty_rate: 0 memory_iteration: 3 memory_page_size: 4096 disk_total: 0 disk_processed: 0 disk_remaining: 0 2018-11-10 06:34:55.609+0000: 29819: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration [target] event 'lifecycle' for domain rhel7.6: Started Migrated event 'lifecycle' for domain rhel7.6: Resumed Migrated 2018-11-10 06:34:55.743+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migrated', event detail 1 8.Cancel migration when source guest paused, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1634758#c6 [source] event 'migration-iteration' for domain rhel7.6: iteration: '1' event 'migration-iteration' for domain rhel7.6: iteration: '2' event 'lifecycle' for domain rhel7.6: Suspended Migrated event 'migration-iteration' for domain rhel7.6: iteration: '3' event 'lifecycle' for domain rhel7.6: Resumed Migrated 2018-11-10 07:30:10.198+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration 2018-11-10 07:31:11.346+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migration canceled', event detail 1 [target] event 'lifecycle' for domain rhel7.6: Started Migrated event 'lifecycle' for domain rhel7.6: Resumed Post-copy event 'lifecycle' for domain rhel7.6: Stopped Failed 2018-11-10 07:30:10.752+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'post-copy', event detail 3 Hi jiri, Pls help check latest test results as above. And I have two questions: 1.In Scenario3 step7, not see the 'Resumed Post-copy' event and "reason 'post-copy'" in log. But it appears in step8-'Cancel migration when source guest paused'. Is there sth we should pay attention to when test it? 2.Except migration, other paused state show 'reason unknown' in log. do we have plan to fix it? Thank you. A bug exists for Scenario3 step7: https://bugzilla.redhat.com/show_bug.cgi?id=1647365 Additional scenario: 1. watchdog paused and resume (1)Config guest xml with: <watchdog model='i6300esb' action='pause'/> (2)Install watchdog in guest os and set its /etc/watchdog as: interval = 130 watchdog-device = /dev/watchdog (3)In guest, start watchdog process, check event and log: [in guest]# watchdog -f event 'watchdog' for domain yan-V: pause event 'lifecycle' for domain yan-V: Suspended Watchdog 2018-11-12 07:50:54.133+0000: 29819: debug : qemuProcessHandleWatchdog:788 : Transitioned guest yan-V to paused state due to watchdog (4)virsh domstate yan-V --reason paused (watchdog) (5)# virsh resume yan-V event 'lifecycle' for domain yan-V: Resumed Unpaused 2018-11-12 07:53:56.465+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest yan-V into running state, reason 'unpaused', event detail 0 # virsh domstate yan-V --reason running (unpaused) Guest os works well. 2. watchdog reset Re-edit guest with 'reset' action and do above steps again: <watchdog model='i6300esb' action='reset'/> event 'watchdog' for domain yan-V: reset event 'reboot' for domain yan-V event 'reboot' for domain yan-V # virsh domstate yan-V --reason running (booted) Guest os rebooted and works well. (In reply to yanqzhan from comment #11) > 1.In Scenario3 step7, not see the 'Resumed Post-copy' event and "reason > 'post-copy'" in log. But it appears in step8-'Cancel migration when source > guest paused'. Is there sth we should pay attention to when test it? You didn't run virsh event with --timestamp which makes it hard to see what was happening without logs. But since I do see the 'Resumed Post-copy' on the destination, I suppose the migration was just too fast and did not switch to post-copy mode. In other words, your comment #12 is correct and this is covered by bug 1647365. > 2.Except migration, other paused state show 'reason unknown' in log. do we > have plan to fix it? Perhaps, but not within this bug. This is about fixing resume events. Please, file a new bug to track this. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3669 |