Bug 1634759

Summary: Wrong RESUME event after I/O error [rhel-7.6.z]
Product: Red Hat Enterprise Linux 7 Reporter: Oneata Mircea Teodor <toneata>
Component: libvirtAssignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA QA Contact: Yanqiu Zhang <yanqzhan>
Severity: high Docs Contact:
Priority: high    
Version: 7.5CC: dyuan, fjin, jdenemar, jherrman, jsuchane, lizhu, lmen, mtessun, mzamazal, rbarry, tburke, xuzhang, yafu, yanqzhan
Target Milestone: rcKeywords: Reopened, ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: libvirt-4.5.0-10.el7_6.3 Doc Type: Bug Fix
Doc Text:
Previously, when a virtual machine (VM) was paused due to an I/O error and the error was not fixed, resuming the VM caused it to become paused again, and only a generic unpause event was logged. This update ensures that suspend events generated by the libvirt service properly indicate I/O errors. As a result, attempting to resume a VM with an unfixed I/O error now logs the correct event type.
Story Points: ---
Clone Of: 1612943 Environment:
Last Closed: 2018-11-27 01:22:40 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1612943    
Bug Blocks:    

Description Oneata Mircea Teodor 2018-10-01 14:46:32 UTC
This bug has been copied from bug #1612943 and has been proposed to be backported to 7.6 z-stream (EUS).

Comment 4 Jiri Denemark 2018-10-26 16:14:45 UTC
The "qemu: Avoid duplicate resume events and state changes" patch causes a
regression with fake reboot (virsh reboot --mode acpi) and automatic restart
of the domain when it crashes (<on_crash>restart</on_crash>). The domain is
not switched back to running once it's restarted and the resume event is not
emitted by libvirt.

Comment 5 Jiri Denemark 2018-11-07 14:24:18 UTC
The patch fixing regression from comment 4 was sent upstream for review:
https://www.redhat.com/archives/libvir-list/2018-November/msg00250.html

Comment 6 Jiri Denemark 2018-11-08 09:57:24 UTC
The regression is now fixed upstream by

commit e47949357ba268e7e8c3adea7c262b84fa002302
Refs: v4.9.0-21-ge47949357b
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Nov 7 14:34:52 2018 +0100
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Nov 8 09:08:58 2018 +0100

    qemu: Don't ignore resume events

    Since commit v4.7.0-302-ge6d77a75c4 processing RESUME event is mandatory
    for updating domain state. But the event handler explicitly ignored this
    event in some cases. Thus the state would be wrong after a fake reboot
    or when a domain was rebooted after it crashed.

    BTW, the code to ignore RESUME event after SHUTDOWN didn't make sense
    even before making RESUME event mandatory. Most likely it was there as a
    result of careless copy&paste from qemuProcessHandleStop.

    The corresponding debug message was clarified since the original state
    does not have to be "paused" only and while we have a "resumed" event,
    the state is called "running".

    https://bugzilla.redhat.com/show_bug.cgi?id=1612943

    Signed-off-by: Jiri Denemark <jdenemar>

Comment 9 Yanqiu Zhang 2018-11-09 09:06:45 UTC
Verify this bug with:
libvirt-4.5.0-10.el7_6.3.x86_64
qemu-kvm-rhev-2.12.0-19.el7_6.2.x86_64

Steps to verify:
Scenario 1: Guest IOERROR and manual resume:
# cat event.log|grep lifecycle
event 'lifecycle' for domain rhel7.6: Suspended I/O Error
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Suspended I/O Error

# cat /var/log/libvirt/libvirtd.log|grep "Transitioned guest"
2018-11-09 08:04:32.240+0000: 19726: debug : qemuProcessHandleIOError:867 : Transitioned guest rhel7.6 to paused state due to IO error
2018-11-09 08:05:05.904+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
2018-11-09 08:05:05.908+0000: 19726: debug : qemuProcessHandleIOError:867 : Transitioned guest rhel7.6 to paused state due to IO error

Scenario 2: acpi reboot and coredump-restart
1.acpi reboot
 Start guest with acpi and reboot guest:
  <features>
    <acpi/>
    <apic/>
..
# virsh reboot rhel7.6 --mode acpi
Domain rhel7.6 is being rebooted

#  virsh event --loop --all
event 'agent-lifecycle' for domain rhel7.6: state: 'disconnected' reason: 'channel event'
event 'lifecycle' for domain rhel7.6: Shutdown Finished after guest request
event 'reboot' for domain rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'reboot' for domain rhel7.6
event 'agent-lifecycle' for domain rhel7.6: state: 'connected' reason: 'channel event'

Check guest os rebooted and can login successfully.
# virsh domstate rhel7.6 --reason
running (booted)

# cat libvirtd.log |grep Transitioned
2018-11-09 08:27:56.687+0000: 19726: debug : qemuProcessHandleShutdown:573 : Transitioned guest rhel7.6 to shutdown state
2018-11-09 08:27:56.696+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0

2.Crash panic
Guest xml config:
  <on_crash>coredump-restart</on_crash>
…
    <panic model='isa'>
      <address type='isa' iobase='0x505'/>
    </panic>
...

# virsh start rhel7.6 
[in guest]:
# systemctl stop kdump; systemctl disable kdump
#sync
# echo c > /proc/sysrq-trigger

# virsh domstate rhel7.6 --reason
running (booted)
...
crashed (panicked)
...
running (crashed)

#  virsh event --loop --all
event 'lifecycle' for domain rhel7.6: Crashed Panicked
event 'reboot' for domain rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'agent-lifecycle' for domain rhel7.6: state: 'disconnected' reason: 'channel event'
event 'reboot' for domain rhel7.6
event 'agent-lifecycle' for domain rhel7.6: state: 'connected' reason: 'channel event'

# cat libvirtd.log |grep Transition
2018-11-09 08:34:52.539+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0
2018-11-09 08:39:41.733+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'crashed', event detail 0

Comment 10 Yanqiu Zhang 2018-11-10 08:27:54 UTC
Scenario 3: Other resume related lifecycle events
1. # virsh start rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Started Booted

2018-11-09 09:35:09.619+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'booted', event detail 0

2. # virsh save rhel7.6 rhel7.6.save
^Cerror: Failed to save domain rhel7.6 to rhel7.6.save
error: operation aborted: domain save job: canceled by client

event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'lifecycle' for domain rhel7.6: Resumed Unpaused

2018-11-09 09:39:36.101+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'save canceled', event detail 0

3.# virsh restore rhel7.6.save
event 'lifecycle' for domain rhel7.6: Started Restored
event 'lifecycle' for domain rhel7.6: Resumed Snapshot

2018-11-09 09:38:44.950+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'restored', event detail 2

4.# virsh resume rhel7.6
event 'lifecycle' for domain rhel7.6: Resumed Unpaused

2018-11-09 09:43:52.766+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0

5.# virsh snapshot-create-as rhel7.6 snap1
# virsh snapshot-create-as rhel7.6 snap2
# virsh snapshot-revert rhel7.6 snap1

event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'migration-iteration' for domain rhel7.6: iteration: '4'
event 'migration-iteration' for domain rhel7.6: iteration: '5'
event 'lifecycle' for domain rhel7.6: Resumed Unpaused
event 'lifecycle' for domain rhel7.6: Suspended Paused
event 'lifecycle' for domain rhel7.6: Resumed Snapshot
event 'lifecycle' for domain rhel7.6: Defined Snapshot

2018-11-09 09:45:28.259+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown
2018-11-09 09:45:30.877+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
2018-11-09 09:47:59.230+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown
2018-11-09 09:48:01.723+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'unpaused', event detail 0
2018-11-09 09:48:17.295+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason unknown
2018-11-09 09:48:17.638+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'from snapshot', event detail 2

6.# virsh migrate rhel7.6 --live qemu+ssh://hp-dl385pg8-07.lab.eng.pek2.redhat.com/system --verbose 
Migration: [100 %]

[source]event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'lifecycle' for domain rhel7.6: Suspended Migrated
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Stopped Migrated
event 'job-completed' for domain rhel7.6:
    operation: 5
    time_elapsed: 4484
    time_elapsed_net: 4482
    downtime: 284
    downtime_net: 282
    setup_time: 12
    data_total: 1208827904
    data_processed: 320227350
    data_remaining: 0
    memory_total: 1208827904
    memory_processed: 320227350
    memory_remaining: 0
    memory_bps: 112695300
    memory_constant: 223876
    memory_normal: 77537
    memory_normal_bytes: 317591552
    memory_dirty_rate: 0
    memory_iteration: 3
    memory_page_size: 4096
    disk_total: 0
    disk_processed: 0
    disk_remaining: 0

2018-11-10 06:29:09.917+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration

[target]
event 'lifecycle' for domain rhel7.6: Started Migrated
event 'lifecycle' for domain rhel7.6: Resumed Migrated

2018-11-10 06:29:10.189+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migrated', event detail 1

7. virsh migrate rhel7.6 --live qemu+ssh://10.66.7.215/system --verbose --unsafe --postcopy --postcopy-after-precopy
Migration: [100 %]

[source]
event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'lifecycle' for domain rhel7.6: Suspended Migrated
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Stopped Migrated
event 'job-completed' for domain rhel7.6:
    operation: 5
    time_elapsed: 3380
    time_elapsed_net: 3377
    downtime: 142
    downtime_net: 139
    setup_time: 22
    data_total: 1208827904
    data_processed: 305071970
    data_remaining: 0
    memory_total: 1208827904
    memory_processed: 305071970
    memory_remaining: 0
    memory_bps: 100481212
    memory_constant: 223488
    memory_normal: 73845
    memory_normal_bytes: 302469120
    memory_dirty_rate: 0
    memory_iteration: 3
    memory_page_size: 4096
    disk_total: 0
    disk_processed: 0
    disk_remaining: 0

2018-11-10 06:34:55.609+0000: 29819: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration

[target]
event 'lifecycle' for domain rhel7.6: Started Migrated
event 'lifecycle' for domain rhel7.6: Resumed Migrated

2018-11-10 06:34:55.743+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migrated', event detail 1

8.Cancel migration when source guest paused, refer to https://bugzilla.redhat.com/show_bug.cgi?id=1634758#c6
[source] 
event 'migration-iteration' for domain rhel7.6: iteration: '1'
event 'migration-iteration' for domain rhel7.6: iteration: '2'
event 'lifecycle' for domain rhel7.6: Suspended Migrated
event 'migration-iteration' for domain rhel7.6: iteration: '3'
event 'lifecycle' for domain rhel7.6: Resumed Migrated

2018-11-10 07:30:10.198+0000: 19726: debug : qemuProcessHandleStop:649 : Transitioned guest rhel7.6 to paused state, reason migration
2018-11-10 07:31:11.346+0000: 19726: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'migration canceled', event detail 1

[target]
event 'lifecycle' for domain rhel7.6: Started Migrated
event 'lifecycle' for domain rhel7.6: Resumed Post-copy
event 'lifecycle' for domain rhel7.6: Stopped Failed

2018-11-10 07:30:10.752+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest rhel7.6 into running state, reason 'post-copy', event detail 3

Comment 11 Yanqiu Zhang 2018-11-10 10:49:33 UTC
Hi jiri,

Pls help check latest test results as above. And I have two questions:
1.In Scenario3 step7, not see the 'Resumed Post-copy' event and "reason 'post-copy'" in log. But it appears in step8-'Cancel migration when source guest paused'. Is there sth we should pay attention to when test it?
2.Except migration, other paused state show 'reason unknown' in log. do we have plan to fix it?

Thank you.

Comment 12 Yanqiu Zhang 2018-11-12 04:50:09 UTC
A bug exists for Scenario3 step7: https://bugzilla.redhat.com/show_bug.cgi?id=1647365

Comment 13 Yanqiu Zhang 2018-11-12 08:37:48 UTC
Additional scenario:
1. watchdog paused and resume
(1)Config guest xml with:
 <watchdog model='i6300esb' action='pause'/>

(2)Install watchdog in guest os and set its /etc/watchdog as:
interval = 130
watchdog-device = /dev/watchdog

(3)In guest, start watchdog process, check event and log:
[in guest]# watchdog -f

event 'watchdog' for domain yan-V: pause
event 'lifecycle' for domain yan-V: Suspended Watchdog

2018-11-12 07:50:54.133+0000: 29819: debug : qemuProcessHandleWatchdog:788 : Transitioned guest yan-V to paused state due to watchdog

(4)virsh domstate yan-V --reason
 paused (watchdog)

(5)# virsh resume yan-V
event 'lifecycle' for domain yan-V: Resumed Unpaused

2018-11-12 07:53:56.465+0000: 29819: debug : qemuProcessHandleResume:707 : Transitioned guest yan-V into running state, reason 'unpaused', event detail 0

# virsh domstate yan-V --reason
running (unpaused)

Guest os works well.

2. watchdog reset
 Re-edit guest with 'reset' action and do above steps again:
<watchdog model='i6300esb' action='reset'/>

event 'watchdog' for domain yan-V: reset
event 'reboot' for domain yan-V
event 'reboot' for domain yan-V

# virsh domstate yan-V --reason
running (booted)

Guest os rebooted and works well.

Comment 14 Jiri Denemark 2018-11-13 12:23:51 UTC
(In reply to yanqzhan from comment #11)
> 1.In Scenario3 step7, not see the 'Resumed Post-copy' event and "reason
> 'post-copy'" in log. But it appears in step8-'Cancel migration when source
> guest paused'. Is there sth we should pay attention to when test it?

You didn't run virsh event with --timestamp which makes it hard to see what
was happening without logs. But since I do see the 'Resumed Post-copy' on the
destination, I suppose the migration was just too fast and did not switch to
post-copy mode. In other words, your comment #12 is correct and this is
covered by bug 1647365.

> 2.Except migration, other paused state show 'reason unknown' in log. do we
> have plan to fix it?

Perhaps, but not within this bug. This is about fixing resume events. Please,
file a new bug to track this.

Comment 16 errata-xmlrpc 2018-11-27 01:22:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3669