Bug 1791458

Summary: VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event is emitted for incoming migration
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Jiri Denemark <jdenemar>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA
QA Contact: yafu <yafu>
Severity: high
Priority: unspecified
Version: 8.1
CC: aefrat, bzlotnik, chhu, dyuan, fjin, gwatson, jdenemar, lmen, lsurette, michal.skrivanek, mzamazal, rhodain, srevivo, tnisan, xuzhang, yafu, yanqzhan, ycui
Target Milestone: rc
Keywords: Upstream
Target Release: 8.2
Hardware: Unspecified
OS: Unspecified
Fixed In Version: libvirt-6.0.0-2.el8
Clone Of: 1774230
: 1791886
Last Closed: 2020-05-05 09:55:51 UTC
Type: Bug
Bug Blocks: 1774230, 1791886

Description Jiri Denemark 2020-01-15 21:11:54 UTC
+++ This bug was initially created as a clone of Bug #1774230 +++

Description of problem:

During post-copy migration, libvirtd on the destination host unexpectedly emits
a VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY lifecycle event just before resuming the
domain in post-copy mode.

Version-Release number of selected component (if applicable):

Any libvirt version since RHEL 7.7 (bug 1647365):
libvirt-4.5.0-23.el7_7.5
libvirt-4.5.0-31.el7
libvirt-4.5.0-35.2.el8
libvirt-5.6.0-10.el8
libvirt-6.0.0-1.el8

How reproducible:

100%

Steps to Reproduce:

1. start a new domain on a source host
2. make the domain dirty its memory (e.g., by running the stress command):
    stress --vm 2 --vm-bytes 512M
3. start watching for lifecycle events on a destination host:
    virsh event --event lifecycle --loop --timestamp
4. migrate the domain from the source host to the destination host:
    virsh migrate --p2p --live --postcopy --postcopy-after-precopy $DOM $DEST_URI
5. check the lifecycle events reported on the destination during migration

Actual results:

2020-01-15 14:20:26.689+0000: event 'lifecycle' for domain nest: Started Migrated
2020-01-15 14:21:01.837+0000: event 'lifecycle' for domain nest: Suspended Post-copy
2020-01-15 14:21:03.266+0000: event 'lifecycle' for domain nest: Resumed Post-copy
2020-01-15 14:21:32.060+0000: event 'lifecycle' for domain nest: Resumed Migrated

Expected results:

2020-01-15 14:28:53.803+0000: event 'lifecycle' for domain nest: Started Migrated
2020-01-15 14:28:56.156+0000: event 'lifecycle' for domain nest: Resumed Post-copy
2020-01-15 14:28:56.258+0000: event 'lifecycle' for domain nest: Resumed Migrated

In other words, no "Suspended Post-copy" event should be reported.
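
The comparison above can also be done mechanically. The following is a minimal sketch, assuming the events were captured as text from the `virsh event --event lifecycle --loop --timestamp` command in step 3 and that the lines have the format quoted in this report. The helper names (`parse_lifecycle_events`, `suspended_postcopy_events`) are hypothetical and not part of libvirt or virsh.

```python
import re

# Line format printed by `virsh event --event lifecycle --loop --timestamp`,
# as seen in the output quoted in this report.
EVENT_RE = re.compile(
    r"^\S+ \S+: event 'lifecycle' for domain (?P<domain>\S+): "
    r"(?P<event>\S+) (?P<detail>.+)$"
)


def parse_lifecycle_events(text):
    """Return (domain, event, detail) tuples for every lifecycle line in text."""
    events = []
    for line in text.splitlines():
        match = EVENT_RE.match(line.strip())
        if match:
            events.append(match.group("domain", "event", "detail"))
    return events


def suspended_postcopy_events(text):
    """Return the "Suspended Post-copy" events, which the destination should never see."""
    return [e for e in parse_lifecycle_events(text)
            if e[1:] == ("Suspended", "Post-copy")]


# Destination output copied from "Actual results" in this report.
ACTUAL = """\
2020-01-15 14:20:26.689+0000: event 'lifecycle' for domain nest: Started Migrated
2020-01-15 14:21:01.837+0000: event 'lifecycle' for domain nest: Suspended Post-copy
2020-01-15 14:21:03.266+0000: event 'lifecycle' for domain nest: Resumed Post-copy
2020-01-15 14:21:32.060+0000: event 'lifecycle' for domain nest: Resumed Migrated
"""

# Destination output copied from "Expected results" in this report.
EXPECTED = """\
2020-01-15 14:28:53.803+0000: event 'lifecycle' for domain nest: Started Migrated
2020-01-15 14:28:56.156+0000: event 'lifecycle' for domain nest: Resumed Post-copy
2020-01-15 14:28:56.258+0000: event 'lifecycle' for domain nest: Resumed Migrated
"""

print("unexpected events (actual):  ", len(suspended_postcopy_events(ACTUAL)))
print("unexpected events (expected):", len(suspended_postcopy_events(EXPECTED)))
```

An empty list from `suspended_postcopy_events` on the destination output corresponds to the expected behaviour; any entry indicates the bug.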

Additional info:

This issue was nicely analyzed in the original bug 1774230:

--- Additional comment from Benny Zlotnik on 2020-01-15 09:02:25 UTC ---

Hi Jiri,

After investigating this bug and discussing the proposed patch with Milan,
something is still unclear. It seems that in post-copy migration both source
and destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY, and it probably
has something to do with the change from [1], as I see the following logs on
the destination:

2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1647 :
    Migration of domain 0x7fae6801c310 vmski changed state to post-copy-active
2020-01-15 08:17:54.802+0000: 17327: debug : qemuProcessHandleMigrationStatus:1663 :
    Correcting paused state reason for domain vmski to post-copy <--- I assume this emits the VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event
2020-01-15 08:17:55.045+0000: 17327: debug : qemuProcessHandleResume:719 :
    Transitioned guest vmski into running state, reason 'post-copy', event detail 3

Is this the correct behaviour? Should the destination receive VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY as well?

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1647365

--- Additional comment from Jiri Denemark on 2020-01-15 10:06:46 UTC ---

Your investigation seems to be correct. The domain is started as paused on the
destination with "migration" reason. Once migration switches to post-copy, the
code in qemuProcessHandleMigrationStatus will update the reason to "post-copy"
and emit a "suspended" event just a moment before the domain is resumed, which
should only happen on the source.
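
The sequence described above can be illustrated with a small model. This is only a sketch under stated assumptions, not libvirt code: the `Domain` class, the handler functions and the `only_on_source` flag are hypothetical stand-ins for the reason-correcting step in qemuProcessHandleMigrationStatus, and the actual condition used by libvirt is not shown in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Domain:
    """Hypothetical, simplified view of a domain on one side of a migration."""
    incoming: bool                 # True on the destination host
    state: str = "running"
    reason: str = "booted"
    events: list = field(default_factory=list)


def handle_stop(dom):
    # Source side: the guest is paused as part of the migration.
    dom.state, dom.reason = "paused", "migration"


def handle_migration_status(dom, status, only_on_source):
    """Model of the paused-reason correction when migration enters post-copy."""
    if status != "post-copy-active":
        return
    if dom.state != "paused" or dom.reason == "post-copy":
        return
    if only_on_source and dom.incoming:
        return  # assumed behaviour: the destination skips the correction
    dom.reason = "post-copy"
    dom.events.append("Suspended Post-copy")


def handle_resume(dom):
    dom.state, dom.reason = "running", "post-copy"
    dom.events.append("Resumed Post-copy")


def run_source(only_on_source):
    dom = Domain(incoming=False)
    handle_stop(dom)
    handle_migration_status(dom, "post-copy-active", only_on_source)
    return dom.events


def run_destination(only_on_source):
    # The destination domain starts paused with the "migration" reason.
    dom = Domain(incoming=True, state="paused", reason="migration")
    handle_migration_status(dom, "post-copy-active", only_on_source)
    handle_resume(dom)
    return dom.events


print("destination, unrestricted:", run_destination(only_on_source=False))
print("destination, source-only: ", run_destination(only_on_source=True))
print("source, source-only:      ", run_source(only_on_source=True))
```

In this model, restricting the correction to the outgoing side removes the "Suspended Post-copy" event from the destination while the source still reports it.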

Comment 1 Jiri Denemark 2020-01-16 13:18:53 UTC
Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2020-January/msg00732.html

Comment 2 Jiri Denemark 2020-01-16 14:40:28 UTC
This is now fixed upstream by

commit bd04d63ad97c21b6955710e6473a502f49816a3c
Refs: v6.0.0-23-gbd04d63ad9
Author:     Jiri Denemark <jdenemar>
AuthorDate: Wed Jan 15 15:24:55 2020 +0100
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Jan 16 15:12:19 2020 +0100

    qemu: Don't emit SUSPENDED_POSTCOPY event on destination

    When pause-before-switchover QEMU capability is enabled, we get STOP
    event before MIGRATION event with postcopy-active state. To properly
    handle post-copy migration and emit correct events commit
    v4.10.0-rc1-4-geca9d21e6c added a hack to
    qemuProcessHandleMigrationStatus which translates the paused state
    reason to VIR_DOMAIN_PAUSED_POSTCOPY and emits
    VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event when migration state changes
    to post-copy.

    However, the code was effective on both sides of migration resulting in
    a confusing VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event on the destination
    host, where entering post-copy mode is already properly advertised by
    VIR_DOMAIN_EVENT_RESUMED_POSTCOPY event.

    https://bugzilla.redhat.com/show_bug.cgi?id=1791458

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Ján Tomko <jtomko>

Comment 5 yafu 2020-02-11 03:53:42 UTC
Verified with libvirt-6.0.0-4.module+el8.2.0+5642+838f3513.x86_64.

Test steps are the same as in https://bugzilla.redhat.com/show_bug.cgi?id=1791886#c10 and https://bugzilla.redhat.com/show_bug.cgi?id=1791886#c11.

Comment 7 errata-xmlrpc 2020-05-05 09:55:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017