Bug 1647365 - VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration
Summary: VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: libvirt
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Jiri Denemark
QA Contact: Fangge Jin
URL:
Whiteboard:
Depends On:
Blocks: 1649160 1649169 1651787 1654732
 
Reported: 2018-11-07 10:59 UTC by Fangge Jin
Modified: 2019-08-06 13:14 UTC (History)
CC List: 8 users

Fixed In Version: libvirt-4.5.0-11.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1649169 1654732 (view as bug list)
Environment:
Last Closed: 2019-08-06 13:14:02 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
libvirtd log on both source and target hosts (160.93 KB, application/x-bzip)
2018-11-07 10:59 UTC, Fangge Jin
no flags Details
libvirtd log on both source and target hosts (407.35 KB, application/x-bzip)
2018-11-08 06:52 UTC, Fangge Jin
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2019:2294 0 None None None 2019-08-06 13:14:45 UTC

Description Fangge Jin 2018-11-07 10:59:52 UTC
Created attachment 1502926 [details]
libvirtd log on both source and target hosts

Description of problem:
VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration

Version-Release number of selected component (if applicable):
libvirt-4.5.0-10.el7_6.2.x86_64
qemu-kvm-rhev-2.12.0-18.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1.Monitor domain events on both source and target hosts
# virsh event --all --loop

2.Do postcopy live migration:
# virsh migrate avocado-vt-vm1 qemu+ssh://192.168.122.225/system --live --verbose --postcopy --postcopy-after-precopy
Migration: [100 %]


3.Check domain events output:
1)Source:
event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event'
event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1'
event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2'
event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated
event 'migration-iteration' for domain avocado-vt-vm1: iteration: '3'
event 'lifecycle' for domain avocado-vt-vm1: Stopped Migrated
event 'job-completed' for domain avocado-vt-vm1:


2) Target:
event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started'
event 'lifecycle' for domain avocado-vt-vm1: Started Migrated
event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'channel event'
event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused
event 'lifecycle' for domain avocado-vt-vm1: Resumed Migrated
event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event'


Actual results:
As shown in step 3, there is no "Suspended Postcopy" event on the source host, only a "Suspended Migrated" event

Expected results:
"Suspended Postcopy" event is emitted when doing post-copy migration.

Additional info:

Comment 2 Fangge Jin 2018-11-07 11:05:54 UTC
Openstack depends on this event: https://github.com/openstack/nova/commit/1f48d3d83b4d5f6f9cd96ee06d2fc005635c1ff9
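
For context, management applications receive this as a libvirt lifecycle event carrying a detail code. Below is a minimal sketch of the dispatch a Nova-style callback performs; the function name is hypothetical, and the enum values are copied from libvirt.h so the snippet runs without the libvirt Python bindings installed:

```python
# Numeric values copied from libvirt's virDomainEventType and
# virDomainEventSuspendedDetailType enums (libvirt.h); hardcoded here so
# the sketch runs without the libvirt Python bindings.
VIR_DOMAIN_EVENT_SUSPENDED = 3
VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED = 1
VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY = 7

def classify_suspend(event, detail):
    """What a Nova-style lifecycle callback would conclude from (event, detail)."""
    if event != VIR_DOMAIN_EVENT_SUSPENDED:
        return None
    if detail == VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY:
        # Nova treats this as "the migration switched to post-copy"
        return "switch-to-postcopy"
    if detail == VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED:
        return "pre-copy-suspend"
    return "other-suspend"
```

Since the source never emits the POSTCOPY detail here, such a callback can never take its post-copy branch.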

Comment 3 Fangge Jin 2018-11-07 11:15:44 UTC
Also, this event should be emitted on the target host:
VIR_DOMAIN_EVENT_RESUMED_POSTCOPY

Comment 4 Jiri Denemark 2018-11-07 12:42:07 UTC
Requesting post-copy migration with --postcopy-after-precopy option or calling
virDomainMigrateStartPostCopy when appropriate does not guarantee the
migration will actually switch to post-copy. If the migration is fast enough,
it will just finish before it's switched to post-copy.

And this is exactly what happened here. At the time libvirt asked QEMU to
switch to post-copy mode, the second iteration of pre-copy migration was
already done and QEMU paused virtual CPUs to migrate device state and the
small part of memory which changed during the second iteration.

From the source log file:

{"timestamp": {"seconds": 1541587993, "microseconds": 568119}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
{"timestamp": {"seconds": 1541587993, "microseconds": 568262}, "event": "MIGRATION", "data": {"status": "active"}}
{"timestamp": {"seconds": 1541587993, "microseconds": 784200}, "event": "MIGRATION_PASS", "data": {"pass": 2}}
{"timestamp": {"seconds": 1541587993, "microseconds": 786788}, "event": "STOP"}
{"timestamp": {"seconds": 1541587993, "microseconds": 789853}, "event": "MIGRATION", "data": {"status": "pre-switchover"}}
{"timestamp": {"seconds": 1541587993, "microseconds": 798859}, "event": "MIGRATION", "data": {"status": "device"}}
{"timestamp": {"seconds": 1541587993, "microseconds": 800595}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
{"timestamp": {"seconds": 1541587994, "microseconds":  46724}, "event": "MIGRATION", "data": {"status": "completed"}}

As you can see, the migration took less than 500 ms, and the window in which
the migration can be switched to post-copy mode using
--postcopy-after-precopy, i.e., between pass 2 and STOP, is just 2.5 ms.
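
Those numbers can be recomputed directly from the quoted timestamps; a quick pure-Python sketch, with the values copied from the log excerpt above:

```python
# Recompute the timing from the QMP events quoted above.
events = {
    "pass1":     (1541587993, 568119),
    "pass2":     (1541587993, 784200),
    "stop":      (1541587993, 786788),
    "completed": (1541587994,  46724),
}

def usec(ts):
    """Convert a (seconds, microseconds) pair to total microseconds."""
    sec, micro = ts
    return sec * 1_000_000 + micro

# Window in which --postcopy-after-precopy could still trigger the switch:
window_ms = (usec(events["stop"]) - usec(events["pass2"])) / 1000
# Total migration time from the first pass to completion:
total_ms = (usec(events["completed"]) - usec(events["pass1"])) / 1000

print(window_ms)  # 2.588
print(total_ms)   # 478.605, i.e. well under 500 ms
```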

If you want to test post-copy migration you need to make sure the domain
changes a lot of memory pages during migration. Otherwise the second iteration
will be too fast.

BTW, using --timestamp option for virsh event would make it more obvious that
the relevant events were sent within a very short time period without having
to look at the log files.

Comment 5 Fangge Jin 2018-11-08 06:48:18 UTC
I tried again with the following steps. The migration indeed enters the post-copy phase, and there is a VIR_DOMAIN_EVENT_RESUMED_POSTCOPY event on the target host, but still no VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event.

Steps:
1. Monitor domain events on both hosts:
# virsh event --all --loop --timestamp

2. Start guest and launch stress in guest:
# stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M 

3. Do live migration with --postcopy and switch to postcopy:
Terminal 1: # virsh migrate-setspeed avocado-vt-vm1 1; virsh migrate avocado-vt-vm1 qemu+ssh://intel-e52650-16-2.englab.nay.redhat.com/system --live --verbose --postcopy

Terminal 2: # virsh migrate-postcopy avocado-vt-vm1; date

Thu Nov  8 01:24:10 EST 2018

4. Check virsh events output:
1) Src:
2018-11-08 06:24:06.944+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1'
2018-11-08 06:24:10.624+0000: event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated
2018-11-08 06:24:10.648+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2'
2018-11-08 06:24:19.713+0000: event 'lifecycle' for domain avocado-vt-vm1: Stopped Migrated
2018-11-08 06:24:19.713+0000: event 'job-completed' for domain avocado-vt-vm1:
	operation: 5
	time_elapsed: 12885
	time_elapsed_net: 12882
	downtime: 414
	downtime_net: 411
	setup_time: 32
	data_total: 1078796288
	data_processed: 3949656
	data_remaining: 1074262016
	memory_total: 1078796288
	memory_processed: 3949656
	memory_remaining: 1074262016
	memory_bps: 1067220
	memory_constant: 145
	memory_normal: 962
	memory_normal_bytes: 3940352
	memory_dirty_rate: 0
	memory_iteration: 1
	memory_page_size: 4096
	disk_total: 0
	disk_processed: 0
	disk_remaining: 0
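
The counters above are internally consistent (memory_normal times memory_page_size equals memory_normal_bytes), yet almost the entire guest RAM is reported as remaining, because the statistics stop being updated once post-copy starts, as analyzed in comment 7. A quick check, with the values copied from the output above:

```python
# Cross-check the job-completed counters quoted above.
memory_normal = 962            # whole pages transferred
memory_page_size = 4096        # bytes per page
memory_normal_bytes = 3940352  # bytes

assert memory_normal * memory_page_size == memory_normal_bytes

# Yet data_processed covers only a tiny fraction of the guest RAM,
# because the counters freeze at the switch to post-copy:
data_total = 1078796288
data_processed = 3949656
print(f"{100 * data_processed / data_total:.2f}% accounted for")  # 0.37%
```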

2) Target:
2018-11-08 06:24:06.835+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started'
2018-11-08 06:24:06.872+0000: event 'lifecycle' for domain avocado-vt-vm1: Started Migrated
2018-11-08 06:24:10.637+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'channel event'
2018-11-08 06:24:10.979+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused
2018-11-08 06:24:11.036+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Post-copy
2018-11-08 06:24:19.487+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Migrated
2018-11-08 06:24:19.500+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event'

Comment 6 Fangge Jin 2018-11-08 06:52:12 UTC
Created attachment 1503213 [details]
libvirtd log on both source and target hosts

Comment 7 Jiri Denemark 2018-11-14 15:33:54 UTC
This is a bug introduced by the following commit when libvirt talks to
QEMU >= 2.11. Libvirt older than 3.9.0 works fine even with new QEMU, and
current libvirt works fine with QEMU older than 2.11.

Post-copy migration finishes normally (unless some other unrelated error
occurs, of course), but some details are wrong:
- the suspended event sent by the source libvirtd is incorrect
- migration statistics are not updated at the end of migration
- if migration failed in post-copy mode for some reason, the source libvirtd
  would happily try to resume the domain without knowing its state is
  incomplete


commit 32c29f10db3f7e847508564d8d91848f41f43abb
Refs: v3.8.0-245-g32c29f10db
Author:     Jiri Denemark <jdenemar>
AuthorDate: Fri Oct 20 10:12:21 2017 +0200
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Oct 26 10:36:02 2017 +0200

    qemu: Enabled pause-before-switchover migration capability

    QEMU identified a race condition between the device state serialization
    and the end of storage migration. Both QEMU and libvirt need to be
    updated to fix this.

    Our migration work flow is modified so that after starting the migration
    we wait for QEMU to enter "pre-switchover", "postcopy-active", or
    "completed" state. Once there, we cancel all block jobs as usual. But if
    QEMU is in "pre-switchover", we need to resume the migration afterwards
    and wait again for the real end (either "postcopy-active" or
    "completed" state).

    Old QEMU will just enter either "postcopy-active" or "completed"
    directly, which is still correctly handled even by new libvirt. The
    "pre-switchover" state will only be entered if QEMU supports it and the
    pause-before-switchover capability was enabled. Thus all combinations of
    libvirt and QEMU will work, but only new QEMU with new libvirt will
    avoid the race condition.

    Signed-off-by: Jiri Denemark <jdenemar>

Comment 8 Jiri Denemark 2018-11-16 13:17:22 UTC
Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00678.html

Comment 11 Jiri Denemark 2018-11-29 10:54:16 UTC
This bug is now fixed upstream by:

commit eca9d21e6cc8129ec4426fbf1ace30e215b9cfbc
Refs: v4.10.0-rc1-4-geca9d21e6c
Author:     Jiri Denemark <jdenemar>
AuthorDate: Thu Nov 15 11:16:43 2018 +0100
Commit:     Jiri Denemark <jdenemar>
CommitDate: Thu Nov 29 11:36:15 2018 +0100

    qemu: Fix post-copy migration on the source

    Post-copy migration has been broken on the source since commit
    v3.8.0-245-g32c29f10db which implemented support for
    pause-before-switchover QEMU migration capability.

    Even though the migration itself went well, the source did not really
    know when it switched to the post-copy mode despite the messages logged
    by MIGRATION event handler. As a result of this, the events emitted by
    source libvirtd were not accurate and statistics of the completed
    migration would cover only the pre-copy part of migration. Moreover, if
    migration failed during the post-copy phase for some reason, the source
    libvirtd would just happily resume the domain, which could lead to disk
    corruption.

    With the pause-before-switchover capability enabled, the order of events
    emitted by QEMU changed:

                   pause-before-switchover
            disabled                      enabled
        MIGRATION, postcopy-active       STOP
        STOP                             MIGRATION, pre-switchover
                                         MIGRATION, postcopy-active

    The STOP event handler checks the migration status (postcopy-active) and
    sets the domain state accordingly. Which is sufficient when
    pause-before-switchover is disabled, but once we enable it, the
    migration status is still active when we get STOP from QEMU. Thus the
    domain state set in the STOP handler has to be corrected once we are
    notified that migration changed to postcopy-active.

    This results in two SUSPENDED events to be emitted by the source
    libvirtd during post-copy migration. The first one with
    VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED detail, while the second one reports
    the corrected VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY detail. This is
    inevitable because we don't know whether migration will eventually
    switch to post-copy at the time we emit the first event.

    https://bugzilla.redhat.com/show_bug.cgi?id=1647365

    Signed-off-by: Jiri Denemark <jdenemar>
    Reviewed-by: Ján Tomko <jtomko>

Comment 14 Fangge Jin 2019-04-16 05:53:35 UTC
Verified with 
libvirt-4.5.0-12.virtcov.el7.x86_64
qemu-kvm-rhev-2.12.0-26.el7.x86_64


Steps are same as bug 1649169

Comment 16 errata-xmlrpc 2019-08-06 13:14:02 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2019:2294

