Bug 1647365
| Summary: | VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration | ||||||||
|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Fangge Jin <fjin> | ||||||
| Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||||
| Status: | CLOSED ERRATA | QA Contact: | Fangge Jin <fjin> | ||||||
| Severity: | high | Docs Contact: | |||||||
| Priority: | high | ||||||||
| Version: | 7.6 | CC: | fjin, jdenemar, lizhu, mtessun, xuzhang, yafu, yalzhang, yanqzhan | ||||||
| Target Milestone: | rc | Keywords: | Reopened, ZStream | ||||||
| Target Release: | --- | ||||||||
| Hardware: | Unspecified | ||||||||
| OS: | Unspecified | ||||||||
| Whiteboard: | |||||||||
| Fixed In Version: | libvirt-4.5.0-11.el7 | Doc Type: | If docs needed, set a value | ||||||
| Doc Text: | Story Points: | --- | |||||||
| Clone Of: | |||||||||
| : | 1649169 1654732 (view as bug list) | Environment: | |||||||
| Last Closed: | 2019-08-06 13:14:02 UTC | Type: | Bug | ||||||
| Regression: | --- | Mount Type: | --- | ||||||
| Documentation: | --- | CRM: | |||||||
| Verified Versions: | Category: | --- | |||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||
| Embargoed: | |||||||||
| Bug Depends On: | |||||||||
| Bug Blocks: | 1649160, 1649169, 1651787, 1654732 | ||||||||
| Attachments: |
|
||||||||
Openstack depends on this event: https://github.com/openstack/nova/commit/1f48d3d83b4d5f6f9cd96ee06d2fc005635c1ff9 Also this event should be on target host: VIR_DOMAIN_EVENT_RESUMED_POSTCOPY Requesting post-copy migration with --postcopy-after-precopy option or calling
virDomainMigrateStartPostCopy when appropriate does not guarantee the
migration will actually switch to post-copy. If the migration is fast enough,
it will just finish before it's switched to post-copy.
And this is exactly what happened here. At the time libvirt asked QEMU to
switch to post-copy mode, the second iteration of pre-copy migration was
already done and QEMU paused virtual CPUs to migrate device state and the
small part of memory which changed during the second iteration.
From the source log file:
{"timestamp": {"seconds": 1541587993, "microseconds": 568119}, "event": "MIGRATION_PASS", "data": {"pass": 1}}
{"timestamp": {"seconds": 1541587993, "microseconds": 568262}, "event": "MIGRATION", "data": {"status": "active"}}
{"timestamp": {"seconds": 1541587993, "microseconds": 784200}, "event": "MIGRATION_PASS", "data": {"pass": 2}}
{"timestamp": {"seconds": 1541587993, "microseconds": 786788}, "event": "STOP"}
{"timestamp": {"seconds": 1541587993, "microseconds": 789853}, "event": "MIGRATION", "data": {"status": "pre-switchover"}}
{"timestamp": {"seconds": 1541587993, "microseconds": 798859}, "event": "MIGRATION", "data": {"status": "device"}}
{"timestamp": {"seconds": 1541587993, "microseconds": 800595}, "event": "MIGRATION_PASS", "data": {"pass": 3}}
{"timestamp": {"seconds": 1541587994, "microseconds": 46724}, "event": "MIGRATION", "data": {"status": "completed"}}
as you can see the migration took less than 500 ms. And the window in which
the migration can be switched to post-copy mode using
--postcopy-after-precopy, i.e., between pass 2 and STOP is just 2.5 ms.
If you want to test post-copy migration you need to make sure the domain
changes a lot of memory pages during migration. Otherwise the second iteration
will be too fast.
BTW, using --timestamp option for virsh event would make it more obvious that
the relevant events were sent within a very short time period without having
to look at the log files.
I tried again with the following steps. Migration enters postcopy phase indeed, there is VIR_DOMAIN_EVENT_RESUMED_POSTCOPY on target host, but still no VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event. Steps: 1. Monitor domain events on both hosts: # virsh event --all --loop --timestamp 2. Start guest and launch stress in guest: # stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M 3. Do live migration with --postcopy and switch to postcopy: Terminal 1: # virsh migrate-setspeed avocado-vt-vm1 1; virsh migrate avocado-vt-vm1 qemu+ssh://intel-e52650-16-2.englab.nay.redhat.com/system --live --verbose --postcopy Terminal 2: # virsh migrate-postcopy avocado-vt-vm1; date Thu Nov 8 01:24:10 EST 2018 4. Check virsh events output: 1) Src: 2018-11-08 06:24:06.944+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1' 2018-11-08 06:24:10.624+0000: event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated 2018-11-08 06:24:10.648+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2' 2018-11-08 06:24:19.713+0000: event 'lifecycle' for domain avocado-vt-vm1: Stopped Migrated 2018-11-08 06:24:19.713+0000: event 'job-completed' for domain avocado-vt-vm1: operation: 5 time_elapsed: 12885 time_elapsed_net: 12882 downtime: 414 downtime_net: 411 setup_time: 32 data_total: 1078796288 data_processed: 3949656 data_remaining: 1074262016 memory_total: 1078796288 memory_processed: 3949656 memory_remaining: 1074262016 memory_bps: 1067220 memory_constant: 145 memory_normal: 962 memory_normal_bytes: 3940352 memory_dirty_rate: 0 memory_iteration: 1 memory_page_size: 4096 disk_total: 0 disk_processed: 0 disk_remaining: 0 2) Target: 2018-11-08 06:24:06.835+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started' 2018-11-08 06:24:06.872+0000: event 'lifecycle' for domain avocado-vt-vm1: Started Migrated 2018-11-08 06:24:10.637+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'channel event' 2018-11-08 06:24:10.979+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused 2018-11-08 06:24:11.036+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Post-copy 2018-11-08 06:24:19.487+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Migrated 2018-11-08 06:24:19.500+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event' Created attachment 1503213 [details]
libvirtd log on both source and target hosts
This is a bug introduced by the following commit when libvirt talks to
QEMU >= 2.11. Libvirt older then 3.9.0 works fine even with new QEMU and
current libvirt works fine with QEMU older than 2.11.
Post-copy migration finishes normally (unless some other unrelated error
occurs, of course), but some details are wrong:
- the suspended event sent by the source libvirtd is incorrect
- migration statistics are not updated at the end of migration
- if migration failed in post-copy mode for some reason, the source libvirtd
would happily try to resume the domain without knowing its state is
incomplete
commit 32c29f10db3f7e847508564d8d91848f41f43abb
Refs: v3.8.0-245-g32c29f10db
Author: Jiri Denemark <jdenemar>
AuthorDate: Fri Oct 20 10:12:21 2017 +0200
Commit: Jiri Denemark <jdenemar>
CommitDate: Thu Oct 26 10:36:02 2017 +0200
qemu: Enabled pause-before-switchover migration capability
QEMU identified a race condition between the device state serialization
and the end of storage migration. Both QEMU and libvirt needs to be
updated to fix this.
Our migration work flow is modified so that after starting the migration
we to wait for QEMU to enter "pre-switchover", "postcopy-active", or
"completed" state. Once there, we cancel all block jobs as usual. But if
QEMU is in "pre-switchover", we need to resume the migration afterwards
and wait again for the real end (either "postcopy-active" or
"completed" state).
Old QEMU will just enter either "postcopy-active" or "completed"
directly, which is still correctly handled even by new libvirt. The
"pre-switchover" state will only be entered if QEMU supports it and the
pause-before-switchover capability was enabled. Thus all combinations of
libvirt and QEMU will work, but only new QEMU with new libvirt will
avoid the race condition.
Signed-off-by: Jiri Denemark <jdenemar>
Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00678.html This bug is now fixed upstream by:
commit eca9d21e6cc8129ec4426fbf1ace30e215b9cfbc
Refs: v4.10.0-rc1-4-geca9d21e6c
Author: Jiri Denemark <jdenemar>
AuthorDate: Thu Nov 15 11:16:43 2018 +0100
Commit: Jiri Denemark <jdenemar>
CommitDate: Thu Nov 29 11:36:15 2018 +0100
qemu: Fix post-copy migration on the source
Post-copy migration has been broken on the source since commit
v3.8.0-245-g32c29f10db which implemented support for
pause-before-switchover QEMU migration capability.
Even though the migration itself went well, the source did not really
know when it switched to the post-copy mode despite the messages logged
by MIGRATION event handler. As a result of this, the events emitted by
source libvirtd were not accurate and statistics of the completed
migration would cover only the pre-copy part of migration. Moreover, if
migration failed during the post-copy phase for some reason, the source
libvirtd would just happily resume the domain, which could lead to disk
corruption.
With the pause-before-switchover capability enabled, the order of events
emitted by QEMU changed:
pause-before-switchover
disabled enabled
MIGRATION, postcopy-active STOP
STOP MIGRATION, pre-switchover
MIGRATION, postcopy-active
The STOP even handler checks the migration status (postcopy-active) and
sets the domain state accordingly. Which is sufficient when
pause-before-switchover is disabled, but once we enable it, the
migration status is still active when we get STOP from QEMU. Thus the
domain state set in the STOP handler has to be corrected once we are
notified that migration changed to postcopy-active.
This results in two SUSPENDED events to be emitted by the source
libvirtd during post-copy migration. The first one with
VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED detail, while the second one reports
the corrected VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY detail. This is
inevitable because we don't know whether migration will eventually
switch to post-copy at the time we emit the first event.
https://bugzilla.redhat.com/show_bug.cgi?id=1647365
Signed-off-by: Jiri Denemark <jdenemar>
Reviewed-by: Ján Tomko <jtomko>
Verified with libvirt-4.5.0-12.virtcov.el7.x86_64 qemu-kvm-rhev-2.12.0-26.el7.x86_64 Steps are same as bug 1649169 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2294 |
Created attachment 1502926 [details] libvirtd log on both source and target hosts Description of problem: VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration Version-Release number of selected component (if applicable): libvirt-4.5.0-10.el7_6.2.x86_64 qemu-kvm-rhev-2.12.0-18.el7.x86_64 How reproducible: 100% Steps to Reproduce: 1.Monitor domain events on both source and target hosts # virsh event --all --loop 2.Do postcopy live migration: # virsh migrate avocado-vt-vm1 qemu+ssh://192.168.122.225/system --live --verbose --postcopy --postcopy-after-precopy Migration: [100 %] 3.Check domain events output: 1)Source: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event' event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1' event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2' event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated event 'migration-iteration' for domain avocado-vt-vm1: iteration: '3' event 'lifecycle' for domain avocado-vt-vm1: Stopped Migrated event 'job-completed' for domain avocado-vt-vm1: 2) Target: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started' event 'lifecycle' for domain avocado-vt-vm1: Started Migrated event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'channel event' event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused event 'lifecycle' for domain avocado-vt-vm1: Resumed Migrated event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event' Actual results: As step 3, there is no "Suspended Postcopy" event on source host, but there is a "Suspended Migrated" event Expected results: "Suspended Migrated" event is emitted when do postcopy migration. Additional info: