Bug 1647365
Summary: | VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | Fangge Jin <fjin> | ||||||
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> | ||||||
Status: | CLOSED ERRATA | QA Contact: | Fangge Jin <fjin> | ||||||
Severity: | high | Docs Contact: | |||||||
Priority: | high | ||||||||
Version: | 7.6 | CC: | fjin, jdenemar, lizhu, mtessun, xuzhang, yafu, yalzhang, yanqzhan | ||||||
Target Milestone: | rc | Keywords: | Reopened, ZStream | ||||||
Target Release: | --- | ||||||||
Hardware: | Unspecified | ||||||||
OS: | Unspecified | ||||||||
Whiteboard: | |||||||||
Fixed In Version: | libvirt-4.5.0-11.el7 | Doc Type: | If docs needed, set a value | ||||||
Doc Text: | Story Points: | --- | |||||||
Clone Of: | |||||||||
: | 1649169 1654732 (view as bug list) | Environment: | |||||||
Last Closed: | 2019-08-06 13:14:02 UTC | Type: | Bug | ||||||
Regression: | --- | Mount Type: | --- | ||||||
Documentation: | --- | CRM: | |||||||
Verified Versions: | Category: | --- | |||||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||
Cloudforms Team: | --- | Target Upstream Version: | |||||||
Embargoed: | |||||||||
Bug Depends On: | |||||||||
Bug Blocks: | 1649160, 1649169, 1651787, 1654732 | ||||||||
Attachments: |
|
Openstack depends on this event: https://github.com/openstack/nova/commit/1f48d3d83b4d5f6f9cd96ee06d2fc005635c1ff9 Also this event should be on target host: VIR_DOMAIN_EVENT_RESUMED_POSTCOPY Requesting post-copy migration with --postcopy-after-precopy option or calling virDomainMigrateStartPostCopy when appropriate does not guarantee the migration will actually switch to post-copy. If the migration is fast enough, it will just finish before it's switched to post-copy. And this is exactly what happened here. At the time libvirt asked QEMU to switch to post-copy mode, the second iteration of pre-copy migration was already done and QEMU paused virtual CPUs to migrate device state and the small part of memory which changed during the second iteration. From the source log file: {"timestamp": {"seconds": 1541587993, "microseconds": 568119}, "event": "MIGRATION_PASS", "data": {"pass": 1}} {"timestamp": {"seconds": 1541587993, "microseconds": 568262}, "event": "MIGRATION", "data": {"status": "active"}} {"timestamp": {"seconds": 1541587993, "microseconds": 784200}, "event": "MIGRATION_PASS", "data": {"pass": 2}} {"timestamp": {"seconds": 1541587993, "microseconds": 786788}, "event": "STOP"} {"timestamp": {"seconds": 1541587993, "microseconds": 789853}, "event": "MIGRATION", "data": {"status": "pre-switchover"}} {"timestamp": {"seconds": 1541587993, "microseconds": 798859}, "event": "MIGRATION", "data": {"status": "device"}} {"timestamp": {"seconds": 1541587993, "microseconds": 800595}, "event": "MIGRATION_PASS", "data": {"pass": 3}} {"timestamp": {"seconds": 1541587994, "microseconds": 46724}, "event": "MIGRATION", "data": {"status": "completed"}} as you can see the migration took less than 500 ms. And the window in which the migration can be switched to post-copy mode using --postcopy-after-precopy, i.e., between pass 2 and STOP is just 2.5 ms. If you want to test post-copy migration you need to make sure the domain changes a lot of memory pages during migration. Otherwise the second iteration will be too fast. BTW, using --timestamp option for virsh event would make it more obvious that the relevant events were sent within a very short time period without having to look at the log files. I tried again with the following steps. Migration enters postcopy phase indeed, there is VIR_DOMAIN_EVENT_RESUMED_POSTCOPY on target host, but still no VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY event. Steps: 1. Monitor domain events on both hosts: # virsh event --all --loop --timestamp 2. Start guest and launch stress in guest: # stress --cpu 8 --io 4 --vm 2 --vm-bytes 128M 3. Do live migration with --postcopy and switch to postcopy: Terminal 1: # virsh migrate-setspeed avocado-vt-vm1 1; virsh migrate avocado-vt-vm1 qemu+ssh://intel-e52650-16-2.englab.nay.redhat.com/system --live --verbose --postcopy Terminal 2: # virsh migrate-postcopy avocado-vt-vm1; date Thu Nov 8 01:24:10 EST 2018 4. Check virsh events output: 1) Src: 2018-11-08 06:24:06.944+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1' 2018-11-08 06:24:10.624+0000: event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated 2018-11-08 06:24:10.648+0000: event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2' 2018-11-08 06:24:19.713+0000: event 'lifecycle' for domain avocado-vt-vm1: Stopped Migrated 2018-11-08 06:24:19.713+0000: event 'job-completed' for domain avocado-vt-vm1: operation: 5 time_elapsed: 12885 time_elapsed_net: 12882 downtime: 414 downtime_net: 411 setup_time: 32 data_total: 1078796288 data_processed: 3949656 data_remaining: 1074262016 memory_total: 1078796288 memory_processed: 3949656 memory_remaining: 1074262016 memory_bps: 1067220 memory_constant: 145 memory_normal: 962 memory_normal_bytes: 3940352 memory_dirty_rate: 0 memory_iteration: 1 memory_page_size: 4096 disk_total: 0 disk_processed: 0 disk_remaining: 0 2) Target: 2018-11-08 06:24:06.835+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started' 2018-11-08 06:24:06.872+0000: event 'lifecycle' for domain avocado-vt-vm1: Started Migrated 2018-11-08 06:24:10.637+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'channel event' 2018-11-08 06:24:10.979+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused 2018-11-08 06:24:11.036+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Post-copy 2018-11-08 06:24:19.487+0000: event 'lifecycle' for domain avocado-vt-vm1: Resumed Migrated 2018-11-08 06:24:19.500+0000: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event' Created attachment 1503213 [details]
libvirtd log on both source and target hosts
This is a bug introduced by the following commit when libvirt talks to QEMU >= 2.11. Libvirt older then 3.9.0 works fine even with new QEMU and current libvirt works fine with QEMU older than 2.11. Post-copy migration finishes normally (unless some other unrelated error occurs, of course), but some details are wrong: - the suspended event sent by the source libvirtd is incorrect - migration statistics are not updated at the end of migration - if migration failed in post-copy mode for some reason, the source libvirtd would happily try to resume the domain without knowing its state is incomplete commit 32c29f10db3f7e847508564d8d91848f41f43abb Refs: v3.8.0-245-g32c29f10db Author: Jiri Denemark <jdenemar> AuthorDate: Fri Oct 20 10:12:21 2017 +0200 Commit: Jiri Denemark <jdenemar> CommitDate: Thu Oct 26 10:36:02 2017 +0200 qemu: Enabled pause-before-switchover migration capability QEMU identified a race condition between the device state serialization and the end of storage migration. Both QEMU and libvirt needs to be updated to fix this. Our migration work flow is modified so that after starting the migration we to wait for QEMU to enter "pre-switchover", "postcopy-active", or "completed" state. Once there, we cancel all block jobs as usual. But if QEMU is in "pre-switchover", we need to resume the migration afterwards and wait again for the real end (either "postcopy-active" or "completed" state). Old QEMU will just enter either "postcopy-active" or "completed" directly, which is still correctly handled even by new libvirt. The "pre-switchover" state will only be entered if QEMU supports it and the pause-before-switchover capability was enabled. Thus all combinations of libvirt and QEMU will work, but only new QEMU with new libvirt will avoid the race condition. Signed-off-by: Jiri Denemark <jdenemar> Patch sent upstream for review: https://www.redhat.com/archives/libvir-list/2018-November/msg00678.html This bug is now fixed upstream by: commit eca9d21e6cc8129ec4426fbf1ace30e215b9cfbc Refs: v4.10.0-rc1-4-geca9d21e6c Author: Jiri Denemark <jdenemar> AuthorDate: Thu Nov 15 11:16:43 2018 +0100 Commit: Jiri Denemark <jdenemar> CommitDate: Thu Nov 29 11:36:15 2018 +0100 qemu: Fix post-copy migration on the source Post-copy migration has been broken on the source since commit v3.8.0-245-g32c29f10db which implemented support for pause-before-switchover QEMU migration capability. Even though the migration itself went well, the source did not really know when it switched to the post-copy mode despite the messages logged by MIGRATION event handler. As a result of this, the events emitted by source libvirtd were not accurate and statistics of the completed migration would cover only the pre-copy part of migration. Moreover, if migration failed during the post-copy phase for some reason, the source libvirtd would just happily resume the domain, which could lead to disk corruption. With the pause-before-switchover capability enabled, the order of events emitted by QEMU changed: pause-before-switchover disabled enabled MIGRATION, postcopy-active STOP STOP MIGRATION, pre-switchover MIGRATION, postcopy-active The STOP even handler checks the migration status (postcopy-active) and sets the domain state accordingly. Which is sufficient when pause-before-switchover is disabled, but once we enable it, the migration status is still active when we get STOP from QEMU. Thus the domain state set in the STOP handler has to be corrected once we are notified that migration changed to postcopy-active. This results in two SUSPENDED events to be emitted by the source libvirtd during post-copy migration. The first one with VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED detail, while the second one reports the corrected VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY detail. This is inevitable because we don't know whether migration will eventually switch to post-copy at the time we emit the first event. https://bugzilla.redhat.com/show_bug.cgi?id=1647365 Signed-off-by: Jiri Denemark <jdenemar> Reviewed-by: Ján Tomko <jtomko> Verified with libvirt-4.5.0-12.virtcov.el7.x86_64 qemu-kvm-rhev-2.12.0-26.el7.x86_64 Steps are same as bug 1649169 Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2019:2294 |
Created attachment 1502926 [details] libvirtd log on both source and target hosts Description of problem: VIR_DOMAIN_EVENT_SUSPENDED_POSTCOPY is never used in postcopy migration Version-Release number of selected component (if applicable): libvirt-4.5.0-10.el7_6.2.x86_64 qemu-kvm-rhev-2.12.0-18.el7.x86_64 How reproducible: 100% Steps to Reproduce: 1.Monitor domain events on both source and target hosts # virsh event --all --loop 2.Do postcopy live migration: # virsh migrate avocado-vt-vm1 qemu+ssh://192.168.122.225/system --live --verbose --postcopy --postcopy-after-precopy Migration: [100 %] 3.Check domain events output: 1)Source: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event' event 'migration-iteration' for domain avocado-vt-vm1: iteration: '1' event 'migration-iteration' for domain avocado-vt-vm1: iteration: '2' event 'lifecycle' for domain avocado-vt-vm1: Suspended Migrated event 'migration-iteration' for domain avocado-vt-vm1: iteration: '3' event 'lifecycle' for domain avocado-vt-vm1: Stopped Migrated event 'job-completed' for domain avocado-vt-vm1: 2) Target: event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'domain started' event 'lifecycle' for domain avocado-vt-vm1: Started Migrated event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'disconnected' reason: 'channel event' event 'lifecycle' for domain avocado-vt-vm1: Resumed Unpaused event 'lifecycle' for domain avocado-vt-vm1: Resumed Migrated event 'agent-lifecycle' for domain avocado-vt-vm1: state: 'connected' reason: 'channel event' Actual results: As step 3, there is no "Suspended Postcopy" event on source host, but there is a "Suspended Migrated" event Expected results: "Suspended Migrated" event is emitted when do postcopy migration. Additional info: