Bug 690175
Summary: | [Libvirt][qemu-kvm] Split-brain when migrating vm and restarting libvirtd. | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | David Naori <dnaori> |
Component: | libvirt | Assignee: | Jiri Denemark <jdenemar> |
Status: | CLOSED ERRATA | QA Contact: | Virtualization Bugs <virt-bugs> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 6.1 | CC: | dallan, dnaori, dyuan, eblake, hateya, jhenner, mgoldboi, vbian, weizhan, yoyzhang |
Target Milestone: | rc | Keywords: | Regression |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | libvirt-0.9.4-0rc1.2.el6 | Doc Type: | Bug Fix |
Doc Text: |
If either the source or destination libvirtd was restarted while a QEMU domain was migrating, the migration was not properly canceled: a stale domain could be left on the target host and/or the domain could end up in an unexpected state on the source host. With this update, libvirtd tracks ongoing migrations in a persistent file and properly cancels them when it is restarted.
|
Story Points: | --- |
Clone Of: | Environment: | ||
Last Closed: | 2011-12-06 11:03:30 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | |||
Bug Blocks: | 727249 |
Description
David Naori
2011-03-23 14:12:09 UTC
*** Bug 689921 has been marked as a duplicate of this bug. ***

I haven't looked into this closely yet. We already know that aborting libvirtd during a migration is not necessarily the best action. Furthermore, it might not be possible to fix this without going to migration v3, where there is one additional round of handshaking between the source and destination (the current migration code uses v2, which has no way to feed back failure from the source to the destination, which explains why the destination is in a bad state when the source gave up). But I've got enough of a setup to try and reproduce this one locally this week, to see if I can come up with any quick fixes without having to go to migration v3.

More on migration v3 (still not upstream at this point...):
https://www.redhat.com/archives/libvir-list/2011-February/msg00259.html

QE managed to reproduce this bug with the following scenarios:

Scenario 1 -- single guest:
1. On the source: virsh migrate --p2p --live FC qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system
2. After ~2 seconds: /etc/init.d/libvirtd restart
3. On the destination: virsh destroy FC

Scenario 2 -- 256 guests:
1. On the source: virsh migrate --p2p --live FC qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system
2. After ~5 seconds: /etc/init.d/libvirtd restart
3. On the destination: virsh destroy FC

What I can tell here is that the host should be under stress, to make the migration run slowly; then it is easy to hit this bug.

I still haven't finished analyzing all aspects of restarting libvirtd during migration. However, I agree with Eric that migration v3 will help with detecting some migration failures and recovering from them. But we will definitely need more than migration v3 to support libvirtd restart during migration.
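The v2-vs-v3 distinction above can be sketched as a toy state machine. All names below are illustrative, not libvirt's actual RPC or API names: in v2 the destination never hears back from the source, so a source-side failure leaves a stale domain on the destination; v3's extra round of handshaking lets the destination learn of the failure and clean up.

```python
# Toy model of libvirt migration v2 vs v3 control flow.
# Phase and method names are simplified for illustration only.

class Host:
    """Minimal stand-in for a hypervisor host."""
    def __init__(self):
        self.domains = []
    def start_incoming(self):
        # Prepare phase: destination spawns a paused incoming guest.
        self.domains.append("incoming")
    def kill_incoming(self):
        self.domains.remove("incoming")
    def resume(self):
        pass
    def stop(self):
        pass

def migrate_v2(src, dst, perform_fails=False):
    """v2: Prepare -> Perform -> Finish. There is no feedback channel
    from the source back to the destination, so when Perform fails on
    the source, the destination keeps its stale copy (split-brain)."""
    dst.start_incoming()
    if perform_fails:
        src.resume()            # source gives up and resumes the guest...
        return "split-brain"    # ...but dst still holds the stale domain
    src.stop()
    dst.resume()
    return "migrated"

def migrate_v3(src, dst, perform_fails=False):
    """v3: an extra confirmation round-trip tells the destination about
    a source-side failure, so it can kill its stale copy."""
    dst.start_incoming()
    if perform_fails:
        dst.kill_incoming()     # destination learns of failure, cleans up
        src.resume()
        return "rolled-back"
    src.stop()
    dst.resume()
    return "migrated"
```

With `perform_fails=True`, `migrate_v2` leaves `"incoming"` in `dst.domains` while `migrate_v3` removes it, which is exactly the stale-domain symptom this bug describes.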
I implemented probably the most invasive part needed to get this bz fixed (actually, to fix what can be fixed), which is to preserve where exactly we are in the process of migration so that libvirtd knows what to do when it restarts. I'm currently testing this implementation.

Testing goes well and everything seems to work except for a few corner cases which I have identified so far. I'll fix them and send the series of (currently) 18 patches upstream on Thursday.

Patch set for upstream review:
https://www.redhat.com/archives/libvir-list/2011-July/msg00384.html

Almost done -- all the patches from that series are upstream, ending with this commit:

    commit f9a837da73a11ef106a12a530e292f2ecb093016
    Author: Jiri Denemark <jdenemar>
    Date: Tue Jul 19 02:27:39 2011 +0200

        qemu: Remove special case for virDomainAbortJob

        This doesn't abort migration job in any phase, yet.

However, it introduced a regression in 'virsh managedsave' that I'm still trying to patch:
https://www.redhat.com/archives/libvir-list/2011-July/msg02026.html

(In reply to comment #18)
> Almost done - all the patches from that series are upstream, ending with this
> commit:
>
> commit f9a837da73a11ef106a12a530e292f2ecb093016
> but it introduced a regression in 'virsh managedsave' that I'm still trying to
> patch:
> https://www.redhat.com/archives/libvir-list/2011-July/msg02026.html

That regression is now split into bug 727249; moving this to POST.

(In reply to comment #18)
> Almost done - all the patches from that series are upstream, ending with this
> commit:
>
> commit f9a837da73a11ef106a12a530e292f2ecb093016

Also need this:

    commit f362a99a53a7e916d4c957a8c5dea819e3481fbc
    Author: Osier Yang <jyang>
    Date: Mon Aug 1 19:41:07 2011 +0800

        qemu: Fix a regression of domjobabort

        Introduced by f9a837da73a11ef, the condition is not changed after
        the else clause is removed. So now it quits with "domain is not
        running" when the domain is running. However, when the domain is
        not running, it reports "no job is active".
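The core idea of the fix described above — recording how far a migration has progressed in persistent storage so that a restarted libvirtd knows whether a job was in flight and can cancel it — can be sketched roughly as follows. The file format, phase names, and function names here are invented for illustration; the real implementation records this state in libvirt's per-domain status data, not in a JSON file.

```python
# Illustrative sketch of persisting a migration phase across a daemon
# restart. Not libvirt's actual mechanism or on-disk format.
import json
import os

def save_phase(path, phase):
    """Persist the current (hypothetical) migration phase to disk,
    updated at each step of the migration."""
    with open(path, "w") as f:
        json.dump({"migration_phase": phase}, f)

def recover_on_restart(path):
    """Called when the daemon starts: if the status file shows a
    migration was in flight, cancel it instead of leaving the domain
    in limbo on either host."""
    if not os.path.exists(path):
        return "no-job"
    with open(path) as f:
        phase = json.load(f)["migration_phase"]
    if phase in ("begin", "perform"):
        os.unlink(path)          # cancel the job, drop the stale record
        return "canceled"
    return "no-job"
```

A daemon restart then becomes safe: `save_phase(path, "perform")` written before the restart makes `recover_on_restart(path)` return `"canceled"`, so neither host is left believing the migration is still running.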
Verification passed on:
libvirt-0.9.4-0rc1.2.el6.x86_64
kernel-2.6.32-171.el6.x86_64
qemu-kvm-0.12.1.2-2.172.el6.x86_64

After restarting libvirtd, the same guest no longer exists on the target, and on the source the guest is still running. Moving this bug to VERIFIED according to comment 21.

*** Bug 722417 has been marked as a duplicate of this bug. ***

Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
If either the source or destination libvirtd was restarted while a QEMU domain was migrating, the migration was not properly canceled: a stale domain could be left on the target host and/or the domain could end up in an unexpected state on the source host. With this update, libvirtd tracks ongoing migrations in a persistent file and properly cancels them when it is restarted.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html