Bug 690175

Summary: [Libvirt][qemu-kvm] Split-brain when migrating vm and restarting libvirtd.
Product: Red Hat Enterprise Linux 6
Reporter: David Naori <dnaori>
Component: libvirt
Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA
QA Contact: Virtualization Bugs <virt-bugs>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 6.1
CC: dallan, dnaori, dyuan, eblake, hateya, jhenner, mgoldboi, vbian, weizhan, yoyzhang
Target Milestone: rc
Keywords: Regression
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: libvirt-0.9.4-0rc1.2.el6
Doc Type: Bug Fix
Doc Text:
If either the source or the destination libvirtd was restarted while a QEMU domain was being migrated, the migration was not properly canceled; a stale domain could be left on the target host and/or the domain could end up in an unexpected state on the source host. With this update, libvirtd tracks ongoing migrations in a persistent file and properly cancels them when it is restarted.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2011-12-06 11:03:30 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 727249    

Description David Naori 2011-03-23 14:12:09 UTC
Description of problem:
When migrating a vm and restarting libvirtd - the qemu process is still alive on the source and another one is created on the destination.

virsh -r list:

source:

 Id Name                 State
----------------------------------
143 FC                   running

destination: 

 Id Name                 State
----------------------------------
  2 FC                   paused


When trying to destroy this domain, an error occurs:

error: Failed to destroy domain FC
error: Timed out during operation: cannot acquire state change lock
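
For reference, the split-brain can be confirmed from a shell on each host with something like the following (a minimal sketch; the domain name FC matches the listings above and the grep pattern assumes qemu was started with "-name FC"):

  # Run on both the source and the destination host:
  virsh -r list --all                  # read-only connection, as in the listings above
  ps aux | grep '[q]emu.*-name FC'     # when the bug hits, each host shows its own qemu process for FC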

  
Version-Release number of selected component (if applicable):

libvirt-0.8.7-13.el6.x86_64
qemu-kvm-0.12.1.2-2.149.el6.x86_64


How reproducible:
100%

Steps to Reproduce:
1. on source - virsh migrate --p2p --live FC qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system
2. after about ~5 seconds - /etc/init.d/libvirtd restart
3. on destination - virsh destroy FC (a scripted version of these steps is sketched below)
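
A scripted version of the same steps, run on the source host as root (a sketch only; the destination URI, the 5 second delay and the use of the init script are taken from the manual steps above and may need adjusting):

  #!/bin/bash
  # Reproduction sketch: restart libvirtd while a p2p live migration is in flight.
  DEST_URI=qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system
  virsh migrate --p2p --live FC "$DEST_URI" &   # start the migration in the background
  sleep 5                                       # give the migration time to start transferring
  /etc/init.d/libvirtd restart                  # restart libvirtd mid-migration
  wait                                          # the virsh command is expected to fail here
  # then, on the destination host:
  #   virsh destroy FC    # fails with "cannot acquire state change lock" when the bug hits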


Actual results:
2 qemu processes for the same vm

Expected results:

destination domain should be destroyed due to unsuccessful migration

Comment 2 Dan Kenigsberg 2011-03-28 15:48:18 UTC
*** Bug 689921 has been marked as a duplicate of this bug. ***

Comment 3 Eric Blake 2011-04-05 22:33:09 UTC
I haven't looked into this closely yet.  We already know that aborting libvirtd during a migration is not necessarily the best action.  Furthermore, it might not be possible to fix this without going to migration v3, where there is one additional round of handshaking between the source and destination (the current migration code uses v2, which has no way to feed back failure from the source to the destination, which explains why the destination is in a bad state when the source gave up).

But I've got enough of a setup to try to reproduce this one locally this week, to see if I can come up with any quick fixes without having to go to migration v3.

Comment 6 Eric Blake 2011-04-12 23:01:05 UTC
More on migration v3 (still not upstream at this point...)
https://www.redhat.com/archives/libvir-list/2011-February/msg00259.html

Comment 8 Vivian Bian 2011-04-15 09:28:04 UTC
QE managed to reproduce this bug with the following scenarios:

[Scenario 1 -- Single guest]
1. on source - virsh migrate --p2p --live FC qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system
2. after ~2 seconds - /etc/init.d/libvirtd restart
3. on destination - virsh destroy FC

[Scenario 2 -- 256 guests]
1. on source - virsh migrate --p2p --live FC qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system
2. after ~5 seconds - /etc/init.d/libvirtd restart
3. on destination - virsh destroy FC


What I can tell here is that the host should be under stress to make the migration run slowly; then it is easy to hit this bug. One way to do this is sketched below.
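
One way to keep the migration slow enough to hit the window (a sketch only; it assumes the stress tool is installed in the guest, but any memory-dirtying or CPU-heavy workload should have the same effect):

  # Inside the guest FC, dirty memory faster than it can be transferred so the
  # live migration keeps iterating and stays in progress longer:
  stress --vm 2 --vm-bytes 512M --timeout 120s
  # Loading the source host itself (CPU/IO) also slows the migration down.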

Comment 13 Jiri Denemark 2011-05-03 11:36:54 UTC
I still haven't finished analyzing all aspects of restarting libvirtd during migration. However, I agree with Eric that migration v3 will help with detecting some migration failures and recovering from them. But we will definitely need more than migration v3 to support libvirtd restart during migration.

Comment 15 Jiri Denemark 2011-06-22 15:59:37 UTC
I implemented probably the most invasive part needed to get this bz fixed (actually, to fix what can be fixed), which is to preserve where exactly we are in the migration process so that libvirtd knows what to do when it restarts. I'm currently testing this implementation.
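
If the phase is preserved the way this comment describes, it should be visible on the source host while a migration is running; the sketch below assumes it is recorded in the per-domain status XML that libvirtd keeps for running domains (the path and the element name are assumptions, not something stated in this bug):

  # On the source host, while the migration from the reproduction steps is in flight:
  grep -i 'job' /var/run/libvirt/qemu/FC.xml   # look for a recorded migration job/phase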

Comment 16 Jiri Denemark 2011-07-05 07:57:47 UTC
Testing goes well and everything seems to work except for a few corner cases I have identified so far. I'll fix them and send the series of (currently) 18 patches upstream on Thursday.

Comment 17 Jiri Denemark 2011-07-08 08:48:55 UTC
Patches set for upstream review: https://www.redhat.com/archives/libvir-list/2011-July/msg00384.html

Comment 18 Eric Blake 2011-07-28 20:41:23 UTC
Almost done - all the patches from that series are upstream, ending with this commit:

commit f9a837da73a11ef106a12a530e292f2ecb093016
Author: Jiri Denemark <jdenemar>
Date:   Tue Jul 19 02:27:39 2011 +0200

    qemu: Remove special case for virDomainAbortJob
    
    This doesn't abort migration job in any phase, yet.

but it introduced a regression in 'virsh managedsave' that I'm still trying to patch:
https://www.redhat.com/archives/libvir-list/2011-July/msg02026.html

Comment 19 Eric Blake 2011-08-01 17:08:23 UTC
(In reply to comment #18)
> Almost done - all the patches from that series are upstream, ending with this
> commit:
> 
> commit f9a837da73a11ef106a12a530e292f2ecb093016

> but it introduced a regression in 'virsh managedsave' that I'm still trying to
> patch:
> https://www.redhat.com/archives/libvir-list/2011-July/msg02026.html

That regression is now split into bug 727249; moving this to POST.

Comment 20 Eric Blake 2011-08-01 21:19:58 UTC
(In reply to comment #18)
> Almost done - all the patches from that series are upstream, ending with this
> commit:
> 
> commit f9a837da73a11ef106a12a530e292f2ecb093016

Also need this:

commit f362a99a53a7e916d4c957a8c5dea819e3481fbc
Author: Osier Yang <jyang>
Date:   Mon Aug 1 19:41:07 2011 +0800

    qemu: Fix a regression of domjobabort
    
    Introduced by f9a837da73a11ef, the condition is not changed after
    the else clause is removed. So now it quit with "domain is not
    running" when the domain is running. However, when the domain is
    not running, it reports "no job is active".

Comment 21 weizhang 2011-08-02 07:15:56 UTC
Verification passed on
libvirt-0.9.4-0rc1.2.el6.x86_64
kernel-2.6.32-171.el6.x86_64
qemu-kvm-0.12.1.2-2.172.el6.x86_64

After restarting libvirtd, the same guest no longer exists on the target, and on the source the guest is still running.
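
A shell-level version of that check, mirroring the original reproduction steps (a sketch; host names and timing are the ones used earlier in this bug):

  # On the source host:
  virsh migrate --p2p --live FC qemu+tls://rhev-i32c-02.mpc.lab.eng.bos.redhat.com/system &
  sleep 5 && /etc/init.d/libvirtd restart
  virsh -r list --all      # expected: FC still listed, and running, on the source
  # On the destination host:
  virsh -r list --all      # expected: no stale FC domain left behind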

Comment 23 dyuan 2011-08-03 07:35:45 UTC
Move this bug to verified according to comment 21.

Comment 24 Jiri Denemark 2011-08-04 08:58:38 UTC
*** Bug 722417 has been marked as a duplicate of this bug. ***

Comment 25 Jiri Denemark 2011-11-14 14:24:20 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
If either the source or the destination libvirtd was restarted while a QEMU domain was being migrated, the migration was not properly canceled; a stale domain could be left on the target host and/or the domain could end up in an unexpected state on the source host. With this update, libvirtd tracks ongoing migrations in a persistent file and properly cancels them when it is restarted.

Comment 26 errata-xmlrpc 2011-12-06 11:03:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1513.html