Bug 725373

Summary: [libvirt] when using domjobabort to abort a stuck migration, the migration command still hangs.
Product: Red Hat Enterprise Linux 6 Reporter: David Naori <dnaori>
Component: libvirt Assignee: Jiri Denemark <jdenemar>
Status: CLOSED ERRATA QA Contact: Virtualization Bugs <virt-bugs>
Severity: medium Docs Contact:
Priority: high    
Version: 6.1 CC: abaron, berrange, dallan, danken, dyuan, hateya, mgoldboi, mzhan, rwu, vbian, veillard, weizhan, wolfram, ykaul, yupzhang
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: libvirt-0.9.9-1.el6 Doc Type: Bug Fix
Doc Text:
When a destination host lost network connectivity while a domain was being migrated to it, the migration process could not be canceled. While the cancellation request itself succeeded, the subsequent steps needed to make both sides aware of the cancellation still blocked, which prevented the migration protocol from finishing. In Red Hat Enterprise Linux 6.3, libvirt implements an internal keepalive protocol that is able to detect broken connections or blocked libvirt daemons (within 30 seconds with the default configuration). When such a situation is detected during migration, libvirt automatically cancels it.
Story Points: ---
Clone Of:
: 799478 (view as bug list) Environment:
Last Closed: 2012-06-20 06:29:28 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 669581    
Bug Blocks: 723198, 773650, 773651, 773677, 773696, 799478    
Attachments:
  libvirtd log (flags: none)
  source libvirtd log (flags: none)
  dest libvirtd log (flags: none)

Description David Naori 2011-07-25 10:39:39 UTC
Created attachment 515009 [details]
libvirtd log

Description of problem:
When using domjobabort to abort a stuck migration (destination host blocked during migration via iptables), the migration command still hangs.

Version-Release number of selected component (if applicable):
libvirt-python-0.9.3-7.el6.x86_64

How reproducible:
100%

Steps:

[root@magenta-vds6 ~]# virsh -r list
 Id Name                 State
----------------------------------
  8 NFS-FOR-TEMP         running

[root@magenta-vds6 ~]# virsh migrate NFS-FOR-TEMP --live --p2p qemu+tls://10.35.116.1/system


In another shell:
[root@magenta-vds6 ~]# iptables -A OUTPUT -d 10.35.116.1 -j REJECT
[root@magenta-vds6 ~]# virsh domjobabort NFS-FOR-TEMP 

Actual results:
the migration command still hangs.

Expected results:
the migration should return.

Additional info:
libvirtd logs attached.
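
For reference, the same reproduction expressed against the libvirt Python bindings (a minimal sketch only: the domain name and destination URI are taken from the steps above, and the background thread is just an illustration of the blocking behavior, not part of the original report):

import threading
import libvirt  # libvirt-python bindings

DEST_URI = "qemu+tls://10.35.116.1/system"  # destination host from the steps above

def migrate(dom):
    # Equivalent of: virsh migrate NFS-FOR-TEMP --live --p2p qemu+tls://10.35.116.1/system
    flags = libvirt.VIR_MIGRATE_LIVE | libvirt.VIR_MIGRATE_PEER2PEER
    try:
        dom.migrateToURI(DEST_URI, flags, None, 0)
    except libvirt.libvirtError as e:
        print("migration returned with an error: %s" % e)

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("NFS-FOR-TEMP")

worker = threading.Thread(target=migrate, args=(dom,))
worker.start()

# ... at this point, block traffic to the destination from another shell:
#     iptables -A OUTPUT -d 10.35.116.1 -j REJECT

dom.abortJob()  # equivalent of: virsh domjobabort NFS-FOR-TEMP

# With the bug present, this join hangs instead of returning shortly after the
# abort; with the fix, the broken connection is detected and migrate() returns
# with an error.
worker.join()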

Comment 2 Dave Allan 2011-08-04 14:00:33 UTC
I'm not 100% certain that it's the same behavior since I'm using different migration parameters, but I am able to reproduce this behavior with:

virsh migrate --persistent --undefinesource --tunnelled --p2p --live --desturi 'qemu+ssh://root@hybrid0/system' spice

Comment 3 Jiri Denemark 2011-08-04 15:07:24 UTC
There are several issues we're hitting in this scenario:

- If virsh domjobabort is called when migration is not in the perform phase,
  it has no effect.

- If the connectivity between src and dest machines is lost (using iptables in
  this case) while src qemu is still sending data to dest qemu, even
  domjobabort may hang because qemu blocks within migrate_cancel command.
  That's because it tries to flush its buffers first, which may block in
  send() because of full connection buffers. Blocked migrate_cancel command
  means that next attempt to send query-migrate command to qemu times out on
  job condition and perform phase fails. So this situation is only partly bad.

- Finally, if we manage to get out of the perform phase, we call Finish API on
  dest, which hangs because we never see a response to it.

The good thing is that it shouldn't hang forever, only until the TCP
connections time out. It took about 15 minutes for the src host to return to a
normal state in my testing. The dest host needs more time to realize the
connection was lost and that the domain should be automatically destroyed there.

Comment 4 Jiri Denemark 2011-08-08 10:59:59 UTC
So I think we need three things:

- make domjobabort smarter so that it can ideally cancel migration in any
  phase (I opened a new bug 728904 for this)

- use fd: protocol for migration instead of relying on qemu to open a tcp
  connection to destination qemu and send data through it (tracked by bug
  720269)

- introduce in-protocol keep-alive messages so that a dead peer (host or just
  libvirtd) can be detected faster

Let's leave this BZ to track the last of the three things.

Comment 7 Jiri Denemark 2011-08-22 14:57:04 UTC
(In reply to comment #4)
> ...
> - use fd: protocol for migration instead of relying on qemu to open a tcp
>   connection to destination qemu and send data through it (tracked by bug
>   720269)

This was based on my plans to make libvirtd a proxy between the source and destination qemu, in which case we could cancel the migration no matter what qemu does. The idea turned out not to be as good as I thought. Instead, the migrate_cancel command should be fixed in qemu to cancel the migration without blocking.

Comment 8 Jiri Denemark 2011-08-26 10:46:44 UTC
Bug 669581 tracks the migrate_cancel bug in qemu.

Comment 9 Jiri Denemark 2011-09-23 08:29:54 UTC
A series adding keepalive messages to the libvirt RPC protocol was sent upstream for review: https://www.redhat.com/archives/libvir-list/2011-September/msg00938.html

Comment 12 Jiri Denemark 2011-11-24 17:19:24 UTC
Support for keepalive messages is now upstream and will be included in the 0.9.8 release. The series ends with commit v0.9.7-157-g2c4cdb7.
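
For completeness, a minimal sketch of how a client can opt into that keepalive support via the Python bindings (the 5/5 values below are only the commonly documented defaults, not something this fix mandates; libvirtd has matching keepalive_interval and keepalive_count settings in libvirtd.conf):

import threading
import libvirt

# Keepalive requires a client event loop; register the default implementation
# before opening the connection and drive it from a background thread.
libvirt.virEventRegisterDefaultImpl()

def run_event_loop():
    while True:
        libvirt.virEventRunDefaultImpl()

loop = threading.Thread(target=run_event_loop)
loop.daemon = True
loop.start()

conn = libvirt.open("qemu:///system")

# Send a keepalive probe every 5 seconds and give up after 5 unanswered
# probes, so a dead peer is detected after roughly 5 * (5 + 1) = 30 seconds.
conn.setKeepAlive(5, 5)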

Comment 14 weizhang 2012-01-10 03:31:35 UTC
Test with 
kernel-2.6.32-223.el6.x86_64
qemu-kvm-0.12.1.2-2.213.el6.x86_64
libvirt-0.9.9-1.el6.x86_64

1. Do migration
#virsh migrate --live --p2p kvm-rhel6u2-x86_64-new  qemu+tls://10.66.83.197/system
At the same time, in another console do
#iptables -A OUTPUT -d 10.66.83.197 -j REJECT
Then do
#virsh domjobabort kvm-rhel6u2-x86_64-new

The migration job will not return immediately, but will wait for a while and then return:
error: An error occurred, but the cause is unknown

# virsh list
 Id Name                 State
----------------------------------
  8 kvm-rhel6u2-x86_64-new running

But on the target host, virsh list will hang forever. Is that OK?

Comment 15 weizhang 2012-01-16 12:32:12 UTC
Retest on 
kernel-2.6.32-225.el6.x86_64
libvirt-0.9.9-1.el6.x86_64
qemu-kvm-0.12.1.2-2.213.el6.x86_64

I cannot reproduce the phenomenon from comment 14; everything works well, so verification passes.

Comment 16 Jiri Denemark 2012-01-16 12:40:09 UTC
Interesting... In case you hit it again in the future, please attach debug logs from both source and destination libvirt daemons. And always be sure you run virsh domjobabort after migration data starts flowing from source to destination. Running it earlier doesn't provide expected results (see bug 728904).

Comment 17 weizhang 2012-02-20 08:59:50 UTC
Hi Jiri,

I reproduced the phenomenon from comment 14 again on
kernel-2.6.32-230.el6.x86_64
qemu-kvm-0.12.1.2-2.231.el6.x86_64
libvirt-0.9.10-1.el6.x86_64

I will attach the debug log but it is a little big.

Comment 18 weizhang 2012-02-20 09:02:06 UTC
Created attachment 564347 [details]
source libvirtd log

Comment 19 weizhang 2012-02-20 09:02:57 UTC
Created attachment 564348 [details]
dest libvirtd log

Comment 20 weizhang 2012-02-29 11:39:35 UTC
Hi Jiri,

Do I need to re-assign the bug?

Comment 21 weizhang 2012-02-29 11:49:56 UTC
I can still reproduce the phenomenon from comment 14 on
qemu-kvm-0.12.1.2-2.232.el6.x86_64
kernel-2.6.32-225.el6.x86_64
libvirt-0.9.10-3.el6.x86_64

so I am re-assigning this bug

Comment 22 Jiri Denemark 2012-02-29 15:46:32 UTC
Oh, I see what it is. virsh domjobabort correctly cancels the migration and the source libvirtd tries to call the Finish API on the destination libvirtd to tell it about the abortion. Since all packets to the destination libvirtd are discarded, the API waits for some time (~30 seconds) until the broken connection is detected, and then the source libvirtd correctly resumes the domain and reports the migration API as finished. So far, everything worked as expected. However, it seems the error message was eaten somewhere on the way and virsh has nothing useful to report.
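
For context, assuming the default keepalive settings shipped in libvirtd.conf (keepalive_interval = 5, keepalive_count = 5, an assumption about this setup rather than something recorded in this bug), the peer is declared dead after roughly 5 s * (5 + 1) = 30 s without a response, which matches the delay described above.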

Comment 23 Jiri Denemark 2012-03-02 17:30:57 UTC
Actually the eaten error message is a minor issue which doesn't need to block this bug. I'm moving this bz back to ON_QA and I'll create a new bug requesting the error message to be fixed.

Comment 24 weizhang 2012-03-05 02:34:41 UTC
But another problem is that "virsh list" hangs on the target machine; do I need to file a new bug for it?

Comment 25 Jiri Denemark 2012-03-05 08:58:11 UTC
Oh, I somehow missed that... Let me check quickly.

Comment 26 Jiri Denemark 2012-03-06 07:04:59 UTC
OK, the part handled by this BZ seems to be working just fine. Destination libvirtd correctly detects the connection to the source is broken and starts to destroy the migrated domain. But libvirtd seems to hang while doing so. Any chance the destination host also lost connection to the shared storage used by the migrated domain? Perhaps because it used nfs from the source host?

Comment 27 weizhang 2012-03-06 08:37:47 UTC
(In reply to comment #26)
> OK, the part handled by this BZ seems to be working just fine. Destination
> libvirtd correctly detects the connection to the source is broken and starts to
> destroy the migrated domain. But libvirtd seems to hang while doing so. Any
> chance the destination host also lost connection to the shared storage used by
> the migrated domain? Perhaps because it used nfs from the source host?

Yes, you are right. The virsh list hang is because I used NFS from the source host. When using a third host as the NFS server, virsh list does not hang on the target machine. But is that a new problem?

Comment 28 weizhang 2012-03-06 08:44:24 UTC
Anyway, this bug can be verified, as domjobabort really aborts the migration. So I am changing the status to verified.

Comment 29 Jiri Denemark 2012-03-06 13:26:29 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > OK, the part handled by this BZ seems to be working just fine. Destination
> > libvirtd correctly detects the connection to the source is broken and starts to
> > destroy the migrated domain. But libvirtd seems to hang while doing so. Any
> > chance the destination host also lost connection to the shared storage used by
> > the migrated domain? Perhaps because it used nfs from the source host?
> 
> Yes, you are right. The virsh list hang is because I used NFS from the source
> host. When using a third host as the NFS server, virsh list does not hang on
> the target machine. But is that a new problem?

Hmm, AFAIK we had similar bugs in the past, e.g., bug 746666, which is supposed to be fixed already. I think the safest way is to file a new bug for the hang you see.

Comment 30 Jiri Denemark 2012-05-10 13:36:47 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
When a destination host lost network connectivity while a domain was being
migrated to it, the migration process could not be canceled. While the
cancellation request itself succeeded, the subsequent steps needed to make both
sides aware of the cancellation still blocked, which prevented the migration
protocol from finishing. In Red Hat Enterprise Linux 6.3, libvirt implements an
internal keepalive protocol that is able to detect broken connections or
blocked libvirt daemons (within 30 seconds with the default configuration).
When such a situation is detected during migration, libvirt automatically cancels it.

Comment 32 errata-xmlrpc 2012-06-20 06:29:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2012-0748.html