Bug 1256213

Summary: "Virsh migrate" hangs after virsh keepalive times out.
Product: Red Hat Enterprise Linux Advanced Virtualization
Reporter: Fangge Jin <fjin>
Component: libvirt
Assignee: Virtualization Maintenance <virt-maint>
Status: CLOSED WONTFIX
QA Contact: Fangge Jin <fjin>
Severity: medium
Docs Contact:
Priority: low
Version: 8.0
CC: dyuan, dzheng, fjin, jdenemar, jsuchane, knoel, mvanderw, mzhan, xuzhang, zpeng
Target Milestone: rc
Keywords: Triaged
Target Release: 8.1
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-05-15 07:30:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1288337
Attachments:
Description         Flags
libvirtd debug log  none
gdb output          none

Description Fangge Jin 2015-08-24 05:01:31 UTC
Description of problem:
When the network interface on the target host is brought down during migration, the virsh keepalive times out after k*(K+1) seconds, but virsh then hangs instead of exiting.
# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
Migration: [ 71 %]2015-08-24 04:48:36.289+0000: 24722: info : libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com)
2015-08-24 04:48:36.289+0000: 24722: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f7b930f3f10 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]
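
The k*(K+1) figure above can be checked with a short sketch. It assumes, going only by this report, that -k is the keepalive interval in seconds and -K is the number of unanswered keepalive messages tolerated. The function name is hypothetical and not part of libvirt.

```python
def keepalive_timeout(interval_s: int, count: int) -> int:
    """Seconds of silence before the connection is declared dead.

    Assumption taken from the report: the timeout is k*(K+1),
    where k is the interval and K is the allowed message count.
    """
    return interval_s * (count + 1)


# virsh -k2 -K20, as used in the report
print(keepalive_timeout(2, 20))  # prints 42
```

This matches the "after 20 keepalive messages in 42 seconds" warning in the log above.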

Version-Release number of selected component (if applicable):
libvirt-1.2.17-5.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start the migration:
[root@fjin-4-141 test]# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
Migration: [ 71 %]

2. Bring down the network interface on the target host during migration:
# ifconfig enp0s25 down

3. Wait 42 s. The virsh client prints a warning that the keepalive timed out, and then hangs:
[root@fjin-4-141 test]# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
Migration: [ 71 %]2015-08-24 04:48:36.289+0000: 24722: info : libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com)
2015-08-24 04:48:36.289+0000: 24722: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f7b930f3f10 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]

4. Wait a few minutes, then restore the network on the target host:
# ifconfig enp0s25 up

The virsh client returns with an error:
Migration: [ 71 %]error: operation failed: migration job: unexpectedly failed


Actual results:
As described in the steps above.

Expected results:
The migration fails and the virsh client returns immediately after the keepalive times out.

Additional info:

Comment 1 Fangge Jin 2015-08-24 05:04:29 UTC
Created attachment 1066202
libvirtd debug log

Comment 2 Fangge Jin 2015-08-24 05:05:17 UTC
Created attachment 1066203
gdb output

Comment 4 Martin Kletzander 2015-08-24 11:30:05 UTC
What's the behaviour when you press Ctrl-C after that message about keepalive timeout is printed out?

Comment 5 Fangge Jin 2015-08-25 01:59:00 UTC
(In reply to Martin Kletzander from comment #4)
> What's the behaviour when you press Ctrl-C after that message about
> keepalive timeout is printed out?

After pressing Ctrl-C, it prints "migration job: canceled by client" and exits:

[root@fjin-4-141 test]# virsh -k2 -K20 migrate rhel6.6-GUI --live --verbose qemu+ssh://10.66.4.208/system 
root.4.208's password: 
Migration: [ 71 %]2015-08-25 01:55:44.625+0000: 22951: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-08-25 01:55:44.625+0000: 22951: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe50206bf70 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]^Cerror: operation aborted: migration job: canceled by client

[root@fjin-4-141 test]#

Comment 7 Fangge Jin 2015-08-26 01:52:58 UTC
I tried build libvirt-1.2.17-3.el7+bz1256213.x86_64; the behaviour is the same as before.

Comment 10 Martin Kletzander 2015-09-02 13:37:46 UTC
I'm still trying to figure out how it is possible for you to get that kind of output. Looking at everything, the most interesting part I notice is that it really looks like you're getting disconnections from the target host, but virsh does not set up keepalive for the destination when migrating. And I can't reproduce it following the steps that you have. What's the output of 'virsh uri'?

Comment 11 Fangge Jin 2015-09-06 02:14:45 UTC
I can still get the same results as before by following the steps. The output of 'virsh uri' on both source and target is:
# virsh uri
qemu:///system

And why do you say "virsh does not set up keepalive for the destination when migrating"? I think the message virsh printed after the disconnection indicates that it had keepalive set up for the destination:
warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe50206bf70 after 20 keepalive messages in 42 seconds

Comment 12 Martin Kletzander 2015-09-14 13:27:30 UTC
I meant that no matter which way I look at the source, virsh only sets up client keepalive on the connection to the source, not the destination. And that's not even considering p2p migrations and the like.

Comment 14 Jiri Denemark 2016-06-28 17:56:43 UTC
This may be already fixed by patches I pushed upstream some time ago.

Could you please retest this with the current version of libvirt?

Comment 15 Fangge Jin 2016-06-30 03:13:44 UTC
Tried on build libvirt-1.3.5-1.el7.x86_64; it seems the problem still exists.

Steps:
1. Start the migration:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [  1 %]

2. Before the migration completes, on the target host, run:
# iptables -A OUTPUT -s <source ip> -j DROP
# iptables -A INPUT -s <source ip> -j DROP

3. Wait more than 1 minute; virsh does not exit and prints no error message:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [  1 %]

4. On target host:
# iptables -F

5. Wait a while; virsh exits:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [  1 %]error: operation failed: migration job: unexpectedly failed


real	2m1.763s
user	0m0.043s
sys	0m0.062s

Comment 16 Jiri Denemark 2017-11-16 15:45:32 UTC
It works fine with peer-to-peer migration controlled by the source libvirtd. This issue only affects non-p2p migration when the client controls the migration by calling several APIs on each side of the migration. The source libvirtd cannot see when a connection between the client and the destination host breaks and thus it cannot automatically abort the migration. The client itself will need to do this.
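
A minimal sketch of what "the client itself will need to do this" could look like, written as a library-agnostic model rather than as actual virsh or libvirt code. The class and method names are hypothetical. In a real client, the handler would presumably be wired to a close callback on the destination connection and the abort callable to an abort-job call on the source domain, but that wiring is an assumption and is not shown here.

```python
import threading
from typing import Callable


class MigrationWatchdog:
    """Hypothetical client-side helper for non-p2p migration.

    When the client notices that its connection to the destination
    is gone (for example after a keepalive timeout), it asks the
    source to abort the migration job, because the source libvirtd
    cannot see that connection itself.
    """

    def __init__(self, abort_source_job: Callable[[], None]) -> None:
        self._abort_source_job = abort_source_job
        self._lock = threading.Lock()
        self.aborted = False

    def on_destination_closed(self, reason: str) -> None:
        # The handler may fire more than once or from another thread,
        # so make sure the abort is requested exactly once.
        with self._lock:
            if self.aborted:
                return
            self.aborted = True
        self._abort_source_job()


# Usage with a stand-in for the real "abort job on source" call
calls = []
watchdog = MigrationWatchdog(lambda: calls.append("abort"))
watchdog.on_destination_closed("keepalive timeout")
watchdog.on_destination_closed("keepalive timeout")
print(calls)  # prints ['abort']
```

The point of the design is only that the abort decision lives in the client, which is the one party that observes both connections in a non-p2p migration.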

Comment 17 Jiri Denemark 2018-09-03 14:55:56 UTC
*** Bug 1367620 has been marked as a duplicate of this bug. ***

Comment 18 Jiri Denemark 2019-04-25 08:21:35 UTC
This bug is going to be addressed in the next major release.

Comment 22 RHEL Program Management 2021-05-15 07:30:42 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.