Description of problem:
When the network interface on the target host is brought down during migration, the virsh keepalive times out after k*(K+1) seconds, but virsh hangs instead of returning.

# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password:
Migration: [ 71 %]2015-08-24 04:48:36.289+0000: 24722: info : libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com)
2015-08-24 04:48:36.289+0000: 24722: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f7b930f3f10 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]

Version-Release number of selected component (if applicable):
libvirt-1.2.17-5.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start the migration:
[root@fjin-4-141 test]# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password:
Migration: [ 71 %]

2. Bring the network down on the target host during migration:
# ifconfig enp0s25 down

3. Wait 42s. The virsh client prints a message that keepalive timed out, then hangs:
[root@fjin-4-141 test]# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password:
Migration: [ 71 %]2015-08-24 04:48:36.289+0000: 24722: info : libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com)
2015-08-24 04:48:36.289+0000: 24722: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f7b930f3f10 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]

4. Wait a few minutes, then restore the network on the target host:
# ifconfig enp0s25 up
The virsh client returns with an error:
Migration: [ 71 %]error: operation failed: migration job: unexpectedly failed

Actual results:
As described in the steps above.

Expected results:
The migration fails and the virsh client returns immediately after the keepalive times out.

Additional info:
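For reference, the 42 seconds in the warning above follows from the k*(K+1) rule with -k2 -K20. A minimal sketch of that arithmetic; the function name `keepalive_timeout` is made up here for illustration and is not part of virsh or libvirt:

```python
def keepalive_timeout(interval: int, count: int) -> int:
    """Seconds of silence after which the connection is considered dead.

    interval corresponds to virsh -k (seconds between keepalive messages),
    count corresponds to virsh -K (unanswered messages tolerated).
    The name of this helper is hypothetical.
    """
    return interval * (count + 1)


# Values from the report: virsh -k2 -K20
print(keepalive_timeout(2, 20))
```

With the values from the report this evaluates to 2 * (20 + 1) = 42, which matches "after 20 keepalive messages in 42 seconds".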
Created attachment 1066202 [details] libvirtd debug log
Created attachment 1066203 [details] gdb output
What's the behaviour when you press Ctrl-C after that message about keepalive timeout is printed out?
(In reply to Martin Kletzander from comment #4)
> What's the behaviour when you press Ctrl-C after that message about
> keepalive timeout is printed out?

After pressing Ctrl-C, it prints "migration job: canceled by client" and exits:

[root@fjin-4-141 test]# virsh -k2 -K20 migrate rhel6.6-GUI --live --verbose qemu+ssh://10.66.4.208/system
root.4.208's password:
Migration: [ 71 %]2015-08-25 01:55:44.625+0000: 22951: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-08-25 01:55:44.625+0000: 22951: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe50206bf70 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]^Cerror: operation aborted: migration job: canceled by client
[root@fjin-4-141 test]#
I tried build libvirt-1.2.17-3.el7+bz1256213.x86_64; the behaviour is the same as before.
I'm still trying to figure out how it is possible for you to get that kind of output. Looking at everything, the most interesting thing I notice is that it really looks like you're getting disconnected from the target host, but virsh does not set up keepalive for the destination when migrating. And I can't reproduce it by following your steps. What's the output of 'virsh uri'?
I can still get the same results as before by following the steps. The output of 'virsh uri' on both source and target is:

# virsh uri
qemu:///system

And why do you say "virsh does not set up keepalive for the destination when migrating"? I think the message virsh printed after the disconnection indicates that it had keepalive for the destination:

warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe50206bf70 after 20 keepalive messages in 42 seconds
I meant that no matter which way I look at the source, virsh only sets up client keepalive on the connection to the source, not the destination. And that's not even considering p2p migrations and the like.
This may be already fixed by patches I pushed upstream some time ago. Could you please retest this with the current version of libvirt?
Tried on build libvirt-1.3.5-1.el7.x86_64; the problem still seems to exist.

Steps:
1. Start the migration:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [ 1 %]

2. Before the migration completes, on the target host, run:
# iptables -A OUTPUT -s <source ip> -j DROP
# iptables -A INPUT -s <source ip> -j DROP

3. Wait more than 1 minute. virsh doesn't exit and prints no error message:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [ 1 %]

4. On the target host:
# iptables -F

5. Wait a while; virsh exits:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [ 1 %]error: operation failed: migration job: unexpectedly failed

real 2m1.763s
user 0m0.043s
sys 0m0.062s
It works fine with peer-to-peer migration controlled by the source libvirtd. This issue only affects non-p2p migration when the client controls the migration by calling several APIs on each side of the migration. The source libvirtd cannot see when a connection between the client and the destination host breaks and thus it cannot automatically abort the migration. The client itself will need to do this.
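One way a client could do this is sketched below, as an illustration only and not what virsh does: a watchdog polls the destination connection and aborts the job on the source domain once the destination is reported dead. The helper name `abort_job_when_dead` and its parameters are hypothetical. The sketch is duck-typed and assumes `dest_conn` offers `isAlive()` and `src_dom` offers `abortJob()`, as libvirt-python's `virConnect` and `virDomain` objects do.

```python
import threading


def abort_job_when_dead(dest_conn, src_dom, poll_interval=1.0, stop_event=None):
    """Abort the job on src_dom once dest_conn is no longer alive.

    Hypothetical helper. Returns True if the job was aborted, False if
    stop_event was set first (for example because migration finished).
    """
    if stop_event is None:
        stop_event = threading.Event()
    while not stop_event.is_set():
        try:
            alive = bool(dest_conn.isAlive())
        except Exception:
            # Treat an error while checking the connection as "dead".
            alive = False
        if not alive:
            # Destination is unreachable: cancel the job on the source side.
            src_dom.abortJob()
            return True
        # Sleep, but wake up early if the caller signals completion.
        stop_event.wait(poll_interval)
    return False
```

In real use this would run in a separate thread next to the migration call, with keepalive enabled on the destination connection so that `isAlive()` can actually notice the broken link; the caller sets `stop_event` when the migration call returns.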
*** Bug 1367620 has been marked as a duplicate of this bug. ***
This bug is going to be addressed in next major release.
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.