1256213 – "Virsh migrate" hangs after virsh keepalive times out.

Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux Advanced Virtualization
Classification:	Red Hat
Component:	libvirt
Version:	8.0
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	8.1
Assignee:	Virtualization Maintenance
QA Contact:	Fangge Jin
Blocks:	1288337

Reported:	2015-08-24 05:01 UTC by Fangge Jin
Modified:	2021-05-15 07:30 UTC (History)
Last Closed:	2021-05-15 07:30:42 UTC
Type:	Bug

Attachments
libvirtd debug log (6.66 MB, text/plain) 2015-08-24 05:04 UTC, Fangge Jin
gdb output (5.30 KB, text/plain) 2015-08-24 05:05 UTC, Fangge Jin

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Bugzilla	1073506	0	low	CLOSED	[RFE] Add keepalive support into virsh	2021-02-22 00:41:40 UTC

Internal Links: 1073506

Description Fangge Jin 2015-08-24 05:01:31 UTC

Description of problem:
When the network interface on the target host is brought down during migration, the virsh keepalive times out after k*(K+1) seconds, but virsh hangs instead of exiting.
# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
Migration: [ 71 %]2015-08-24 04:48:36.289+0000: 24722: info : libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com)
2015-08-24 04:48:36.289+0000: 24722: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f7b930f3f10 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]
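
The 42 seconds in the warning above matches the k*(K+1) formula from the description when virsh is run with -k2 (interval) and -K20 (count). A small worked check of that arithmetic; the function name is hypothetical and is not part of virsh or libvirt:

```python
def keepalive_timeout(interval_s: int, count: int) -> int:
    """Seconds of silence after which the keepalive is expected to time out.

    Implements the k*(K+1) formula quoted in this report, where k is the
    keepalive interval in seconds and K is the number of unanswered
    keepalive messages tolerated.
    """
    return interval_s * (count + 1)


# Values from the report: virsh -k2 -K20
print(keepalive_timeout(2, 20))  # prints 42
```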

Version-Release number of selected component (if applicable):
libvirt-1.2.17-5.el7.x86_64
qemu-kvm-rhev-2.3.0-18.el7.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Start the migration:
[root@fjin-4-141 test]# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
Migration: [ 71 %]

2. Bring down the network on the target host during migration:
# ifconfig enp0s25 down

3. Wait 42s. The virsh client prints a message that the keepalive timed out, and then hangs.
[root@fjin-4-141 test]# virsh -k2 -K20 migrate r71 qemu+ssh://10.66.4.208/system --verbose
root.4.208's password: 
Migration: [ 71 %]2015-08-24 04:48:36.289+0000: 24722: info : libvirt version: 1.2.17, package: 5.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-13-18:08:20, x86-024.build.eng.bos.redhat.com)
2015-08-24 04:48:36.289+0000: 24722: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f7b930f3f10 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]

4. Wait a few minutes, then restore the network on the target host:
# ifconfig enp0s25 up

The virsh client returns with an error:
Migration: [ 71 %]error: operation failed: migration job: unexpectedly failed


Actual results:
As described in the steps above.

Expected results:
The migration fails and the virsh client returns immediately after the keepalive times out.

Additional info:

Comment 1 Fangge Jin 2015-08-24 05:04:29 UTC

Created attachment 1066202
libvirtd debug log

Comment 2 Fangge Jin 2015-08-24 05:05:17 UTC

Created attachment 1066203
gdb output

Comment 4 Martin Kletzander 2015-08-24 11:30:05 UTC

What's the behaviour when you press Ctrl-C after that message about keepalive timeout is printed out?

Comment 5 Fangge Jin 2015-08-25 01:59:00 UTC

(In reply to Martin Kletzander from comment #4)
> What's the behaviour when you press Ctrl-C after that message about
> keepalive timeout is printed out?

After pressing Ctrl-C, it prints "migration job: canceled by client" and exits:

[root@fjin-4-141 test]# virsh -k2 -K20 migrate rhel6.6-GUI --live --verbose qemu+ssh://10.66.4.208/system 
root.4.208's password: 
Migration: [ 71 %]2015-08-25 01:55:44.625+0000: 22951: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-08-25 01:55:44.625+0000: 22951: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe50206bf70 after 20 keepalive messages in 42 seconds
Migration: [ 71 %]^Cerror: operation aborted: migration job: canceled by client

[root@fjin-4-141 test]#

Comment 7 Fangge Jin 2015-08-26 01:52:58 UTC

I tried with build libvirt-1.2.17-3.el7+bz1256213.x86_64; the behaviour is the same as before.

Comment 10 Martin Kletzander 2015-09-02 13:37:46 UTC

I'm still trying to figure out how it is possible for you to get that kind of output.  Looking at everything, the most interesting thing I notice is that it really looks like you're getting disconnected from the target host, but virsh does not set up keepalive for the destination when migrating.  And I can't reproduce it by following your steps.  What's the output of 'virsh uri'?

Comment 11 Fangge Jin 2015-09-06 02:14:45 UTC

I can still get the same results as before by following the steps. The output of 'virsh uri' on both source and target is:
# virsh uri
qemu:///system

And why do you say "virsh does not set up keepalive for the destination when migrating"?  I think the message virsh printed after the disconnection indicates that it had keepalive set up for the destination:
warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe50206bf70 after 20 keepalive messages in 42 seconds

Comment 12 Martin Kletzander 2015-09-14 13:27:30 UTC

I meant that no matter which way I look at the source, virsh only sets up client keepalive on the connection to the source, not the destination.  And that's not even considering p2p migrations and the like.

Comment 14 Jiri Denemark 2016-06-28 17:56:43 UTC

This may be already fixed by patches I pushed upstream some time ago.

Could you please retest this with the current version of libvirt?

Comment 15 Fangge Jin 2016-06-30 03:13:44 UTC

Tried on build libvirt-1.3.5-1.el7.x86_64; the problem still seems to exist.

Steps:
1. Start the migration:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [  1 %]

2. Before migration completes, on target host, do:
#  iptables -A OUTPUT -s <source ip> -j DROP
#  iptables -A INPUT -s <source ip> -j DROP

3. Wait more than 1 minute; virsh doesn't exit and prints no error message:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [  1 %]

4. On target host:
# iptables -F

5. Wait a while, virsh exits:
# time virsh -k2 -K20 migrate rhel6 qemu+ssh://hp-dl385g7-06.lab.eng.pek2.redhat.com/system --verbose --live
Migration: [  1 %]error: operation failed: migration job: unexpectedly failed


real	2m1.763s
user	0m0.043s
sys	0m0.062s

Comment 16 Jiri Denemark 2017-11-16 15:45:32 UTC

It works fine with peer-to-peer migration controlled by the source libvirtd. This issue only affects non-p2p migration when the client controls the migration by calling several APIs on each side of the migration. The source libvirtd cannot see when a connection between the client and the destination host breaks and thus it cannot automatically abort the migration. The client itself will need to do this.
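
A minimal sketch of what such client-side handling might look like, written against the libvirt-python method names setKeepAlive(), registerCloseCallback() and abortJob(). The MigrationWatchdog class and its defaults are hypothetical and are not part of virsh or libvirt; the objects are duck-typed so the logic can be exercised without a hypervisor.

```python
import threading


class MigrationWatchdog:
    """Abort the source-side migration job when the destination connection closes.

    Hypothetical helper, not part of libvirt. `src_dom` is assumed to behave
    like a libvirt-python virDomain (abortJob()), and `dst_conn` like a
    virConnect (setKeepAlive(), registerCloseCallback()).
    """

    def __init__(self, src_dom, dst_conn, interval=2, count=20):
        self.src_dom = src_dom
        self.aborted = threading.Event()
        # Ask for keepalive on the destination connection as well, since
        # comment 12 says virsh only sets it up on the source connection.
        dst_conn.setKeepAlive(interval, count)
        # The close callback receives (conn, reason, opaque).
        dst_conn.registerCloseCallback(self._on_close, None)

    def _on_close(self, conn, reason, opaque):
        # Called when the destination connection is closed, for example
        # after a keepalive timeout. Abort the source job only once.
        if not self.aborted.is_set():
            self.aborted.set()
            self.src_dom.abortJob()
```

With real connections this assumes a libvirt event loop is already running (for example libvirt.virEventRegisterDefaultImpl() called before the connections are opened, with a thread looping on libvirt.virEventRunDefaultImpl()), because keepalive and close callbacks are driven by it. Aborting the job on the source is intended to mirror the Ctrl-C path shown in comment 5.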

Comment 17 Jiri Denemark 2018-09-03 14:55:56 UTC

*** Bug 1367620 has been marked as a duplicate of this bug. ***

Comment 18 Jiri Denemark 2019-04-25 08:21:35 UTC

This bug is going to be addressed in the next major release.

Comment 22 RHEL Program Management 2021-05-15 07:30:42 UTC

After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.
