Bug 1073506
Summary: [RFE] Add keepalive support into virsh
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Martin Kletzander <mkletzan>
Assignee: Martin Kletzander <mkletzan>
QA Contact: Virtualization Bugs <virt-bugs>
CC: cwei, dyuan, fjin, gsun, herrold, jdenemar, juzhou, leiwan, lhuang, lmiksik, mkletzan, mzhan, rbalakri, veillard, weizhan, zhwang, zpeng
Target Milestone: rc
Keywords: FutureFeature, Upstream
Fixed In Version: libvirt-1.2.13-1.el7
Doc Type: Enhancement
Clone Of: 822839
Type: Bug
Last Closed: 2015-11-19 05:45:07 UTC
Description
Martin Kletzander
2014-03-06 14:37:40 UTC
Patch proposed upstream:
https://www.redhat.com/archives/libvir-list/2014-March/msg00415.html

Fixed upstream with commit v1.2.2-201-g676cb4f:

commit 676cb4f4e762b8682a06c6dab1f690fbcd939550
Author: Martin Kletzander <mkletzan>
Date:   Thu Mar 6 17:20:11 2014 +0100

    virsh: Add keepalive in new vshConnect function

in series with optional commit v1.2.2-202-gb0cf7d6:

commit b0cf7d64614ea1000424534ebbd5738d254c7410
Author: Martin Kletzander <mkletzan>
Date:   Fri Mar 7 11:15:39 2014 +0100

    virsh: Prohibit virConnectOpen* functions in virsh

I could reproduce this issue with libvirt-1.1.1-24.el7.x86_64.

Steps to Reproduce:
1. Start a domain
   # virsh start <domain>
2. Start live migration
   # virsh migrate --live <domain> qemu+ssh://host2/system
3. During the migration, stop the network on the target host
   # ip link set dev ens3 down

Actual results:
virsh hangs until the ssh connection times out.

I have verified this issue with libvirt-1.2.7-1.el7.x86_64. Here are the verification steps:
1. Start a domain
   # virsh start <domain>
2. Start live migration
   # virsh -k1 -K5 migrate --live rhel6 qemu+ssh://xx.xx.xx.xx/system --verbose
   root.xx.xx's password:
3. During the migration, stop the network on the target host
   # ip link set dev ens3 down

Actual results:
The following info is printed:

2014-08-19 08:04:45.244+0000: 14860: info : libvirt version: 1.2.7, package: 1.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2014-08-04-04:57:05, x86-019.build.eng.bos.redhat.com)
2014-08-19 08:04:45.244+0000: 14860: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fb014001320 after 5 keepalive messages in 6 seconds
2014-08-19 08:04:45.244+0000: 14861: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fb014001320 after 5 keepalive messages in 6 seconds
error: internal error: received hangup / error event on socket

However, the virsh client still hangs after the keepalive times out even though libvirtd is disconnected, and we think it would be good if the client did not try to reconnect to libvirtd after the timeout. It seems we need another patch for this issue:
https://www.redhat.com/archives/libvir-list/2014-November/msg01080.html

So I am moving this back to ASSIGNED.

Thanks,
Luyao Huang

Fixed upstream by v1.2.10-284-g48abdf5 and v1.2.10-285-gf127138:

commit 48abdf5de7dbb81d4091235efc57cf07fd45cb86
Author: Martin Kletzander <mkletzan>
Date:   Mon Dec 1 11:46:14 2014 +0100

    virsh: Don't reconnect after the command when disconnected

commit f1271380381bb49b4a4c38950fe77a60d19ea9a3
Author: Martin Kletzander <mkletzan>
Date:   Sun Nov 30 20:09:08 2014 +0100

    rpc: Report proper close reason
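For reference, a minimal sketch of what the new options enable (host2 is the placeholder target host from the reproduction steps above; the timeout arithmetic follows the k*(K+1) behaviour exercised throughout this BZ):

    # Send a keepalive probe every 5 seconds and give up after 3 unanswered
    # probes, i.e. declare the connection dead after about 5 * (3 + 1) = 20 s.
    virsh --keepalive-interval 5 --keepalive-count 3 \
          -c qemu+ssh://host2/system list --all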
Hi, Martin, I'm doing some testing of the libvirtd keepalive mechanism. I found that the client can reconnect to the server in one session after the network connection is recovered; is this the new design?

First, on the server, configure:
keepalive_interval = 5
keepalive_count = 5

Scenario 1:
1. Connect to the server on the client:
   # virsh -k0 -K0 -c qemu+ssh://10.66.6.6/system
   root.6.6's password:
   Welcome to virsh, the virtualization interactive terminal.

   Type:  'help' for help with commands
          'quit' to quit

   virsh #
2. Set an iptables rule on the client:
   # iptables -A INPUT -s 10.66.6.6 -j DROP
3. Within 30s, issue the command "list" on the client; it will hang:
   virsh # list
4. After 30s, clear the iptables rule on the client:
   # iptables -F
5. Wait several minutes; "list" will return without error on the client, but doesn't list the running domain on the server:
   virsh # list
    Id    Name                           State
   ----------------------------------------------------

6. "list" again; it will reconnect to the server and list the running domain on the server:
   virsh # list
   root.6.6's password:
   error: Reconnected to the hypervisor
    Id    Name                           State
   ----------------------------------------------------
    2     r71                            running

Scenario 2:
1. Connect to the server on the client:
   # virsh -k0 -K0 -c qemu+ssh://10.66.6.6/system
   root.6.6's password:
   Welcome to virsh, the virtualization interactive terminal.

   Type:  'help' for help with commands
          'quit' to quit

   virsh #
2. Set an iptables rule on the client:
   # iptables -A INPUT -s 10.66.6.6 -j DROP
3. After 30s, issue the command "list" on the client; it will hang:
   virsh # list
4. Then clear the iptables rule on the client:
   # iptables -F
5. Wait several minutes; "list" will return with an error:
   # list
   error: Failed to list domains
   error: Cannot recv data: Ncat: Broken pipe.: Connection reset by peer
6. "list" again; it will reconnect to the server and list the running domain on the server:
   virsh # list
   root.6.6's password:
   error: Reconnected to the hypervisor
    Id    Name                           State
   ----------------------------------------------------
    2     r71                            running

Well, what you are asking about has always been the case (although point 5 in scenario 1 is really weird). That's why we added the support for keepalive, which you are turning off using '-k0' (the following '-K0' has no meaning after that). And if you want virsh to connect when you issue a "list" command, that's expected; the client must reconnect at some point for it to run the command you wanted.

Hello Martin, I found that if I set keepalive_required = 1, the virsh client always fails to connect to the hypervisor:

# virsh -k4 -K5 -c qemu+tcp://10.66.6.6/system
error: failed to connect to the hypervisor
error: operation failed: keepalive support is required to connect

In the libvirtd.conf file, it says:

# If set to 1, libvirtd will refuse to talk to clients that do not
# support keepalive protocol.  Defaults to 0.
#
# keepalive_required = 1

I thought the virsh client does support the keepalive protocol when I use "virsh -k4 -K5", so why can I still not connect to the hypervisor? Thanks! fjin

Looking at the source, that option *NEVER* worked. Setting that option would effectively disable incoming connections from *all* clients. I wouldn't stop this BZ because of a bug in the daemon. Thanks for finding that out; it's a wonder that nobody tried using that option since 2011 when it was introduced. Also, I think that option should not allow connecting from clients that do not support _replying_ to server keepalives, not those that do not request client keepalives.
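For context, the server-side keepalive knobs discussed above live in /etc/libvirt/libvirtd.conf; a minimal sketch using the values from the scenarios (the keepalive_required line is the option the comment above identifies as broken):

    # /etc/libvirt/libvirtd.conf -- server-side keepalive settings
    keepalive_interval = 5    # probe each connected client every 5 seconds
    keepalive_count = 5       # close the connection after 5 unanswered probes
    #keepalive_required = 1   # per the comment above, this option never worked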
Two questions:
1. When the virsh client keepalive times out, should the command "virsh migrate" exit directly, or should it still hang?
2. When the virsh client keepalive times out, it outputs an error message from the libvirt API; do we allow such an error message to be displayed to the user?

My steps are as below:
1. [root@fjin-4-141 ~]# virsh -k2 -K20 migrate mig1 qemu+ssh://10.66.6.6/system --verbose
   Migration: [  1 %]
2. Then break the network connection between the virsh client and the server and wait (20+1)*2=42s; an error message is output, and virsh hangs:
   2015-07-21 03:21:48.788+0000: 4609: info : libvirt version: 1.2.17, package: 2.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-07-10-07:33:51, x86-035.build.eng.bos.redhat.com)
   2015-07-21 03:21:48.788+0000: 4609: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f380e6c1f40 after 20 keepalive messages in 42 seconds
   Migration: [  1 %]
3. Then recover the network connection. Virsh will return:
   Migration: [  1 %]error: operation failed: migration job: unexpectedly failed
   [root@fjin-4-141 ~]#

(In reply to Martin Kletzander from comment #16)
> Looking at the source, that option *NEVER* worked. Setting that option would
> effectively disable incoming connections from *all* clients. I wouldn't
> stop this BZ because of a bug in the daemon. Thanks for finding that out; it's
> a wonder that nobody tried using that option since 2011 when it was introduced.

Will you fix this under this BZ?

(In reply to JinFangge from comment #19)
I don't think this needs immediate fixing; nobody has used that option since it was introduced 4 years ago, so I wouldn't want this BZ to be stuck just because of such a tiny code removal. You can create another one if you want; I already sent the patch for that upstream:
https://www.redhat.com/archives/libvir-list/2015-July/msg00738.html

(In reply to JinFangge from comment #18)
Could you check where the virsh client is stuck? Running the following command (after you have installed libvirt-debuginfo) should be enough if you don't have any other virsh running:

gdb virsh $(pidof virsh) <<< "t a a bt"

Please paste the output into an attachment for this BZ.

Created attachment 1055165 [details]
gdb backtrace output
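An equivalent non-interactive form of the gdb one-liner above, for anyone reproducing this (a sketch; like the original it assumes a single running virsh process and libvirt-debuginfo installed):

    # "thread apply all bt" prints a backtrace of every thread, then gdb exits
    gdb -batch -ex 'thread apply all bt' -p "$(pidof virsh)"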
The gdb output is captured after virsh printed the message about being disconnected due to keepalive?

(In reply to Martin Kletzander from comment #23)
> The gdb output is captured after virsh printed the message about being
> disconnected due to keepalive?

Yes.

Hello Martin, how is it going with this bug (comment #18)? And I noticed there are new settings in libvirtd.conf for the admin interface; do we need to test these, and how?

I still have to go through the stuff in comment #18. The admin-related values are not something you can use, and you don't need to care about those for now.

Actually, thinking about it, I think the particular problem in comment #18 could be extrapolated into a new bug, as it only happens during migration, the migration itself is not affected by the fact that virsh does not disconnect, and the user sees what happened. Could you create a new one for this particular case? Thank you very much.

One more small question: the error messages below seem to come from the libvirt API; should they be displayed to the user at all?

# virsh -k2 -K20 migrate mig1 qemu+ssh://10.66.6.6/system --verbose
Migration: [  1 %]2015-07-21 03:21:48.788+0000: 4609: info : libvirt version: 1.2.17, package: 2.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-07-10-07:33:51, x86-035.build.eng.bos.redhat.com)
2015-07-21 03:21:48.788+0000: 4609: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f380e6c1f40 after 20 keepalive messages in 42 seconds
Migration: [  1 %]

(In reply to Martin Kletzander from comment #27)
> Actually, thinking about it, I think the particular problem in comment #18
> could be extrapolated into a new bug, as it only happens during migration,
> the migration itself is not affected by the fact that virsh does not
> disconnect, and the user sees what happened. Could you create a new one
> for this particular case? Thank you very much.

Created a new bug for this: bug 1256213
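The verification below exercises the keepalive timeout with the same network-break trick used earlier in this BZ; as a self-contained sketch (10.66.6.6 is the remote libvirtd host from the scenarios, and the 35-second wait is the default timeout 5*(6+1)):

    #!/bin/sh
    # Simulate a dead path from the server, wait past the keepalive timeout,
    # then restore connectivity. Run on the client host while virsh is connected.
    REMOTE=10.66.6.6
    iptables -A INPUT -s "$REMOTE" -j DROP    # drop all traffic coming from the server
    sleep 35                                  # > k*(K+1) = 5*(6+1) = 35 s for the defaults
    iptables -D INPUT -s "$REMOTE" -j DROP    # remove the rule again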
Verified this new feature on build libvirt-client-1.2.17-6.el7.x86_64.

Scenario 1: Test that virsh can detect that no messages have been received from libvirtd for k*(K+1) seconds and close the connection.

Test matrix 1:
+------------+------------+---------------+
+ -k         + -K         + libvirtd.conf +
+------------+------------+---------------+
+ Default(5) + Default(6) + disable KA    +
+ 2          + Default(6) + disable KA    +
+ 2          + 3          + disable KA    +
+------------+------------+---------------+

Steps:
Prepare two hosts: host A (local host), host B (10.66.6.6: remote host)

# virsh -c qemu+ssh://10.66.6.6/system
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(virsh hangs for 35s, then prints the error message and returns)
2015-09-01 08:40:40.423+0000: 28396: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-09-01 08:40:40.423+0000: 28396: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f0b95180a50 after 6 keepalive messages in 35 seconds
2015-09-01 08:40:40.423+0000: 28397: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f0b95180a50 after 6 keepalive messages in 35 seconds
error: Failed to list domains
error: internal error: received hangup / error event on socket

virsh #
(# iptables -D INPUT -s 10.66.6.6 -j DROP on the local host, then list again; virsh will reconnect to the remote hypervisor)
virsh # list
error: Reconnected to the hypervisor
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running

Test matrix 2:
+-----+-----+---------------+
+ -k  + -K  + libvirtd.conf +
+-----+-----+---------------+
+ 2   + 0   + disable KA    +
+-----+-----+---------------+

Steps:
# virsh -k2 -K0 -c qemu+ssh://10.66.6.6/system
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # (just wait 2s)
2015-09-01 09:05:58.643+0000: 29006: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-09-01 09:05:58.643+0000: 29006: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe87d2dda50 after 0 keepalive messages in 2 seconds

virsh # list
error: Reconnected to the hypervisor
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running

Scenario 2: Test that if the network is recovered within k*(K+1) seconds, the virsh connection is kept.

+-----+-----+---------------+
+ -k  + -K  + libvirtd.conf +
+-----+-----+---------------+
+ 4   + 5   + disable KA    +
+-----+-----+---------------+

# virsh -c qemu+ssh://10.66.6.6/system
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(virsh hangs; then clear the iptables rule within 24s)
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running

virsh #

Scenario 3: Test that when the keepalive mechanism is disabled in virsh, virsh does not close the connection.

+-----+-----+---------------+
+ -k  + -K  + libvirtd.conf +
+-----+-----+---------------+
+ 0   + -   + disable KA    +
+-----+-----+---------------+

# virsh -k0 -c qemu+ssh://10.66.6.6/system
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(virsh hangs; after 1 min, clear the iptables rule, wait a while, and virsh will return)
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running
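Scenario 4 below relies on client-side debug logging to watch the keepalive traffic; a compact sketch of the same setup for a single invocation (the grep filter is only an illustration to narrow the output to keepalive messages):

    # Enable libvirt client debug logging for one virsh run and keep only
    # the keepalive-related lines.
    LIBVIRT_DEBUG=1 virsh -k10 -K3 -c qemu+ssh://10.66.6.6/system list 2>&1 | grep -i keepalive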
Scenario 4: Test that virsh keepalive and libvirtd keepalive can work well at the same time.

+-----+-----+-----------------------+
+ -k  + -K  + libvirtd.conf         +
+-----+-----+-----------------------+
+ 10  + 3   + keepalive_interval=5  +
+     +     + keepalive_count=6     +
+-----+-----+-----------------------+

First, turn on verbose debugging of all libvirt API calls on the local host:
# declare -x LIBVIRT_DEBUG=1
# virsh -k10 -K3 -c qemu+ssh://10.66.6.6/system
....
(virsh receives a keepalive request from the remote libvirtd every 5s)
2015-09-01 09:44:06.407+0000: 29659: debug : virKeepAliveCheckMessage:398 : Got keepalive request from client 0x7ff240ac8e30
2015-09-01 09:44:06.407+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive response to client 0x7ff240ac8e30
2015-09-01 09:44:11.413+0000: 29659: debug : virKeepAliveCheckMessage:398 : Got keepalive request from client 0x7ff240ac8e30
2015-09-01 09:44:11.413+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive response to client 0x7ff240ac8e30
2015-09-01 09:44:16.420+0000: 29659: debug : virKeepAliveCheckMessage:398 : Got keepalive request from client 0x7ff240ac8e30
2015-09-01 09:44:16.420+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive response to client 0x7ff240ac8e30
...
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(it hangs for 40s, during which virsh sends 3 keepalive requests without receiving any response, then returns with an error)
2015-09-01 09:44:26.610+0000: 29658: debug : virKeepAliveMessage:104 : Sending keepalive request to client 0x7ff240ac8e30
2015-09-01 09:44:36.610+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive request to client 0x7ff240ac8e30
2015-09-01 09:44:46.620+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive request to client 0x7ff240ac8e30
2015-09-01 09:44:56.623+0000: 29659: warning : virKeepAliveTimerInternal:143 : No response from client 0x7ff240ac8e30 after 3 keepalive messages in 40 seconds
error: Failed to list domains
error: internal error: received hangup / error event on socket

virsh #

Scenarios 1~4 above verify this BZ. A small issue is recorded in this bug: Bug 1243684 - Virsh client doesn't print error message when the connection is reset by server on some occasion.

(In reply to JinFangge from comment #28)
To be honest, I did not research how these pop out, even the 'info'-type output that normally does not show. It could look better, I agree.

Hello Martin, the "-k"/"-K" descriptions in "man virsh" don't include an explanation of their default values; could you update them to make that clear?

-k, --keepalive-interval INTERVAL
    Set an INTERVAL (in seconds) for sending keepalive messages to check whether the connection to the server is still alive. Setting the interval to 0 disables the client keepalive mechanism.

-K, --keepalive-count COUNT
    Set the number of times a keepalive message can be sent without getting an answer from the server before marking the connection dead. This setting has no effect if the INTERVAL is set to 0.

(In reply to Martin Kletzander from comment #35)
> (In reply to JinFangge from comment #28)
> To be honest, I did not research how these pop out, even the 'info'-type
> output that normally does not show. It could look better, I agree.

When I set "LIBVIRT_DEBUG=4", these messages do not appear.

The defaults should not be of any concern to the end user, and they might change at any point in the future. Users should only be concerned with setting the options to the values they need (in case the default doesn't fit them, which they will see) or turning them off. As for the messages appearing, that will be dealt with in the second BZ, which also deals with proper handling of API disconnections and is currently assigned to jdenemar.
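To make the -k/-K semantics above concrete: the effective client-side timeout observed throughout this BZ is INTERVAL * (COUNT + 1). A quick shell check against the waits reported in the scenarios:

    # timeout = interval * (count + 1), matching the observed waits:
    echo $((5 * (6 + 1)))    # defaults (5, 6)  -> 35 s, as in scenario 1
    echo $((2 * (20 + 1)))   # -k2 -K20         -> 42 s, as in comment #18
    echo $((10 * (3 + 1)))   # -k10 -K3         -> 40 s, as in scenario 4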
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html