Bug 1073506
Summary: [RFE] Add keepalive support into virsh
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: low
Priority: low
Reporter: Martin Kletzander <mkletzan>
Assignee: Martin Kletzander <mkletzan>
QA Contact: Virtualization Bugs <virt-bugs>
CC: cwei, dyuan, fjin, gsun, herrold, jdenemar, juzhou, leiwan, lhuang, lmiksik, mkletzan, mzhan, rbalakri, veillard, weizhan, zhwang, zpeng
Target Milestone: rc
Keywords: FutureFeature, Upstream
Fixed In Version: libvirt-1.2.13-1.el7
Doc Type: Enhancement
Clone Of: 822839
Type: Bug
Last Closed: 2015-11-19 05:45:07 UTC
Description
Martin Kletzander
2014-03-06 14:37:40 UTC
Patch proposed upstream:
https://www.redhat.com/archives/libvir-list/2014-March/msg00415.html

Fixed upstream with commit v1.2.2-201-g676cb4f:

commit 676cb4f4e762b8682a06c6dab1f690fbcd939550
Author: Martin Kletzander <mkletzan>
Date:   Thu Mar 6 17:20:11 2014 +0100

    virsh: Add keepalive in new vshConnect function

in series with optional commit v1.2.2-202-gb0cf7d6:

commit b0cf7d64614ea1000424534ebbd5738d254c7410
Author: Martin Kletzander <mkletzan>
Date:   Fri Mar 7 11:15:39 2014 +0100

    virsh: Prohibit virConnectOpen* functions in virsh

I could reproduce this issue with libvirt-1.1.1-24.el7.x86_64.

Steps to Reproduce:
1. Start a domain
   # virsh start <domain>
2. Start live migration
   # virsh migrate --live <domain> qemu+ssh://host2/system
3. During the migration, stop the network on the target host
   # ip link set dev ens3 down

Actual results:
virsh hangs until the ssh connection times out.

I have verified this issue with libvirt-1.2.7-1.el7.x86_64. Here are the verification steps:
1. Start a domain
   # virsh start <domain>
2. Start live migration
   # virsh -k1 -K5 migrate --live rhel6 qemu+ssh://xx.xx.xx.xx/system --verbose
   root.xx.xx's password:
3. During the migration, stop the network on the target host
   # ip link set dev ens3 down

Actual results:
The following info is printed:

2014-08-19 08:04:45.244+0000: 14860: info : libvirt version: 1.2.7, package: 1.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2014-08-04-04:57:05, x86-019.build.eng.bos.redhat.com)
2014-08-19 08:04:45.244+0000: 14860: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fb014001320 after 5 keepalive messages in 6 seconds
2014-08-19 08:04:45.244+0000: 14861: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fb014001320 after 5 keepalive messages in 6 seconds
error: internal error: received hangup / error event on socket

However, the virsh client still hangs after the keepalive times out even though libvirtd is disconnected, and we think it would be good if the client did not try to reconnect to libvirtd after the timeout. It seems we need another patch for this issue:
https://www.redhat.com/archives/libvir-list/2014-November/msg01080.html

So I am moving this back to ASSIGNED.

Thanks,
Luyao Huang

Fixed upstream by v1.2.10-284-g48abdf5 and v1.2.10-285-gf127138:

commit 48abdf5de7dbb81d4091235efc57cf07fd45cb86
Author: Martin Kletzander <mkletzan>
Date:   Mon Dec 1 11:46:14 2014 +0100

    virsh: Don't reconnect after the command when disconnected

commit f1271380381bb49b4a4c38950fe77a60d19ea9a3
Author: Martin Kletzander <mkletzan>
Date:   Sun Nov 30 20:09:08 2014 +0100

    rpc: Report proper close reason
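For reference, a minimal sketch of what the new options enable (host2 is the placeholder target host from the reproduction steps above; the timeout arithmetic follows the k*(K+1) behaviour exercised throughout this BZ):

    # Send a keepalive probe every 5 seconds and give up after 3 unanswered
    # probes, i.e. declare the connection dead after about 5 * (3 + 1) = 20 s.
    virsh --keepalive-interval 5 --keepalive-count 3 \
          -c qemu+ssh://host2/system list --all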
Hi, Martin, I'm doing some testing of the libvirtd keepalive mechanism. I found that the client can reconnect to the server in one session after the network connection is recovered; is this the new design?

First, on the server, configure:
keepalive_interval = 5
keepalive_count = 5

Scenario 1:
1. Connect to the server on the client:
   # virsh -k0 -K0 -c qemu+ssh://10.66.6.6/system
   root.6.6's password:
   Welcome to virsh, the virtualization interactive terminal.

   Type:  'help' for help with commands
          'quit' to quit

   virsh #
2. Set an iptables rule on the client:
   # iptables -A INPUT -s 10.66.6.6 -j DROP
3. Within 30s, issue the command "list" on the client; it will hang:
   virsh # list
4. After 30s, clear the iptables rule on the client:
   # iptables -F
5. Wait several minutes; "list" will return without error on the client, but doesn't list the running domain on the server:
   virsh # list
    Id    Name                           State
   ----------------------------------------------------

6. "list" again; it will reconnect to the server and list the running domain on the server:
   virsh # list
   root.6.6's password:
   error: Reconnected to the hypervisor
    Id    Name                           State
   ----------------------------------------------------
    2     r71                            running

Scenario 2:
1. Connect to the server on the client:
   # virsh -k0 -K0 -c qemu+ssh://10.66.6.6/system
   root.6.6's password:
   Welcome to virsh, the virtualization interactive terminal.

   Type:  'help' for help with commands
          'quit' to quit

   virsh #
2. Set an iptables rule on the client:
   # iptables -A INPUT -s 10.66.6.6 -j DROP
3. After 30s, issue the command "list" on the client; it will hang:
   virsh # list
4. Then clear the iptables rule on the client:
   # iptables -F
5. Wait several minutes; "list" will return with an error:
   # list
   error: Failed to list domains
   error: Cannot recv data: Ncat: Broken pipe.: Connection reset by peer
6. "list" again; it will reconnect to the server and list the running domain on the server:
   virsh # list
   root.6.6's password:
   error: Reconnected to the hypervisor
    Id    Name                           State
   ----------------------------------------------------
    2     r71                            running

Well, what you are asking about has always been the case (although point 5 in scenario 1 is really weird). That's why we added the support for keepalive, which you are turning off using '-k0' (the following '-K0' has no meaning after that). And if you want virsh to connect when you issue a "list" command, that's expected; the client must reconnect at some point for it to run the command you wanted.

Hello Martin, I found that if I set keepalive_required = 1, the virsh client always fails to connect to the hypervisor:

# virsh -k4 -K5 -c qemu+tcp://10.66.6.6/system
error: failed to connect to the hypervisor
error: operation failed: keepalive support is required to connect

In the libvirtd.conf file, it says:

# If set to 1, libvirtd will refuse to talk to clients that do not
# support keepalive protocol.  Defaults to 0.
#
# keepalive_required = 1

I thought the virsh client does support the keepalive protocol when I use "virsh -k4 -K5", so why can I still not connect to the hypervisor? Thanks! fjin

Looking at the source, that option *NEVER* worked. Setting that option would effectively disable incoming connections from *all* clients. I wouldn't stop this BZ because of a bug in the daemon. Thanks for finding that out; it's a wonder that nobody tried using that option since 2011 when it was introduced. Also, I think that option should not allow connecting from clients that do not support _replying_ to server keepalives, not those that do not request client keepalives.
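For context, the server-side keepalive knobs discussed above live in /etc/libvirt/libvirtd.conf; a minimal sketch using the values from the scenarios (the keepalive_required line is the option the comment above identifies as broken):

    # /etc/libvirt/libvirtd.conf -- server-side keepalive settings
    keepalive_interval = 5    # probe each connected client every 5 seconds
    keepalive_count = 5       # close the connection after 5 unanswered probes
    #keepalive_required = 1   # per the comment above, this option never worked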
Two questions:
1. When the virsh client keepalive times out, should the command "virsh migrate" exit directly, or should it still hang?
2. When the virsh client keepalive times out, it outputs an error message from the libvirt API; do we allow such an error message to be displayed to the user?

My steps are as below:
1. [root@fjin-4-141 ~]# virsh -k2 -K20 migrate mig1 qemu+ssh://10.66.6.6/system --verbose
   Migration: [  1 %]
2. Then break the network connection between the virsh client and the server and wait (20+1)*2=42s; an error message is output, and virsh hangs:
   2015-07-21 03:21:48.788+0000: 4609: info : libvirt version: 1.2.17, package: 2.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-07-10-07:33:51, x86-035.build.eng.bos.redhat.com)
   2015-07-21 03:21:48.788+0000: 4609: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f380e6c1f40 after 20 keepalive messages in 42 seconds
   Migration: [  1 %]
3. Then recover the network connection. Virsh will return:
   Migration: [  1 %]error: operation failed: migration job: unexpectedly failed
   [root@fjin-4-141 ~]#

(In reply to Martin Kletzander from comment #16)
> Looking at the source, that option *NEVER* worked. Setting that option would
> effectively disable incoming connections from *all* clients. I wouldn't
> stop this BZ because of a bug in the daemon. Thanks for finding that out; it's
> a wonder that nobody tried using that option since 2011 when it was introduced.

Will you fix this under this BZ?

(In reply to JinFangge from comment #19)
I don't think this needs immediate fixing; nobody has used that option since it was introduced 4 years ago, so I wouldn't want this BZ to be stuck just because of such a tiny code removal. You can create another one if you want; I already sent the patch for that upstream:
https://www.redhat.com/archives/libvir-list/2015-July/msg00738.html

(In reply to JinFangge from comment #18)
Could you check where the virsh client is stuck? Running the following command (after you have installed libvirt-debuginfo) should be enough if you don't have any other virsh running:

gdb virsh $(pidof virsh) <<< "t a a bt"

Please paste the output into an attachment for this BZ.

Created attachment 1055165 [details]
gdb backtrace output
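An equivalent non-interactive form of the gdb one-liner above, for anyone reproducing this (a sketch; like the original it assumes a single running virsh process and libvirt-debuginfo installed):

    # "thread apply all bt" prints a backtrace of every thread, then gdb exits
    gdb -batch -ex 'thread apply all bt' -p "$(pidof virsh)"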
The gdb output is captured after virsh printed the message about being disconnected due to keepalive?

(In reply to Martin Kletzander from comment #23)
> The gdb output is captured after virsh printed the message about being
> disconnected due to keepalive?

Yes.

Hello Martin, how is it going with this bug (comment #18)? And I noticed there are new settings in libvirtd.conf for the admin interface; do we need to test these, and how?

I still have to go through the stuff in comment #18. The admin-related values are not something you can use, and you don't need to care about those for now.

Actually, thinking about it, I think the particular problem in comment #18 could be extrapolated into a new bug, as it only happens during migration, the migration itself is not affected by the fact that virsh does not disconnect, and the user sees what happened. Could you create a new one for this particular case? Thank you very much.

One more small question: the error messages below seem to come from the libvirt API; should they be displayed to the user at all?

# virsh -k2 -K20 migrate mig1 qemu+ssh://10.66.6.6/system --verbose
Migration: [  1 %]2015-07-21 03:21:48.788+0000: 4609: info : libvirt version: 1.2.17, package: 2.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-07-10-07:33:51, x86-035.build.eng.bos.redhat.com)
2015-07-21 03:21:48.788+0000: 4609: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f380e6c1f40 after 20 keepalive messages in 42 seconds
Migration: [  1 %]

(In reply to Martin Kletzander from comment #27)
> Actually, thinking about it, I think the particular problem in comment #18
> could be extrapolated into a new bug, as it only happens during migration,
> the migration itself is not affected by the fact that virsh does not
> disconnect, and the user sees what happened. Could you create a new one
> for this particular case? Thank you very much.

Created a new bug for this: bug 1256213
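The verification below exercises the keepalive timeout with the same network-break trick used earlier in this BZ; as a self-contained sketch (10.66.6.6 is the remote libvirtd host from the scenarios, and the 35-second wait is the default timeout 5*(6+1)):

    #!/bin/sh
    # Simulate a dead path from the server, wait past the keepalive timeout,
    # then restore connectivity. Run on the client host while virsh is connected.
    REMOTE=10.66.6.6
    iptables -A INPUT -s "$REMOTE" -j DROP    # drop all traffic coming from the server
    sleep 35                                  # > k*(K+1) = 5*(6+1) = 35 s for the defaults
    iptables -D INPUT -s "$REMOTE" -j DROP    # remove the rule again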
Verified this new feature on build libvirt-client-1.2.17-6.el7.x86_64.

Scenario 1: Test that virsh can detect that no messages have been received from libvirtd for k*(K+1) seconds and close the connection.

Test matrix 1:
+------------+------------+---------------+
+ -k         + -K         + libvirtd.conf +
+------------+------------+---------------+
+ Default(5) + Default(6) + disable KA    +
+ 2          + Default(6) + disable KA    +
+ 2          + 3          + disable KA    +
+------------+------------+---------------+

Steps:
Prepare two hosts: host A (local host), host B (10.66.6.6: remote host)

# virsh -c qemu+ssh://10.66.6.6/system
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(virsh hangs for 35s, then prints the error message and returns)
2015-09-01 08:40:40.423+0000: 28396: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-09-01 08:40:40.423+0000: 28396: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f0b95180a50 after 6 keepalive messages in 35 seconds
2015-09-01 08:40:40.423+0000: 28397: warning : virKeepAliveTimerInternal:143 : No response from client 0x7f0b95180a50 after 6 keepalive messages in 35 seconds
error: Failed to list domains
error: internal error: received hangup / error event on socket

virsh #
(# iptables -D INPUT -s 10.66.6.6 -j DROP on the local host, then list again; virsh will reconnect to the remote hypervisor)
virsh # list
error: Reconnected to the hypervisor
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running

Test matrix 2:
+-----+-----+---------------+
+ -k  + -K  + libvirtd.conf +
+-----+-----+---------------+
+ 2   + 0   + disable KA    +
+-----+-----+---------------+

Steps:
# virsh -k2 -K0 -c qemu+ssh://10.66.6.6/system
Welcome to virsh, the virtualization interactive terminal.

Type:  'help' for help with commands
       'quit' to quit

virsh # (just wait 2s)
2015-09-01 09:05:58.643+0000: 29006: info : libvirt version: 1.2.17, package: 6.el7 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-08-21-20:23:32, x86-035.build.eng.bos.redhat.com)
2015-09-01 09:05:58.643+0000: 29006: warning : virKeepAliveTimerInternal:143 : No response from client 0x7fe87d2dda50 after 0 keepalive messages in 2 seconds

virsh # list
error: Reconnected to the hypervisor
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running

Scenario 2: Test that if the network is recovered within k*(K+1) seconds, the virsh connection is kept.

+-----+-----+---------------+
+ -k  + -K  + libvirtd.conf +
+-----+-----+---------------+
+ 4   + 5   + disable KA    +
+-----+-----+---------------+

# virsh -c qemu+ssh://10.66.6.6/system
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(virsh hangs; then clear the iptables rule within 24s)
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running

virsh #

Scenario 3: Test that when the keepalive mechanism is disabled in virsh, virsh does not close the connection.

+-----+-----+---------------+
+ -k  + -K  + libvirtd.conf +
+-----+-----+---------------+
+ 0   + -   + disable KA    +
+-----+-----+---------------+

# virsh -k0 -c qemu+ssh://10.66.6.6/system
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(virsh hangs; after 1 min, clear the iptables rule, wait a while, and virsh will return)
 Id    Name                           State
----------------------------------------------------
 2     rhel6.6-GUI                    running
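Scenario 4 below relies on client-side debug logging to watch the keepalive traffic; a compact sketch of the same setup for a single invocation (the grep filter is only an illustration to narrow the output to keepalive messages):

    # Enable libvirt client debug logging for one virsh run and keep only
    # the keepalive-related lines.
    LIBVIRT_DEBUG=1 virsh -k10 -K3 -c qemu+ssh://10.66.6.6/system list 2>&1 | grep -i keepalive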
Scenario 4: Test that virsh keepalive and libvirtd keepalive can work well at the same time.

+-----+-----+-----------------------+
+ -k  + -K  + libvirtd.conf         +
+-----+-----+-----------------------+
+ 10  + 3   + keepalive_interval=5  +
+     +     + keepalive_count=6     +
+-----+-----+-----------------------+

First, turn on verbose debugging of all libvirt API calls on the local host:
# declare -x LIBVIRT_DEBUG=1
# virsh -k10 -K3 -c qemu+ssh://10.66.6.6/system
....
(virsh receives a keepalive request from the remote libvirtd every 5s)
2015-09-01 09:44:06.407+0000: 29659: debug : virKeepAliveCheckMessage:398 : Got keepalive request from client 0x7ff240ac8e30
2015-09-01 09:44:06.407+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive response to client 0x7ff240ac8e30
2015-09-01 09:44:11.413+0000: 29659: debug : virKeepAliveCheckMessage:398 : Got keepalive request from client 0x7ff240ac8e30
2015-09-01 09:44:11.413+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive response to client 0x7ff240ac8e30
2015-09-01 09:44:16.420+0000: 29659: debug : virKeepAliveCheckMessage:398 : Got keepalive request from client 0x7ff240ac8e30
2015-09-01 09:44:16.420+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive response to client 0x7ff240ac8e30
...
(# iptables -A INPUT -s 10.66.6.6 -j DROP on the local host)
virsh # list
(it hangs for 40s, during which virsh sends 3 keepalive requests without receiving any response, then returns with an error)
2015-09-01 09:44:26.610+0000: 29658: debug : virKeepAliveMessage:104 : Sending keepalive request to client 0x7ff240ac8e30
2015-09-01 09:44:36.610+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive request to client 0x7ff240ac8e30
2015-09-01 09:44:46.620+0000: 29659: debug : virKeepAliveMessage:104 : Sending keepalive request to client 0x7ff240ac8e30
2015-09-01 09:44:56.623+0000: 29659: warning : virKeepAliveTimerInternal:143 : No response from client 0x7ff240ac8e30 after 3 keepalive messages in 40 seconds
error: Failed to list domains
error: internal error: received hangup / error event on socket

virsh #

Scenarios 1~4 above verify this BZ. A small issue is recorded in this bug: Bug 1243684 - Virsh client doesn't print error message when the connection is reset by server on some occasion.

(In reply to JinFangge from comment #28)
To be honest, I did not research how these pop out, even the 'info'-type output that normally does not show. It could look better, I agree.

Hello Martin, the "-k"/"-K" descriptions in "man virsh" don't include an explanation of their default values; could you update them to make that clear?

-k, --keepalive-interval INTERVAL
    Set an INTERVAL (in seconds) for sending keepalive messages to check whether the connection to the server is still alive. Setting the interval to 0 disables the client keepalive mechanism.

-K, --keepalive-count COUNT
    Set the number of times a keepalive message can be sent without getting an answer from the server before marking the connection dead. This setting has no effect if the INTERVAL is set to 0.

(In reply to Martin Kletzander from comment #35)
> (In reply to JinFangge from comment #28)
> To be honest, I did not research how these pop out, even the 'info'-type
> output that normally does not show. It could look better, I agree.

When I set "LIBVIRT_DEBUG=4", these messages do not appear.

The defaults should not be of any concern to the end user, and they might change at any point in the future. Users should only be concerned with setting the options to the values they need (in case the default doesn't fit them, which they will see) or turning them off. As for the messages appearing, that will be dealt with in the second BZ, which also deals with proper handling of API disconnections and is currently assigned to jdenemar.
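To make the -k/-K semantics above concrete: the effective client-side timeout observed throughout this BZ is INTERVAL * (COUNT + 1). A quick shell check against the waits reported in the scenarios:

    # timeout = interval * (count + 1), matching the observed waits:
    echo $((5 * (6 + 1)))    # defaults (5, 6)  -> 35 s, as in scenario 1
    echo $((2 * (20 + 1)))   # -k2 -K20         -> 42 s, as in comment #18
    echo $((10 * (3 + 1)))   # -k10 -K3         -> 40 s, as in scenario 4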
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-2202.html