Bug 1208489 - HE active hypervisor not responding to "hosted-engine --vm-status" after "iptables -I INPUT -s 10.35.160.108 -j DROP" is applied.
Summary: HE active hypervisor not responding to "hosted-engine --vm-status" after "iptables -I INPUT -s 10.35.160.108 -j DROP" is applied.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-hosted-engine-ha
Version: 3.5.1
Hardware: x86_64
OS: Linux
Priority: low
Severity: urgent
Target Milestone: ovirt-3.6.0-rc
Target Release: 3.6.0
Assignee: Dudi Maroshi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1085523
Depends On:
Blocks:
 
Reported: 2015-04-02 11:43 UTC by Nikolai Sednev
Modified: 2016-03-09 19:49 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When a self-hosted engine host client requested status from the Manager virtual machine (hosted-engine --vm-status) and a connection to the storage domain could not be established, the client hung indefinitely waiting for a response from the ovirt-ha-broker. With this update, a connection timeout is added, and if the storage domain cannot be accessed, an appropriate error message is returned.
Clone Of:
Environment:
Last Closed: 2016-03-09 19:49:03 UTC
oVirt Team: SLA
Target Upstream Version:
Embargoed:


Attachments
all logs (3.62 MB, application/x-gzip)
2015-04-02 12:06 UTC, Nikolai Sednev


Links
System ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata RHEA-2016:0422 | 0 | normal | SHIPPED_LIVE | ovirt-hosted-engine-ha bug fix and enhancement update | 2016-03-09 23:58:25 UTC
oVirt gerrit 40392 | 0 | master | MERGED | hosted-engine: hosted-engine client, with storage connection timeout | Never

Description Nikolai Sednev 2015-04-02 11:43:18 UTC
Description of problem:
HE active hypervisor not responding to "hosted-engine --vm-status" after "iptables -I INPUT -s 10.35.160.108 -j DROP" is applied.

Version-Release number of selected component (if applicable):
RHEVH6.6 20150304.0.el6ev:
sanlock-2.8-1.el6.x86_64
ovirt-node-selinux-3.2.1-9.el6.noarch
ovirt-host-deploy-offline-1.3.0-3.el6ev.x86_64
ovirt-node-plugin-vdsm-0.2.0-19.el6ev.noarch
ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-node-plugin-rhn-3.2.1-9.el6.noarch
ovirt-node-3.2.1-9.el6.noarch
vdsm-4.16.8.1-7.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-9.0.el6ev.x86_64
ovirt-node-plugin-cim-3.2.1-9.el6.noarch
ovirt-node-branding-rhev-3.2.1-9.el6.noarch
libvirt-0.10.2-46.el6_6.3.x86_64
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
ovirt-hosted-engine-setup-1.2.2-1.el6ev.noarch
ovirt-node-plugin-snmp-3.2.1-9.el6.noarch

Engine RHEL6.6
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-3.5.1-0.2.el6ev.noarch
How reproducible:


Steps to Reproduce:
1. Assemble a setup of two RHEV-H hosts with an NFS storage domain for HE only.
2. On the active hypervisor, run "iptables -I INPUT -s 10.35.160.108 -j DROP" (the IP here is your storage domain's IP).
3. Run "hosted-engine --vm-status" on the host and see that it hangs.

Actual results:
The HE VM is shifted to the second host, which is expected, but "hosted-engine --vm-status" hangs and returns nothing on the initially active host.

Expected results:
"hosted-engine --vm-status" should reply with the results.

Additional info:
logs from both hosts and engine.

Comment 1 Nikolai Sednev 2015-04-02 12:06:36 UTC
Created attachment 1010128 [details]
all logs

Comment 3 Doron Fediuck 2015-04-14 16:37:04 UTC
The status verb reads the current stats from the storage.
If the storage is blocked, the utility waits for it indefinitely.

We can add a timeout to the utility. If it expires, the utility will report
that it cannot access the shared storage.
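
For illustration only, a minimal sketch of such a timeout using a SIGALRM
guard around the blocking stats call; the helper and exception names below
are hypothetical, not the actual patch:

import signal

class StorageTimeout(Exception):
    pass

def _on_alarm(signum, frame):
    raise StorageTimeout()

def get_all_host_stats_with_timeout(ha_cli, timeout=30):
    # Arm SIGALRM so a call blocked on unreachable shared storage
    # cannot hang the utility forever.
    old_handler = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout)
    try:
        return ha_cli.get_all_host_stats()
    except StorageTimeout:
        raise RuntimeError(
            'Cannot access the shared storage (timed out after %d seconds)'
            % timeout)
    finally:
        signal.alarm(0)  # disarm the timer
        signal.signal(signal.SIGALRM, old_handler)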

Comment 4 Roy Golan 2015-04-26 13:23:46 UTC
Just a note on the reproducer: it doesn't need to be a two-host setup. One host and an iptables rule to drop the packets will suffice.

Comment 5 Nikolai Sednev 2015-04-29 10:59:13 UTC
(In reply to Roy Golan from comment #4)
> Just a note on the reproducer: it doesn't need to be a two-host setup. One
> host and an iptables rule to drop the packets will suffice.

Yep, that's known; it was also tested with a single host, but the second host was required to verify that HA migrates the HE VM properly.

Comment 6 Dudi Maroshi 2015-04-29 11:02:19 UTC
Problem reproduced and confirmed.

Diagnostics:
------------
When a hosted-engine client requests status from the hosted engine and there
is no connection to the storage domain, the client hangs indefinitely, waiting
for a response from the ovirt-ha-broker.

Solution:
---------
The fix is to add a timeout when requesting storage domain information,
in lib.brokerlink.set_storage_domain.
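
For illustration, a sketch of what a connection timeout could look like at
the socket level; the function name and the timeout value are assumptions,
not the merged change:

import socket

BROKER_CONNECTION_TIMEOUT = 30  # seconds; assumed value

def connect_to_broker(socket_path):
    # Use a UNIX stream socket with a timeout so that a broker blocked
    # on an unreachable storage domain cannot stall the client forever.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(BROKER_CONNECTION_TIMEOUT)
    try:
        sock.connect(socket_path)
    except socket.timeout:
        sock.close()
        raise RuntimeError('Connection to ovirt-ha-broker timed out')
    return sock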

Comment 7 Doron Fediuck 2015-08-25 10:36:18 UTC
*** Bug 1085523 has been marked as a duplicate of this bug. ***

Comment 9 Nikolai Sednev 2015-11-05 16:59:20 UTC
Now the previously active host no longer hangs after running "hosted-engine --vm-status" on it, but it takes a few seconds to respond:
# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'iscsi', 'sd_uuid': 'df2356f7-8272-401a-97f7-63c14f37ec7a'}: Connection timed out

After removing the iptables rule, the host responds correctly:
# hosted-engine --vm-status


--== Host 1 status ==--

Status up-to-date                  : True
Hostname                           : alma03.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : d69bf92a
Host timestamp                     : 5992


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : alma04.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 3d75f9e9
Host timestamp                     : 3790
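
With the timeout in place, a caller can treat the storage outage as an
ordinary error path instead of hanging. A minimal sketch; the import paths
follow the traceback above, while the client class name and the handling
itself are illustrative:

from ovirt_hosted_engine_ha.client import client
from ovirt_hosted_engine_ha.lib import exceptions

def print_he_status():
    ha_cli = client.HAClient()
    try:
        return ha_cli.get_all_host_stats()
    except exceptions.RequestError as e:
        # Storage unreachable: report the error instead of blocking.
        print('hosted-engine status unavailable: %s' % e)
        return None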

Comment 11 errata-xmlrpc 2016-03-09 19:49:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-0422.html

