Description of problem:
The active HE hypervisor does not respond to "hosted-engine --vm-status" after running "iptables -I INPUT -s 10.35.160.108 -j DROP".

Version-Release number of selected component (if applicable):
RHEV-H 6.6 20150304.0.el6ev:
sanlock-2.8-1.el6.x86_64
ovirt-node-selinux-3.2.1-9.el6.noarch
ovirt-host-deploy-offline-1.3.0-3.el6ev.x86_64
ovirt-node-plugin-vdsm-0.2.0-19.el6ev.noarch
ovirt-host-deploy-1.3.0-2.el6ev.noarch
ovirt-node-plugin-rhn-3.2.1-9.el6.noarch
ovirt-node-3.2.1-9.el6.noarch
vdsm-4.16.8.1-7.el6ev.x86_64
ovirt-hosted-engine-ha-1.2.5-1.el6ev.noarch
ovirt-node-plugin-hosted-engine-0.2.0-9.0.el6ev.x86_64
ovirt-node-plugin-cim-3.2.1-9.el6.noarch
ovirt-node-branding-rhev-3.2.1-9.el6.noarch
libvirt-0.10.2-46.el6_6.3.x86_64
qemu-kvm-rhev-0.12.1.2-2.446.el6.x86_64
ovirt-hosted-engine-setup-1.2.2-1.el6ev.noarch
ovirt-node-plugin-snmp-3.2.1-9.el6.noarch

Engine (RHEL 6.6):
rhevm-guest-agent-common-1.0.10-2.el6ev.noarch
rhevm-3.5.1-0.2.el6ev.noarch

How reproducible:

Steps to Reproduce:
1. Assemble a setup of two RHEV-H hosts with an NFS storage domain used only for the HE.
2. On the active hypervisor, run "iptables -I INPUT -s 10.35.160.108 -j DROP" (the IP here is your SD IP).
3. Run "hosted-engine --vm-status" on that host and see that it is stuck.

Actual results:
The HE is migrated to the second host, as expected, but "hosted-engine --vm-status" is stuck and returns nothing on the initially active host.

Expected results:
"hosted-engine --vm-status" should reply with the results.

Additional info:
Logs from both hosts and the engine are attached.
Created attachment 1010128 [details] all logs
The status verb reads the current stats from the storage. If storage is blocked, the utility waits for it indefinitely. We can add a timeout to the utility; if it expires, the utility will report that it cannot access the shared storage.
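To make the idea concrete, here is a minimal sketch of such a timeout, assuming a blocking UNIX-socket request to the broker. The socket path, request verb, and 30-second value are illustrative assumptions, not the actual ovirt-hosted-engine-ha wire protocol or defaults:

import socket

# Illustrative path and timeout, not the shipped defaults.
BROKER_SOCKET = "/var/run/ovirt-hosted-engine-ha/broker.socket"
REQUEST_TIMEOUT = 30  # seconds


def query_broker(request, timeout=REQUEST_TIMEOUT):
    # settimeout() applies to connect(), sendall() and recv(), so a
    # broker that is itself blocked on storage raises socket.timeout
    # instead of hanging the client forever.
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(BROKER_SOCKET)
        sock.sendall(request.encode("utf-8"))
        return sock.recv(4096).decode("utf-8")
    finally:
        sock.close()


try:
    print(query_broker("get-stats\n"))  # hypothetical request verb
except socket.timeout:
    print("Cannot access the shared storage (broker request timed out)")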
Just a note on the reproducer: it doesn't need to be a 2-host setup. 1 host and an iptables rule to drop the packets will suffice.
(In reply to Roy Golan from comment #4)
> just a note on the reproducer, it doesn't need to be 2 host setup. 1 host
> and an IPTABLES rule to drop the packets will suffice.

Yep, it's known; it was also tested with a single host, but the second host was required to verify that HA fails the HE VM over properly.
Problem reconstructed and reproduced.

Diagnostics:
------------
When a hosted-engine client requests status from the hosted engine and there is no connection to the storage domain, the client hangs indefinitely, waiting for a response from the hosted-engine broker.

Solution:
---------
Add a timeout to the call that fetches storage domain information, in lib.brokerlink.set_storage_domain.
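A hedged sketch of what that fix could look like. The RequestError name and the message format are taken from the traceback in the verification comment below; the socket plumbing, request encoding, and 30-second value are illustrative assumptions:

import socket


class RequestError(Exception):
    """Stand-in for ovirt_hosted_engine_ha.lib.exceptions.RequestError."""


STORAGE_SETUP_TIMEOUT = 30  # seconds; illustrative, not the shipped default


def set_storage_domain(sock, sd_type, options):
    # `sock` is an already-connected socket to the broker. With the
    # timeout set, a blocked storage domain surfaces as a RequestError
    # ("Connection timed out") instead of an indefinite hang.
    sock.settimeout(STORAGE_SETUP_TIMEOUT)
    try:
        request = "set-storage-domain {0} {1}\n".format(sd_type, options)
        sock.sendall(request.encode("utf-8"))
        return sock.recv(4096).decode("utf-8")
    except socket.timeout as e:
        raise RequestError(
            "Failed to set storage domain {0}, options {1}: {2}"
            .format(sd_type, options, e))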
*** Bug 1085523 has been marked as a duplicate of this bug. ***
The previously active host is no longer stuck after running "hosted-engine --vm-status" on it, but it takes a few seconds to respond:

# hosted-engine --vm-status
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'iscsi', 'sd_uuid': 'df2356f7-8272-401a-97f7-63c14f37ec7a'}: Connection timed out

After removing the iptables rule, the host responds correctly:

# hosted-engine --vm-status

--== Host 1 status ==--

Status up-to-date              : True
Hostname                       : alma03.qa.lab.tlv.redhat.com
Host ID                        : 1
Engine status                  : {"health": "good", "vm": "up", "detail": "up"}
Score                          : 3400
stopped                        : False
Local maintenance              : False
crc32                          : d69bf92a
Host timestamp                 : 5992

--== Host 2 status ==--

Status up-to-date              : True
Hostname                       : alma04.qa.lab.tlv.redhat.com
Host ID                        : 2
Engine status                  : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                          : 3400
stopped                        : False
Local maintenance              : False
crc32                          : 3d75f9e9
Host timestamp                 : 3790
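The traceback shows the RequestError propagating uncaught out of vm_status.py, which is why the user sees a stack trace rather than a message. A client that wants a clean report can catch it; a rough sketch using the module paths named in the traceback above (the error handling itself is an illustrative addition, not the shipped code):

import sys

from ovirt_hosted_engine_ha.client import client
from ovirt_hosted_engine_ha.lib import exceptions


def print_status():
    ha_cli = client.HAClient()
    try:
        stats = ha_cli.get_all_host_stats()
    except exceptions.RequestError as e:
        # Storage unreachable: report it instead of dumping a traceback.
        sys.stderr.write("Cannot read HA status: {0}\n".format(e))
        return False
    for host_id, host_stats in sorted(stats.items()):
        print("Host {0}: {1}".format(host_id, host_stats))
    return True


if __name__ == "__main__":
    sys.exit(0 if print_status() else 1)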
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHEA-2016-0422.html