Description of problem:
Stop glusterd on one of the nodes in the cluster. Once glusterd is stopped, run the hosted-engine vm-status command on any of the nodes. It displays the following Python exception:

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '537b7b03-7ec7-4fd3-bcf5-000d7a5d162f'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Perform a hyper-converged setup by following the Grafton installation doc.
2. Bring down glusterd on one of the nodes.
3. Run the command 'hosted-engine --vm-status'.

Actual results:
The hosted-engine --vm-status command displays a Python exception.

Expected results:
The hosted-engine --vm-status command should not display any exceptions or backtraces.

Additional info:
Please find the sos reports at the link given below:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319631/
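For reference, the traceback above shows that print_status() in vm_status.py lets the broker's RequestError propagate uncaught all the way up to runpy. A minimal sketch of the kind of guard that would turn the backtrace into a readable error (the exception class and the get_all_host_stats() call come straight from the traceback; the rest of the structure is assumed for illustration, not the actual patch):

import sys

from ovirt_hosted_engine_ha.client import client
from ovirt_hosted_engine_ha.lib import exceptions as ha_exceptions


def print_status():
    ha_cli = client.HAClient()
    try:
        all_host_stats = ha_cli.get_all_host_stats()
    except ha_exceptions.RequestError as e:
        # Broker or storage is unreachable: report it instead of crashing.
        sys.stderr.write(
            'Cannot connect to the HA broker/storage: {0}\n'.format(e))
        return False
    for host_id, stats in all_host_stats.items():
        print('Host {0}: {1}'.format(host_id, stats))
    return True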
Does this happen on non-converged setups?
I am not sure about non-converged setups.
This is a general issue, since ovirt-ha-agent basically expects the storage to be up even while a single hypervisor host could be down. hosted-engine --vm-status reports the status of the VM and the status of each involved host. In a hyper-converged scenario we should probably also include some specific information about the status of the gluster volume and bricks used by hosted-engine, since they live on the same hosts we are reporting about.
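As a rough sketch of what that could look like, something along these lines could collect per-brick health for the hosted-engine volume; 'gluster volume status <vol> --xml' is a real command, while the volume name and the exact XML handling are assumptions for illustration, not an agreed design:

import subprocess
import xml.etree.ElementTree as ET


def gluster_brick_status(volume):
    # 'gluster volume status <vol> --xml' reports one <node> element per brick.
    out = subprocess.check_output(
        ['gluster', 'volume', 'status', volume, '--xml'])
    tree = ET.fromstring(out)
    bricks = []
    for node in tree.iter('node'):
        hostname = node.findtext('hostname')
        online = node.findtext('status') == '1'  # '1' means online
        bricks.append((hostname, online))
    return bricks


# Example: summarize the bricks of the 'engine' volume used by hosted-engine.
for host, online in gluster_brick_status('engine'):
    print('{0}: {1}'.format(host, 'up' if online else 'DOWN'))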
You deployed hosted-engine creating its storage domain on rhsqa1.lab.eng.blr.redhat.com:/engine

We know that there is a SPOF issue on the gluster entry point (see https://bugzilla.redhat.com/1298693 ), so ovirt-ha-agent can fail on the two remaining hosts when you bring down rhsqa1.

Here is the issue in the VDSM logs:

Thread-806237::INFO::2016-03-21 14:23:59,690::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=7, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '7926ce79-7846-466a-aa13-5296272b1d24', 'vfs_type': 'glusterfs', 'connection': 'rhsqa1.lab.eng.blr.redhat.com:/engine', 'user': 'kvm'}], options=None)
Thread-806237::ERROR::2016-03-21 14:23:59,794::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 223, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 348, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 335, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 372, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1
Thread-806237::DEBUG::2016-03-21 14:23:59,795::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {537b7b03-7ec7-4fd3-bcf5-000d7a5d162f: storage.glusterSD.findDomain, a39b189a-f7cc-4179-8011-472b39ba3350: storage.glusterSD.findDomain, c93b9e34-2b5f-4a92-9301-70a01f9de94d: storage.glusterSD.findDomain}
Thread-806237::INFO::2016-03-21 14:23:59,795::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 4105, 'id': '7926ce79-7846-466a-aa13-5296272b1d24'}]}

(In reply to RamaKasturi from comment #0)
> stop glusterd in one of the node in the cluster.

Which one did you stop, rhsqa1? If so, this is basically just a duplicate of 1298693, with something to clean up in the output of 'hosted-engine --vm-status' when the storage is not available.
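As a side note on the SPOF itself: the glusterfs FUSE client supports the backup-volfile-servers mount option, which lets the mount fall back to another server for the volfile when the primary entry point is down. A hedged sketch of that mitigation (the rhsqa2/rhsqa3 host names and the mount point are made up for illustration; the real fix is tracked in bug 1298693):

import subprocess

# Mount the hosted-engine volume with fallback volfile servers, so that
# stopping glusterd on rhsqa1 alone does not prevent mounting the volume.
subprocess.check_call([
    'mount', '-t', 'glusterfs',
    '-o', 'backup-volfile-servers='
          'rhsqa2.lab.eng.blr.redhat.com:rhsqa3.lab.eng.blr.redhat.com',
    'rhsqa1.lab.eng.blr.redhat.com:/engine',
    '/mnt/engine',  # example mount point
])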
(In reply to Simone Tiraboschi from comment #5)
> (In reply to RamaKasturi from comment #0)
> > stop glusterd in one of the node in the cluster.
>
> Which one did you stop, rhsqa1? If so, this is basically just a duplicate
> of 1298693, with something to clean up in the output of 'hosted-engine
> --vm-status' when the storage is not available.

Yes, I stopped it on rhsqa1.
OK, so let's close this one as a duplicate of 1298693 and open a new RFE to have hosted-engine --vm-status also show the storage status on hyper-converged setups.

*** This bug has been marked as a duplicate of bug 1298693 ***