Bug 1319631

Summary: [HC] hosted-engine --vm-status displays python exception when glusterd is down in one of the node
Product: [oVirt] ovirt-hosted-engine-ha
Reporter: RamaKasturi <knarra>
Component: Broker
Assignee: Martin Sivák <msivak>
Status: CLOSED DUPLICATE
QA Contact: Ilanit Stein <istein>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 1.3.4.3
CC: bugs, knarra, sabose, sasundar, sbonazzo, stirabos, ylavi
Target Milestone: ---
Flags: sabose: ovirt-3.6.z?
       rule-engine: planning_ack?
       rule-engine: devel_ack?
       rule-engine: testing_ack?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-03-29 08:50:55 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1258386

Description RamaKasturi 2016-03-21 08:46:58 UTC
Description of problem:
Stop glusterd on one of the nodes in the cluster. Once glusterd is stopped, run the 'hosted-engine --vm-status' command on any of the nodes. It displays the following python exception.

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '537b7b03-7ec7-4fd3-bcf5-000d7a5d162f'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Perform a hyper-converged setup by following the Grafton installation doc.
2. Now bring down glusterd on one of the nodes.
3. Run the command 'hosted-engine --vm-status'

Actual results:
The 'hosted-engine --vm-status' command displays a python exception.

Expected results:
The 'hosted-engine --vm-status' command should not display any exceptions / backtraces.

Additional info:
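For illustration, a minimal sketch (not the actual fix) of how the RequestError from the traceback above could be caught in print_status() so the command reports the failure instead of crashing; the HAClient usage is assumed from the module paths in the traceback:

import sys

from ovirt_hosted_engine_ha.client import client
from ovirt_hosted_engine_ha.lib import exceptions as ha_exceptions


def print_status():
    # get_all_host_stats() raises RequestError when the broker cannot
    # set up the storage domain backend (e.g. glusterd is down).
    ha_cli = client.HAClient()
    try:
        all_host_stats = ha_cli.get_all_host_stats()
    except ha_exceptions.RequestError as e:
        sys.stderr.write('Unable to query the hosted-engine storage '
                         'domain: {0}\n'.format(e))
        return False
    # ... format and print all_host_stats as before ...
    return True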

Comment 1 RamaKasturi 2016-03-21 09:32:36 UTC
Please find the sos reports at the link given below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319631/

Comment 2 Yaniv Lavi 2016-03-23 10:54:56 UTC
Does this happen on non-converged setups?

Comment 3 RamaKasturi 2016-03-24 06:33:29 UTC
I am not sure about non-converged setups.

Comment 4 Simone Tiraboschi 2016-03-24 15:19:24 UTC
This is in general an issue, since ovirt-ha-agent basically expects the storage to be up even while a single hypervisor host could be down.
So basically hosted-engine --vm-status reports the status of the VM and the status of each involved host.
In a hyper-converged scenario we should probably also include some specific information about the status of the gluster volume and bricks used by hosted-engine, since they are on the same hosts we are reporting about.
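As an illustration of that idea, a rough sketch of a gluster health summary that could be appended to the --vm-status output; it shells out to 'gluster volume status <vol> --xml', and the volume name 'engine' and the XML parsing details are assumptions, not an actual implementation:

import subprocess
import xml.etree.ElementTree as ET


def gluster_volume_summary(volume='engine'):
    # Query brick status for the volume backing the hosted-engine
    # storage domain; fail cleanly when glusterd is unreachable.
    try:
        out = subprocess.check_output(
            ['gluster', 'volume', 'status', volume, '--xml'])
    except (OSError, subprocess.CalledProcessError):
        return 'gluster volume status unavailable (glusterd down?)'
    bricks = []
    for node in ET.fromstring(out).iter('node'):
        host = node.findtext('hostname')
        online = node.findtext('status') == '1'  # '1' means online
        bricks.append('{0}: {1}'.format(host, 'up' if online else 'DOWN'))
    return ', '.join(bricks)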

Comment 5 Simone Tiraboschi 2016-03-24 16:44:04 UTC
You deployed hosted-engine creating its storage domain on rhsqa1.lab.eng.blr.redhat.com:/engine

We know that there is a SPOF issue on the gluster entry point (see https://bugzilla.redhat.com/1298693 ), so ovirt-ha-agent can fail on the two remaining hosts when you bring down rhsqa1.
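As an aside, the usual client-side mitigation for such an entry-point SPOF is the glusterfs 'backup-volfile-servers' mount option; a hypothetical sketch follows (the backup host names are invented, since only rhsqa1 is named in this report):

import subprocess


def mount_engine_domain(mountpoint):
    # Mount the hosted-engine gluster volume with fallback volfile
    # servers, so losing rhsqa1 does not block fetching the volfile.
    subprocess.check_call([
        'mount', '-t', 'glusterfs',
        '-o', 'backup-volfile-servers=host2.example.com:host3.example.com',
        'rhsqa1.lab.eng.blr.redhat.com:/engine',
        mountpoint,
    ])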

Here is the issue in the VDSM logs:

Thread-806237::INFO::2016-03-21 14:23:59,690::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=7, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '7926ce79-7846-466a-aa13-5296272b1d24', 'vfs_type': 'glusterfs', 'connection': 'rhsqa1.lab.eng.blr.redhat.com:/engine', 'user': 'kvm'}], options=None)
Thread-806237::ERROR::2016-03-21 14:23:59,794::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 223, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 348, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 335, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 372, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1
Thread-806237::DEBUG::2016-03-21 14:23:59,795::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {537b7b03-7ec7-4fd3-bcf5-000d7a5d162f: storage.glusterSD.findDomain, a39b189a-f7cc-4179-8011-472b39ba3350: storage.glusterSD.findDomain, c93b9e34-2b5f-4a92-9301-70a01f9de94d: storage.glusterSD.findDomain}
Thread-806237::INFO::2016-03-21 14:23:59,795::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 4105, 'id': '7926ce79-7846-466a-aa13-5296272b1d24'}]}


(In reply to RamaKasturi from comment #0)
> Stop glusterd on one of the nodes in the cluster.

Which one did you stop, rhsqa1? If so, this is basically just a duplicate of 1298693, with something to clean up in the output of 'hosted-engine --vm-status' when the storage is not available.

Comment 6 RamaKasturi 2016-03-28 18:15:51 UTC
(In reply to Simone Tiraboschi from comment #5)
> Which one did you stop, rhsqa1? If so, this is basically just a duplicate
> of 1298693, with something to clean up in the output of 'hosted-engine
> --vm-status' when the storage is not available.

Yes, I stopped glusterd on rhsqa1.

Comment 7 Simone Tiraboschi 2016-03-29 08:50:55 UTC
OK, so let's close this one as a duplicate of 1298693 and open a new RFE to have hosted-engine --vm-status also show the storage status on hyper-converged setups.

*** This bug has been marked as a duplicate of bug 1298693 ***