Bug 1319631 - [HC] hosted-engine --vm-status displays a Python exception when glusterd is down on one of the nodes
Summary: [HC] hosted-engine --vm-status displays a Python exception when glusterd is down on one of the nodes
Keywords:
Status: CLOSED DUPLICATE of bug 1298693
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 1.3.4.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Martin Sivák
QA Contact: Ilanit Stein
URL:
Whiteboard:
Depends On:
Blocks: Gluster-HC-1
 
Reported: 2016-03-21 08:46 UTC by RamaKasturi
Modified: 2017-05-11 09:25 UTC (History)
7 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-03-29 08:50:55 UTC
oVirt Team: Gluster
Embargoed:
sabose: ovirt-3.6.z?
rule-engine: planning_ack?
rule-engine: devel_ack?
rule-engine: testing_ack?




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1298693 0 medium CLOSED [RFE] - Single point of failure on entry point server deploying hosted-engine over gluster FS 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1321834 0 medium CLOSED [RFE][HC] on hyper-converged setup --vm-status should show also the storage status 2021-02-22 00:41:40 UTC

Internal Links: 1298693 1321834

Description RamaKasturi 2016-03-21 08:46:58 UTC
Description of problem:
Stop glusterd on one of the nodes in the cluster. Once glusterd is stopped, run the hosted-engine --vm-status command on any of the nodes. It displays the following Python exception.

Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 117, in <module>
    if not status_checker.print_status():
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/vm_status.py", line 60, in print_status
    all_host_stats = ha_cli.get_all_host_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 160, in get_all_host_stats
    return self.get_all_stats(self.StatModes.HOST)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
ovirt_hosted_engine_ha.lib.exceptions.RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'glusterfs', 'sd_uuid': '537b7b03-7ec7-4fd3-bcf5-000d7a5d162f'}: Request failed: <class 'ovirt_hosted_engine_ha.lib.storage_backends.BackendFailureException'>


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-1.3.3.4-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Perform a hyperconverged setup by following the Grafton installation doc.
2. Now bring down glusterd on one of the nodes.
3. Run the command 'hosted-engine --vm-status'

Actual results:
The hosted-engine --vm-status command displays a Python exception.

Expected results:
The hosted-engine --vm-status command should not display any exceptions / backtraces.

Additional info:
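For illustration, a minimal sketch of the kind of handling that would turn this failure into one readable message instead of a raw traceback. The RequestError class and the get_all_host_stats() call are taken from the traceback above; the HAClient usage and the message text are assumptions, not the shipped fix.

# Hypothetical sketch: report a broker/storage failure as one readable line.
import sys

from ovirt_hosted_engine_ha.client import client
from ovirt_hosted_engine_ha.lib.exceptions import RequestError


def print_all_host_stats():
    ha_cli = client.HAClient()
    try:
        all_host_stats = ha_cli.get_all_host_stats()
    except RequestError as e:
        # One readable line plus a non-zero exit instead of a backtrace.
        sys.stderr.write(
            'Unable to query host statistics, the hosted-engine storage '
            'domain may be unreachable: {0}\n'.format(e))
        return False
    for host_id, stats in all_host_stats.items():
        print('Host {0}: {1}'.format(host_id, stats))
    return True


if __name__ == '__main__':
    sys.exit(0 if print_all_host_stats() else 1)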

Comment 1 RamaKasturi 2016-03-21 09:32:36 UTC
Please find the sos reports at the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319631/

Comment 2 Yaniv Lavi 2016-03-23 10:54:56 UTC
Does this happen on non-converged setups?

Comment 3 RamaKasturi 2016-03-24 06:33:29 UTC
I am not sure about non-converged setups.

Comment 4 Simone Tiraboschi 2016-03-24 15:19:24 UTC
This is a general issue, since ha-agent basically expects the storage to be up even while a single hypervisor host could be down.
So hosted-engine --vm-status reports the status of the VM and the status of each involved host.
In a hyperconverged scenario we should probably also include some specific information about the status of the gluster volume and bricks used by hosted-engine, since they are on the same hosts we are reporting about.
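For illustration, a minimal sketch of how the brick states of the hosted-engine volume could be collected for such a report. Only 'gluster volume status <VOLNAME> --xml' is the standard gluster CLI call; the volume name 'engine' is taken from this setup, and the exact XML handling is an assumption.

# Hypothetical sketch: list which bricks of the hosted-engine volume are up.
import subprocess
import xml.etree.ElementTree as ET


def get_brick_status(volume='engine'):
    out = subprocess.check_output(
        ['gluster', 'volume', 'status', volume, '--xml'])
    root = ET.fromstring(out)
    bricks = []
    for node in root.iter('node'):
        hostname = node.findtext('hostname')
        path = node.findtext('path')
        status = node.findtext('status')  # '1' means the brick process is up
        # Keep only brick entries (absolute paths), skip NFS/self-heal rows.
        if path and path.startswith('/'):
            bricks.append((hostname, path, status == '1'))
    return bricks


if __name__ == '__main__':
    for host, path, online in get_brick_status():
        print('{0}:{1} -- {2}'.format(host, path, 'up' if online else 'down'))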

Comment 5 Simone Tiraboschi 2016-03-24 16:44:04 UTC
You deployed hosted-engine creating its storage domain on rhsqa1.lab.eng.blr.redhat.com:/engine

We know that there is a SPOF issue on the gluster entry point (see https://bugzilla.redhat.com/1298693 ), so ovirt-ha-agent can fail on the two remaining hosts when you bring down rhsqa1.

Here is the issue in the VDSM logs:

Thread-806237::INFO::2016-03-21 14:23:59,690::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=7, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '7926ce79-7846-466a-aa13-5296272b1d24', 'vfs_type': 'glusterfs', 'connection': 'rhsqa1.lab.eng.blr.redhat.com:/engine', 'user': 'kvm'}], options=None)
Thread-806237::ERROR::2016-03-21 14:23:59,794::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 223, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 348, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 335, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 372, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1
Thread-806237::DEBUG::2016-03-21 14:23:59,795::hsm::2497::Storage.HSM::(connectStorageServer) knownSDs: {537b7b03-7ec7-4fd3-bcf5-000d7a5d162f: storage.glusterSD.findDomain, a39b189a-f7cc-4179-8011-472b39ba3350: storage.glusterSD.findDomain, c93b9e34-2b5f-4a92-9301-70a01f9de94d: storage.glusterSD.findDomain}
Thread-806237::INFO::2016-03-21 14:23:59,795::logUtils::51::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 4105, 'id': '7926ce79-7846-466a-aa13-5296272b1d24'}]}


(In reply to RamaKasturi from comment #0)
> stop glusterd in one of the node in the cluster. 

Which one did you stop, rhsqa1? If so, this is basically just a duplicate of 1298693, with something to clean up in the output of 'hosted-engine --vm-status' when the storage is not available.
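For reference, the usual way to soften that single entry point is to pass backup volfile servers in the gluster mount options when connecting the storage domain. A minimal sketch based on the connection dictionary visible in the VDSM log above; the 'mnt_options' key and the backup host names are assumptions for illustration, not taken from this setup.

# Hypothetical sketch: the connection entry from the VDSM log above, extended
# with glusterfs backup volfile servers so mounting does not depend on rhsqa1
# alone. The backup host names below are placeholders.
connection = {
    'id': '7926ce79-7846-466a-aa13-5296272b1d24',
    'vfs_type': 'glusterfs',
    'connection': 'rhsqa1.lab.eng.blr.redhat.com:/engine',
    'user': 'kvm',
    # Let the glusterfs client fetch the volfile from another peer if needed:
    'mnt_options': 'backup-volfile-servers=<host2>:<host3>',
}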

Comment 6 RamaKasturi 2016-03-28 18:15:51 UTC
(In reply to Simone Tiraboschi from comment #5)
> [...]
> Which one did you stop, rhsqa1? If so, this is basically just a duplicate
> of 1298693, with something to clean up in the output of 'hosted-engine
> --vm-status' when the storage is not available.

Yes, I stopped it on rhsqa1.

Comment 7 Simone Tiraboschi 2016-03-29 08:50:55 UTC
OK, so let's close this one as a duplicate of 1298693 and open a new RFE to have hosted-engine --vm-status also show the storage status on hyperconverged setups.

*** This bug has been marked as a duplicate of bug 1298693 ***

