Bug 1412991 - [RHSC]: One /bad/ host in console makes all the other hosts unreachable/non-responsive.
Summary: [RHSC]: One /bad/ host in console makes all the other hosts unreachable/non-responsive.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhsc
Version: rhgs-3.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Sahina Bose
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-01-13 10:53 UTC by Sweta Anandpara
Modified: 2018-10-24 06:12 UTC
CC: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-10-24 06:12:13 UTC
Embargoed:



Description Sweta Anandpara 2017-01-13 10:53:52 UTC
Description of problem:
=====================
Hit this issue on an RHGS-Console management node that managed 3 clusters:
Cluster1: 4 nodes, 3.1.3 build
Cluster2: 4 nodes, 3.1.3 build
Cluster3: 6 nodes, 3.2 interim build (3.8.4-10)

Raised BZ 1412982, which describes the issue faced in Cluster3. When Cluster3 was not functioning at full health, we moved all of its hosts to maintenance from the Console. But as soon as one (or all) of those 6 hosts was activated, all the hosts of Cluster1 and Cluster2 would become unresponsive or unreachable. When the hosts of Cluster3 were moved back to maintenance, the hosts of Cluster1 and Cluster2 would come back up on their own.
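For reference, the same maintenance/activate cycle can also be driven outside the Console UI. The commands below are only a sketch: they assume RHSC exposes the standard oVirt-style REST API under /api, and the engine URL, credentials and host UUID are placeholders (none of these values are taken from this setup).

# Sketch only -- ENGINE, the credentials and HOST_ID are placeholders, and the
# /api/hosts activate/deactivate actions are assumed to match the usual oVirt API.
ENGINE=https://<console-node>
HOST_ID=<uuid-of-a-cluster3-host>     # taken from the GET /api/hosts output

# Check current host statuses (which hosts show up as non-responsive)
curl -k -u admin@internal:<password> -H 'Content-Type: application/xml' "$ENGINE/api/hosts"

# Move a Cluster3 host to maintenance (same as the Console action)
curl -k -u admin@internal:<password> -H 'Content-Type: application/xml' \
     -X POST -d '<action/>' "$ENGINE/api/hosts/$HOST_ID/deactivate"

# Activate it again -- the step after which Cluster1/Cluster2 hosts went unreachable
curl -k -u admin@internal:<password> -H 'Content-Type: application/xml' \
     -X POST -d '<action/>' "$ENGINE/api/hosts/$HOST_ID/activate"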

Things to note:
* I had 16 nodes managed from the single console. I am not sure whether we recommend a limit on the number of hosts.
* I had a mix of two gluster versions managed from the same node, albeit in different clusters - 3.1.3 and 3.2. Do we forbid our customers from having this kind of setup?
* I had a mix of RHEL versions - Cluster1 was RHEL6 and Cluster2 was RHEL7. Again, I do not think that should have been a problem. 

ovirt-engine logs will be posted at http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/<bugnumber>/
Meanwhile, I will update this space if I hit this again, with more information if I can gather it.
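Once the logs are downloaded, a rough way to line up the Cluster3 activation with the other clusters going non-responsive is to grep engine.log for host-state messages. The patterns below are an assumption about the usual oVirt wording and may need adjusting for this build.

# Assumption: engine.log carries the usual oVirt host-state audit messages; the
# patterns are only a starting point, not exact strings from this version.
zgrep -ihE 'not responding|maintenance|activat' engine.log engine.log-2017011*.gz | sort | less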

Version-Release number of selected component (if applicable):
===========================================================
[root@dhcp35-51 ovirt-engine]# rpm -qa | grep gluster
gluster-nagios-common-0.2.4-1.el6rhs.noarch
[root@dhcp35-51 ovirt-engine]# rpm -qa | grep vdsm
vdsm-jsonrpc-java-1.0.15-1.el6ev.noarch
[root@dhcp35-51 ovirt-engine]# rpm -qa | grep rhsc
rhsc-log-collector-3.1.0-1.0.el6rhs.noarch
rhsc-branding-rhs-3.1.2-0.el6rhs.noarch
rhsc-setup-plugin-ovirt-engine-3.1.3-0.73.el6.noarch
rhsc-setup-plugins-3.1.3-1.el6rhs.noarch
rhsc-extensions-api-impl-3.1.3-0.73.el6.noarch
redhat-access-plugin-rhsc-3.1.3-0.el6.noarch
rhsc-cli-3.0.0.0-0.2.el6rhs.noarch
rhsc-setup-plugin-ovirt-engine-common-3.1.3-0.73.el6.noarch
rhsc-webadmin-portal-3.1.3-0.73.el6.noarch
rhsc-tools-3.1.3-0.73.el6.noarch
rhsc-sdk-python-3.0.0.0-0.2.el6rhs.noarch
rhsc-monitoring-uiplugin-0.2.4-1.el6rhs.noarch
rhsc-setup-base-3.1.3-0.73.el6.noarch
rhsc-dbscripts-3.1.3-0.73.el6.noarch
rhsc-backend-3.1.3-0.73.el6.noarch
rhsc-doc-3.1.2-0.el6eng.noarch
rhsc-restapi-3.1.3-0.73.el6.noarch
rhsc-3.1.3-0.73.el6.noarch
rhsc-lib-3.1.3-0.73.el6.noarch
rhsc-setup-3.1.3-0.73.el6.noarch
[root@dhcp35-51 ovirt-engine]#
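The package list above is from the Console node only. A quick way to capture the corresponding glusterfs/vdsm versions from every managed host is a loop like the one below; the hostnames are placeholders and passwordless ssh as root is assumed.

# Placeholder hostnames; adjust to the actual Cluster1/2/3 node names.
for h in cluster1-node{1..4} cluster2-node{1..4} cluster3-node{1..6}; do
    echo "== $h =="
    ssh root@"$h" 'rpm -qa | grep -E "glusterfs|vdsm"'
done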


How reproducible:
===============
Hit it once

Comment 2 Sweta Anandpara 2017-01-13 10:57:01 UTC
[qe@rhsqe-repo 1412991]$ hostname
rhsqe-repo.lab.eng.blr.redhat.com
[qe@rhsqe-repo 1412991]$ 
[qe@rhsqe-repo 1412991]$ 
[qe@rhsqe-repo 1412991]$ ls
ovirt-engine
[qe@rhsqe-repo 1412991]$ 
[qe@rhsqe-repo 1412991]$ pwd
/home/repo/sosreports/1412991
[qe@rhsqe-repo 1412991]$ 
[qe@rhsqe-repo 1412991]$ 
[qe@rhsqe-repo 1412991]$ ls -lrt ovirt-engine/
total 44004
-rwxr-xr-x. 1 qe qe   290838 Jan 13 16:18 server.log
drwxr-xr-x. 2 qe qe     4096 Jan 13 16:18 ovirt-log-collector
-rwxr-xr-x. 1 qe qe  3572970 Jan 13 16:18 engine.log-20170112.gz
-rwxr-xr-x. 1 qe qe  3028435 Jan 13 16:18 engine.log-20170101.gz
-rwxr-xr-x. 1 qe qe 16712194 Jan 13 16:18 engine.log
-rwxr-xr-x. 1 qe qe  1028842 Jan 13 16:18 engine.log-20170110.gz
drwxr-xr-x. 2 qe qe     4096 Jan 13 16:18 dump
-rwxr-xr-x. 1 qe qe  4298763 Jan 13 16:18 engine.log-20170103.gz
drwxr-xr-x. 2 qe qe     4096 Jan 13 16:18 host-deploy
-rwxr-xr-x. 1 qe qe     1565 Jan 13 16:18 boot.log
-rwxr-xr-x. 1 qe qe  2540303 Jan 13 16:18 engine.log-20170113.gz
-rwxr-xr-x. 1 qe qe  4354855 Jan 13 16:18 engine.log-20170102.gz
-rwxr-xr-x. 1 qe qe  1854237 Jan 13 16:18 engine.log-20170109.gz
-rwxr-xr-x. 1 qe qe        0 Jan 13 16:18 console.log
-rwxr-xr-x. 1 qe qe  3032538 Jan 13 16:18 engine.log-20170111.gz
drwxr-xr-x. 2 qe qe     4096 Jan 13 16:18 setup
drwxr-xr-x. 2 qe qe     4096 Jan 13 16:18 notifier
-rwxr-xr-x. 1 qe qe  4288885 Jan 13 16:18 engine.log-20170104.gz
[qe@rhsqe-repo 1412991]$
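For reference, the directory above corresponds to the URL mentioned in the description, with the bug number filled in. Assuming the sosreports tree is exported over plain HTTP, the logs can be pulled with a recursive wget.

# Assumption: the repo is served over HTTP at the path shown above; -np keeps
# wget from climbing out of the bug's directory.
wget -r -np http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1412991/ovirt-engine/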

Comment 3 Ramesh N 2017-01-13 11:29:55 UTC
Sweta: We need a sequence of steps to reproduce this bug. Otherwise it is impossible to understand what is happening in the system.

Comment 8 Sahina Bose 2018-10-24 06:12:13 UTC
Closing, as there are no further enhancements planned for RHGS-C.

