Bug 1519201 - WA doesn't reflect that all gluster nodes are down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Anmol Sachan
QA Contact: Filip Balák
URL:
Whiteboard:
Duplicates: 1583724 1583727
Depends On: 1516845
Blocks: 1503134
 
Reported: 2017-11-30 11:24 UTC by Martin Kudlej
Modified: 2018-09-04 07:01 UTC
10 users

Fixed In Version: tendrl-node-agent-1.6.3-7.el7rhgs tendrl-monitoring-integration-1.6.3-5.el7rhgs tendrl-gluster-integration-1.6.3-5.el7rhgs
Doc Type: Known Issue
Doc Text:
When the entire gluster cluster goes down because the hosts go down simultaneously, the WA dashboard only displays information about the cluster and the nodes being unhealthy. It does not provide detailed information about the health of the bricks and volumes.
Clone Of:
Environment:
Last Closed: 2018-09-04 07:00:31 UTC
Embargoed:


Attachments
gl1 is down (46.12 KB, image/png)
2017-11-30 11:24 UTC, Martin Kudlej
no flags Details
gl1 is up (66.65 KB, image/png)
2017-11-30 11:29 UTC, Martin Kudlej
no flags Details
some charts don't reflect status of nodes (115.70 KB, image/png)
2017-11-30 11:31 UTC, Martin Kudlej
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github Tendrl commons issues 979 0 None None None 2018-05-29 15:16:40 UTC
Github Tendrl commons issues 988 0 None None None 2018-06-14 22:59:23 UTC
Github Tendrl gluster-integration issues 659 0 None None None 2018-06-14 22:59:58 UTC
Github Tendrl node-agent issues 714 0 None None None 2018-01-18 08:33:40 UTC
Github Tendrl node-agent issues 820 0 None None None 2018-05-29 15:24:49 UTC
Red Hat Bugzilla 1508041 0 unspecified CLOSED 5 from 6 nodes are down and some chart don't reflect it 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2018:2616 0 None None None 2018-09-04 07:01:24 UTC

Internal Links: 1508041

Description Martin Kudlej 2017-11-30 11:24:44 UTC
Created attachment 1360864 [details]
gl1 is down

Description of problem:
This bug is probably related to bug 1508041; I found it while testing bug 1517468.

See screenshots.

Version-Release number of selected component (if applicable):
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-3.el7rhgs.noarch
tendrl-api-httpd-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-8.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install and set up WA and gluster, import the gluster cluster into WA, and wait for about an hour.
2. Shut down all gluster nodes (this can be a real situation in the case of a major failure).
3. After about 30 minutes (to be completely sure that the displayed data has settled), check the Grafana dashboards and the WA UI.
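Step 2 can be scripted so that all nodes go down in quick succession. A minimal sketch of such a helper; the host names, root SSH access, and the function names are assumptions for illustration, not part of this report:

```python
# Hypothetical reproduction helper: host list and ssh details are
# assumptions, not taken from the original report.
import subprocess

GLUSTER_NODES = ["gl1", "gl2", "gl3", "gl4"]  # example inventory

def shutdown_commands(nodes):
    """Build the ssh commands that power off every gluster node."""
    return [["ssh", f"root@{n}", "shutdown", "-h", "now"] for n in nodes]

def shut_down_all(nodes, run=subprocess.run):
    # Fire the commands back to back so the whole cluster goes down
    # together, matching the "big failure" scenario in step 2.
    for cmd in shutdown_commands(nodes):
        run(cmd, check=False)
```

After running it, wait about 30 minutes and then inspect the Grafana dashboards and the WA UI as in step 3.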

Actual results:
Data shown in Grafana and the WA UI is not correct and does not reflect reality. Some information in the UI is inconsistent: in one list node "gl1" is down, while in another list node "gl1" is up.

Expected results:
All status-related charts should be red, and alerts should be raised about this situation.

Comment 1 Martin Kudlej 2017-11-30 11:29:15 UTC
Created attachment 1360866 [details]
gl1 is up

Comment 2 Martin Kudlej 2017-11-30 11:31:39 UTC
Created attachment 1360867 [details]
some charts don't reflect status of nodes

Comment 4 Martin Kudlej 2017-11-30 11:48:44 UTC
I hadn't captured screenshots in advance because I didn't expect to find a new bug.

I've run "shutdown -h now" on all Gluster nodes.

Comment 5 Nishanth Thomas 2017-12-06 12:40:25 UTC
I am not able to reproduce this issue with the latest builds. After the reboot I could see that the host status information is up to date in the Tendrl UI and the Grafana dashboard.
That said, there are issues with updating volumes, bricks, etc. on the Grafana dashboard when all the nodes are shut down (not discussed as part of this bug). This is because all of the agents responsible for these updates run on the nodes and are therefore down as well. This needs to be tackled differently, and I don't think it can be taken in for this release. Also, this scenario is very rare in a production environment. Even if it happens, the host-down status is correctly indicated on the dashboard, which is a good enough indication for the administrator to take action.

Having discussed it with QE (Sweta), it has been agreed to document this bug as a known issue for this release.
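The staleness described above, where the agents that report brick and volume health die together with the nodes, is typically mitigated by having a central service expire agent heartbeats. A minimal sketch of that general idea; the class and parameter names (`HeartbeatTracker`, `ttl_seconds`) are invented for illustration and are not tendrl's actual API:

```python
# Illustrative only: tendrl's real fix lives in node-agent and
# monitoring-integration; the names below are invented for this sketch.
import time

class HeartbeatTracker:
    """Mark a node Down once its agent heartbeat expires, so a central
    monitor can flag status even when every agent dies at the same time."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}

    def beat(self, node):
        # Called periodically by each node-agent while it is alive.
        self.last_seen[node] = self.clock()

    def status(self, node):
        # Called on the monitoring side; needs no cooperation from the node.
        seen = self.last_seen.get(node)
        if seen is None or self.clock() - seen > self.ttl:
            return "Down"
        return "Up"
```

Because `status()` is evaluated centrally against a timeout rather than pushed by the nodes, it degrades correctly when the whole cluster is powered off at once.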

Comment 10 Nishanth Thomas 2017-12-15 12:35:57 UTC
Updated, please check.

Comment 18 Anmol Sachan 2018-05-29 15:01:19 UTC
*** Bug 1583724 has been marked as a duplicate of this bug. ***

Comment 19 Anmol Sachan 2018-05-29 15:02:59 UTC
*** Bug 1583727 has been marked as a duplicate of this bug. ***

Comment 20 Filip Balák 2018-06-07 11:35:12 UTC
I tested the scenario several times. Current status:
 * All nodes on the Hosts page are Down, as expected.
 * At least one node (the last one shut down) remains Up in Grafana.
 * The volume disappears from the UI (BZ 1588436).
 * Not all bricks are Down in the UI and in Grafana.
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 21 Nishanth Thomas 2018-06-13 08:20:47 UTC
@anmol, please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1588436#c8

Comment 23 Filip Balák 2018-07-03 11:37:11 UTC
Looks ok. All status panels in Grafana and the UI reflect the status of hosts and bricks correctly, and alerts are raised. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 26 errata-xmlrpc 2018-09-04 07:00:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

