Bug 1519201

Summary: WA doesn't reflect that all gluster nodes are down
Product: [Red Hat Storage] Red Hat Gluster Storage Reporter: Martin Kudlej <mkudlej>
Component: web-admin-tendrl-monitoring-integration Assignee: Anmol Sachan <asachan>
Status: CLOSED ERRATA QA Contact: Filip Balák <fbalak>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: rhgs-3.3 CC: asachan, asriram, fbalak, mkudlej, nthomas, rhs-bugs, sanandpa, sankarshan, srmukher, ssaha
Target Milestone: ---   
Target Release: RHGS 3.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tendrl-node-agent-1.6.3-7.el7rhgs tendrl-monitoring-integration-1.6.3-5.el7rhgs tendrl-gluster-integration-1.6.3-5.el7rhgs Doc Type: Known Issue
Doc Text:
When the entire gluster cluster goes down because the hosts go down simultaneously, the WA dashboard only displays information about the cluster and the nodes being unhealthy. It does not provide detailed information about the health of the bricks and volumes.
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-09-04 07:00:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1516845    
Bug Blocks: 1503134    
Attachments:
Description                                  Flags
gl1 is down                                  none
gl1 is up                                    none
some charts don't reflect status of nodes    none

Description Martin Kudlej 2017-11-30 11:24:44 UTC
Created attachment 1360864 [details]
gl1 is down

Description of problem:
This bug is probably related to bug 1508041. I found it while testing bug 1517468.

See screenshots.

Version-Release number of selected component (if applicable):
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-3.el7rhgs.noarch
tendrl-api-httpd-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-8.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install and set up WA and Gluster, import the Gluster cluster into WA, and wait about an hour.
2. Shut down all Gluster nodes (this can happen in a real deployment after a major failure).
3. After about 30 minutes (to be completely sure that the displayed data has settled), check the Grafana dashboards and the WA UI.

Actual results:
Data shown in Grafana and the WA UI is incorrect and does not reflect reality. The UI is also inconsistent with itself: one list shows node "gl1" as down while another shows node "gl1" as up.

Expected results:
All status-related charts are red and alerts are raised for this situation.

Comment 1 Martin Kudlej 2017-11-30 11:29:15 UTC
Created attachment 1360866 [details]
gl1 is up

Comment 2 Martin Kudlej 2017-11-30 11:31:39 UTC
Created attachment 1360867 [details]
some charts don't reflect status of nodes

Comment 4 Martin Kudlej 2017-11-30 11:48:44 UTC
I didn't request a screenshot because I didn't expect to find a new bug.

I've run "shutdown -h now" on all Gluster nodes.

Comment 5 Nishanth Thomas 2017-12-06 12:40:25 UTC
I am not able to reproduce this issue with the latest builds. After the reboot, the host status information is up to date in the Tendrl UI and the Grafana dashboard.
That said, there are issues with updates of volumes, bricks, etc. on the Grafana dashboard (not discussed as part of this bug) when all the nodes are shut down. This is because all the agents responsible for these updates run on the nodes and are therefore down as well. This needs to be tackled differently, and I don't think it can be taken in for this release. This scenario is also very rare in a production environment. Even if it happens, the host down status is correctly indicated on the dashboard, which is a good enough indication for the administrator to take action.
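
One possible way to handle the case described above, where no agent is left running to report state, is a server-side staleness check: any node whose last report is older than a timeout is treated as down, and its bricks inherit that state. The following is only a sketch under that assumption. The names `NodeReport`, `effective_brick_status` and `STALE_AFTER_SECONDS`, and the 120 second timeout, are hypothetical and are not taken from Tendrl.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical timeout for illustration; not a Tendrl setting.
STALE_AFTER_SECONDS = 120.0


@dataclass
class NodeReport:
    """Last state reported by the agent on one node (hypothetical model)."""
    last_seen: float                                       # UNIX time of the last report
    bricks: Dict[str, str] = field(default_factory=dict)   # brick path -> last reported status


def effective_brick_status(reports: Dict[str, NodeReport], now: float,
                           stale_after: float = STALE_AFTER_SECONDS) -> Dict[str, str]:
    """Return brick statuses, replacing stale reports with "down"."""
    result = {}
    for node, report in reports.items():
        # A node that has not reported within the timeout is assumed to be down.
        stale = (now - report.last_seen) > stale_after
        for brick, status in report.bricks.items():
            result[f"{node}:{brick}"] = "down" if stale else status
    return result
```

With a check of this kind, the last value written by an agent before shutdown would not be shown as current once the node stops reporting.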

After discussing it with QE (Sweta), we agreed to document this bug as a known_issue for this release.

Comment 10 Nishanth Thomas 2017-12-15 12:35:57 UTC
Updated, pls check

Comment 18 Anmol Sachan 2018-05-29 15:01:19 UTC
*** Bug 1583724 has been marked as a duplicate of this bug. ***

Comment 19 Anmol Sachan 2018-05-29 15:02:59 UTC
*** Bug 1583727 has been marked as a duplicate of this bug. ***

Comment 20 Filip Balák 2018-06-07 11:35:12 UTC
I tested the scenario several times. The current status:
 * All nodes on the Hosts page are Down, as expected.
 * At least one node (the last one shut down) remains Up in Grafana.
 * The volume disappears from the UI (BZ 1588436).
 * Not all bricks are Down in the UI and in Grafana.
--> ASSIGNED
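
The comparison behind the findings above can be sketched as a small helper that reports every entity whose status differs between the two views. This is an illustration only: the status dictionaries are assumed to be collected by hand or by a test harness, and `find_mismatches` is a hypothetical helper, not part of Tendrl or its test suite.

```python
from typing import Dict, List, Tuple


def find_mismatches(ui_status: Dict[str, str],
                    grafana_status: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (name, ui_value, grafana_value) for every entity the views disagree on."""
    mismatches = []
    for name in sorted(set(ui_status) | set(grafana_status)):
        # An entity absent from one view is reported as "missing" there.
        ui = ui_status.get(name, "missing")
        grafana = grafana_status.get(name, "missing")
        if ui != grafana:
            mismatches.append((name, ui, grafana))
    return mismatches
```

An empty result means both views agree. It does not by itself show that the shared status is correct, so it would still need to be compared against the actual state of the nodes.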

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 21 Nishanth Thomas 2018-06-13 08:20:47 UTC
@anmol, please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1588436#c8

Comment 23 Filip Balák 2018-07-03 11:37:11 UTC
Looks ok. All status panels in Grafana and UI reflect the status of hosts and bricks correctly and alerts are raised. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 26 errata-xmlrpc 2018-09-04 07:00:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616