Bug 1519201 - WA doesn't reflect that all gluster nodes are down
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: web-admin-tendrl-monitoring-integration
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Anmol Sachan
QA Contact: Filip Balák
URL:
Whiteboard:
Duplicates: 1583724 1583727
Depends On: 1516845
Blocks: 1503134
 
Reported: 2017-11-30 11:24 UTC by Martin Kudlej
Modified: 2018-09-04 07:01 UTC
10 users

Fixed In Version: tendrl-node-agent-1.6.3-7.el7rhgs tendrl-monitoring-integration-1.6.3-5.el7rhgs tendrl-gluster-integration-1.6.3-5.el7rhgs
Doc Type: Known Issue
Doc Text:
When the entire gluster cluster goes down because the hosts go down simultaneously, the WA dashboard only displays information about the cluster and the nodes being unhealthy. It does not provide detailed information about the health of the bricks and volumes.
Clone Of:
Environment:
Last Closed: 2018-09-04 07:00:31 UTC
Embargoed:


Attachments
gl1 is down (46.12 KB, image/png)
2017-11-30 11:24 UTC, Martin Kudlej
no flags Details
gl1 is up (66.65 KB, image/png)
2017-11-30 11:29 UTC, Martin Kudlej
no flags Details
some charts don't reflect status of nodes (115.70 KB, image/png)
2017-11-30 11:31 UTC, Martin Kudlej
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github Tendrl commons issues 979 0 None None None 2018-05-29 15:16:40 UTC
Github Tendrl commons issues 988 0 None None None 2018-06-14 22:59:23 UTC
Github Tendrl gluster-integration issues 659 0 None None None 2018-06-14 22:59:58 UTC
Github Tendrl node-agent issues 714 0 None None None 2018-01-18 08:33:40 UTC
Github Tendrl node-agent issues 820 0 None None None 2018-05-29 15:24:49 UTC
Red Hat Bugzilla 1508041 0 unspecified CLOSED 5 from 6 nodes are down and some chart don't reflect it 2021-02-22 00:41:40 UTC
Red Hat Product Errata RHSA-2018:2616 0 None None None 2018-09-04 07:01:24 UTC

Internal Links: 1508041

Description Martin Kudlej 2017-11-30 11:24:44 UTC
Created attachment 1360864 [details]
gl1 is down

Description of problem:
This bug is probably related to bug 1508041; I found it while testing bug 1517468.

See screenshots.

Version-Release number of selected component (if applicable):
etcd-3.2.7-1.el7.x86_64
glusterfs-3.8.4-52.el7_4.x86_64
glusterfs-client-xlators-3.8.4-52.el7_4.x86_64
glusterfs-fuse-3.8.4-52.el7_4.x86_64
glusterfs-libs-3.8.4-52.el7_4.x86_64
python-etcd-0.4.5-1.el7rhgs.noarch
rubygem-etcd-0.3.0-1.el7rhgs.noarch
tendrl-ansible-1.5.4-2.el7rhgs.noarch
tendrl-api-1.5.4-3.el7rhgs.noarch
tendrl-api-httpd-1.5.4-3.el7rhgs.noarch
tendrl-commons-1.5.4-5.el7rhgs.noarch
tendrl-grafana-plugins-1.5.4-8.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-1.el7rhgs.noarch
tendrl-monitoring-integration-1.5.4-8.el7rhgs.noarch
tendrl-node-agent-1.5.4-8.el7rhgs.noarch
tendrl-notifier-1.5.4-5.el7rhgs.noarch
tendrl-selinux-1.5.4-1.el7rhgs.noarch
tendrl-ui-1.5.4-4.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install and set up WA and gluster, import the gluster cluster into WA, and wait for about an hour.
2. Shut down all gluster nodes (this can be a real situation in the case of a major failure).
3. After about 30 minutes (to be completely sure that the displayed data has settled), check the Grafana dashboards and the WA UI.
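Step 2 can be scripted so that all nodes go down in quick succession. A minimal sketch of such a helper; the host names, root SSH access, and the function names are assumptions for illustration, not part of this report:

```python
# Hypothetical reproduction helper: host list and ssh details are
# assumptions, not taken from the original report.
import subprocess

GLUSTER_NODES = ["gl1", "gl2", "gl3", "gl4"]  # example inventory

def shutdown_commands(nodes):
    """Build the ssh commands that power off every gluster node."""
    return [["ssh", f"root@{n}", "shutdown", "-h", "now"] for n in nodes]

def shut_down_all(nodes, run=subprocess.run):
    # Fire the commands back to back so the whole cluster goes down
    # together, matching the "big failure" scenario in step 2.
    for cmd in shutdown_commands(nodes):
        run(cmd, check=False)
```

After running it, wait about 30 minutes and then inspect the Grafana dashboards and the WA UI as in step 3.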

Actual results:
Data shown in Grafana and the WA UI is not correct and does not reflect reality. Some information in the UI is inconsistent: in one list node "gl1" is down, while in another list node "gl1" is up.

Expected results:
All status-related charts should be red, and alerts should be raised about this situation.

Comment 1 Martin Kudlej 2017-11-30 11:29:15 UTC
Created attachment 1360866 [details]
gl1 is up

Comment 2 Martin Kudlej 2017-11-30 11:31:39 UTC
Created attachment 1360867 [details]
some charts don't reflect status of nodes

Comment 4 Martin Kudlej 2017-11-30 11:48:44 UTC
I hadn't captured screenshots in advance because I didn't expect to find a new bug.

I've run "shutdown -h now" on all Gluster nodes.

Comment 5 Nishanth Thomas 2017-12-06 12:40:25 UTC
I am not able to reproduce this issue with the latest builds. After the reboot I could see that the host status information is up to date in the Tendrl UI and the Grafana dashboard.
That said, there are issues with updating volumes, bricks, etc. on the Grafana dashboard when all the nodes are shut down (not discussed as part of this bug). This is because all of the agents responsible for these updates run on the nodes and are therefore down as well. This needs to be tackled differently, and I don't think it can be taken in for this release. Also, this scenario is very rare in a production environment. Even if it happens, the host-down status is correctly indicated on the dashboard, which is a good enough indication for the administrator to take action.

Having discussed it with QE (Sweta), it has been agreed to document this bug as a known issue for this release.
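The staleness described above, where the agents that report brick and volume health die together with the nodes, is typically mitigated by having a central service expire agent heartbeats. A minimal sketch of that general idea; the class and parameter names (`HeartbeatTracker`, `ttl_seconds`) are invented for illustration and are not tendrl's actual API:

```python
# Illustrative only: tendrl's real fix lives in node-agent and
# monitoring-integration; the names below are invented for this sketch.
import time

class HeartbeatTracker:
    """Mark a node Down once its agent heartbeat expires, so a central
    monitor can flag status even when every agent dies at the same time."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.last_seen = {}

    def beat(self, node):
        # Called periodically by each node-agent while it is alive.
        self.last_seen[node] = self.clock()

    def status(self, node):
        # Called on the monitoring side; needs no cooperation from the node.
        seen = self.last_seen.get(node)
        if seen is None or self.clock() - seen > self.ttl:
            return "Down"
        return "Up"
```

Because `status()` is evaluated centrally against a timeout rather than pushed by the nodes, it degrades correctly when the whole cluster is powered off at once.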

Comment 10 Nishanth Thomas 2017-12-15 12:35:57 UTC
Updated, please check.

Comment 18 Anmol Sachan 2018-05-29 15:01:19 UTC
*** Bug 1583724 has been marked as a duplicate of this bug. ***

Comment 19 Anmol Sachan 2018-05-29 15:02:59 UTC
*** Bug 1583727 has been marked as a duplicate of this bug. ***

Comment 20 Filip Balák 2018-06-07 11:35:12 UTC
I tested the scenario several times. Current status:
 * All nodes on the Hosts page are Down, as expected.
 * At least one node (the last one shut down) remains Up in Grafana.
 * The volume disappears from the UI (BZ 1588436).
 * Not all bricks are Down in the UI and in Grafana.
--> ASSIGNED

Tested with:
tendrl-ansible-1.6.3-4.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-6.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-4.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-4.el7rhgs.noarch
tendrl-node-agent-1.6.3-6.el7rhgs.noarch
tendrl-notifier-1.6.3-3.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-3.el7rhgs.noarch

Comment 21 Nishanth Thomas 2018-06-13 08:20:47 UTC
@anmol, please take a look at https://bugzilla.redhat.com/show_bug.cgi?id=1588436#c8

Comment 23 Filip Balák 2018-07-03 11:37:11 UTC
Looks ok. All status panels in Grafana and the UI reflect the status of hosts and bricks correctly, and alerts are raised. --> VERIFIED

Tested with:
tendrl-ansible-1.6.3-5.el7rhgs.noarch
tendrl-api-1.6.3-3.el7rhgs.noarch
tendrl-api-httpd-1.6.3-3.el7rhgs.noarch
tendrl-commons-1.6.3-7.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-5.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-5.el7rhgs.noarch
tendrl-node-agent-1.6.3-7.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-4.el7rhgs.noarch

Comment 26 errata-xmlrpc 2018-09-04 07:00:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2616

