Description of problem:

When all gluster* daemons on all storage nodes are stopped (or crash), WA doesn't reflect this state properly:

* The cluster is correctly moved to unhealthy state, but without any clearer explanation of the reason.
* All volumes disappear from the WA Volumes page, but they remain in Grafana and are reported as Up.
* All bricks are reported as Up/Started in Grafana.

Version-Release number of selected component (if applicable):

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
carbon-selinux-1.5.4-2.el7rhgs.noarch
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-carbon-0.9.15-2.1.el7rhgs.noarch
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install and configure a Gluster Storage Cluster with at least one volume.
2. Install and configure RHGS WA.
3. Import the Gluster Storage Cluster into RHGS WA.
4. Wait a few minutes so that all the required information is synchronized.
5. Stop/kill all gluster processes on all storage nodes:
   # systemctl stop 'gluster*'
   # pkill glusterfs
6. Wait a few minutes and watch various pages of RHGS WA and Grafana.

Actual results:
* The following alert is raised: "Cluster .... moved to unhealthy state"
* The cluster is in Unhealthy state, both on the WA Clusters page and in Grafana.
* All volumes disappear from the WA Volumes page.
* All bricks are marked as Up/Started in Grafana.

Expected results:
* It should be clearly visible what the real problem is and that the impact of the issue on the Gluster side is huge.
* Volumes probably shouldn't disappear from WA and should be marked as down both in WA and on the Grafana dashboard.
* Bricks shouldn't be reported as Up/Started.

Additional info:
See attached screenshots.
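For reference, a quick check like the following can confirm on each storage node that step 5 really left no gluster processes behind before the monitoring pages are inspected (a minimal sketch; the process names assume a default RHGS install, and the script is not part of the reproduction itself):

```shell
#!/bin/sh
# Run on a storage node after "systemctl stop 'gluster*'" and
# "pkill glusterfs" to verify that no gluster daemons or brick
# processes survived.
if pgrep -x glusterd >/dev/null 2>&1 || pgrep glusterfs >/dev/null 2>&1; then
    echo "gluster processes still running:"
    pgrep -a gluster
else
    echo "no gluster processes found"
fi
```

If any process is listed, the WA/Grafana state observed afterwards would not correspond to a fully stopped cluster, so this check helps rule out a half-applied step 5.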
Created attachment 1477107 [details]
Cluster moved to unhealthy state

The cluster is marked as unhealthy and a related alert is raised, but it is not clear what the real problem is or how large the impact is.
Created attachment 1477108 [details]
No Volumes Detected (volumes disappear from WA Volumes page)
Created attachment 1477113 [details]
Grafana Cluster dashboard: Unhealthy state, no volumes, all Bricks Up

On the Grafana Cluster dashboard:
a) the cluster is properly marked as Unhealthy
b) Volumes count shows Total and Down as 0 (Total should be 2 in this case)
c) Bricks count shows Total 30, Up 30 (should be Up 0, Down 30)
Created attachment 1477115 [details]
Grafana Volume dashboard: Health N/A, all Bricks Up

On the Grafana Volume dashboard:
d) Health is reported as N/A
e) Bricks count reports Total 18, Up 18 (should be Up 0, Down 18)
f) all bricks are green (Started)
Created attachment 1477116 [details]
Grafana Brick dashboard: Status Started

On the Grafana Brick dashboard:
g) brick Status is reported as Started (should be Stopped/Down)
To evaluate the impact of this BZ, QE was asked to test:

* what happens when the nodes are up again
* what happens when only one node is up again
(In reply to Martin Bukatovic from comment #7)
> To evaluate impact of this BZ, qe was asked to test:
>
> * what happens when the nodes are up again

When (nearly) all gluster nodes are properly started again, all the required data are populated to Grafana and Tendrl, and the overview is in a consistent and healthy state.

> * what happens when one node only is up again

Starting just one storage node leads to some data being populated to Grafana and some alerts being raised in Tendrl, but the overall overview is not fully consistent. I would consider this an expected state, because a Gluster cluster with only one running node (out of 6 in my case) is not really in a consistent and usable state either.
Daniel/Martin, this is expected behavior as per RHGS-WA. In that scenario, should we go ahead and close this as NOTABUG? Please add your comments.