Description of problem:

When all gluster* daemons on all storage nodes are stopped (or crash), WA doesn't reflect this state properly:

* The cluster is correctly moved to unhealthy state, but without any clearer explanation of the reason.
* All volumes disappear from the WA Volumes page, but they remain in Grafana and are reported as Up.
* All bricks are reported as Up/Started in Grafana.

Version-Release number of selected component (if applicable):

RHGS WA Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
carbon-selinux-1.5.4-2.el7rhgs.noarch
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
etcd-3.2.7-1.el7.x86_64
grafana-4.3.2-3.el7rhgs.x86_64
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
python-carbon-0.9.15-2.1.el7rhgs.noarch
python-etcd-0.4.5-2.el7rhgs.noarch
rubygem-etcd-0.3.0-2.el7rhgs.noarch
tendrl-ansible-1.6.3-6.el7rhgs.noarch
tendrl-api-1.6.3-5.el7rhgs.noarch
tendrl-api-httpd-1.6.3-5.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-grafana-plugins-1.6.3-10.el7rhgs.noarch
tendrl-grafana-selinux-1.5.4-2.el7rhgs.noarch
tendrl-monitoring-integration-1.6.3-10.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-notifier-1.6.3-4.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
tendrl-ui-1.6.3-10.el7rhgs.noarch

Gluster Storage Server:
Red Hat Enterprise Linux Server release 7.5 (Maipo)
Red Hat Gluster Storage Server 3.4.0
collectd-5.7.2-3.1.el7rhgs.x86_64
collectd-ping-5.7.2-3.1.el7rhgs.x86_64
glusterfs-3.12.2-16.el7rhgs.x86_64
glusterfs-api-3.12.2-16.el7rhgs.x86_64
glusterfs-cli-3.12.2-16.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-16.el7rhgs.x86_64
glusterfs-events-3.12.2-16.el7rhgs.x86_64
glusterfs-fuse-3.12.2-16.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-16.el7rhgs.x86_64
glusterfs-libs-3.12.2-16.el7rhgs.x86_64
glusterfs-rdma-3.12.2-16.el7rhgs.x86_64
glusterfs-server-3.12.2-16.el7rhgs.x86_64
gluster-nagios-addons-0.2.10-2.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
libcollectdclient-5.7.2-3.1.el7rhgs.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-14.el7_5.6.x86_64
python2-gluster-3.12.2-16.el7rhgs.x86_64
python-etcd-0.4.5-2.el7rhgs.noarch
tendrl-collectd-selinux-1.5.4-2.el7rhgs.noarch
tendrl-commons-1.6.3-12.el7rhgs.noarch
tendrl-gluster-integration-1.6.3-9.el7rhgs.noarch
tendrl-node-agent-1.6.3-10.el7rhgs.noarch
tendrl-selinux-1.5.4-2.el7rhgs.noarch
vdsm-gluster-4.19.43-2.3.el7rhgs.noarch

How reproducible:
100%

Steps to Reproduce:
1. Install and configure a Gluster Storage Cluster with at least one volume.
2. Install and configure RHGS WA.
3. Import the Gluster Storage Cluster into RHGS WA.
4. Wait a few minutes so that all the required information is synchronized.
5. Stop/kill all gluster processes on all storage nodes:
   # systemctl stop 'gluster*'
   # pkill glusterfs
6. Wait a few minutes and watch various pages of RHGS WA and Grafana.

Actual results:
* The following alert is raised: "Cluster .... moved to unhealthy state"
* The cluster is in Unhealthy state, both on the WA Clusters page and in Grafana.
* All volumes disappear from the WA Volumes page.
* All bricks are marked as Up/Started in Grafana.

Expected results:
* It should be clearly visible what the real problem is and that the impact of the issue on the Gluster side is huge.
* Volumes probably shouldn't disappear from WA and should be marked as down both in WA and on the Grafana dashboard.
* Bricks shouldn't be reported as Up/Started.

Additional info:
See attached screenshots.
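For reference, a quick check like the following can confirm on each storage node that step 5 really left no gluster processes behind before the monitoring pages are inspected (a minimal sketch; the process names assume a default RHGS install, and the script is not part of the reproduction itself):

```shell
#!/bin/sh
# Run on a storage node after "systemctl stop 'gluster*'" and
# "pkill glusterfs" to verify that no gluster daemons or brick
# processes survived.
if pgrep -x glusterd >/dev/null 2>&1 || pgrep glusterfs >/dev/null 2>&1; then
    echo "gluster processes still running:"
    pgrep -a gluster
else
    echo "no gluster processes found"
fi
```

If any process is listed, the WA/Grafana state observed afterwards would not correspond to a fully stopped cluster, so this check helps rule out a half-applied step 5.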
Created attachment 1477107 [details]
Cluster moved to unhealthy state

The cluster is marked as unhealthy and a related alert is raised, but it is not clear what the real problem is or how large the impact is.
Created attachment 1477108 [details]
No Volumes Detected (volumes disappear from WA Volumes page)
Created attachment 1477113 [details]
Grafana Cluster dashboard: Unhealthy state, no volumes, all Bricks Up

On the Grafana Cluster dashboard:
a) the cluster is properly marked as Unhealthy
b) Volumes count shows Total and Down as 0 (Total should be 2 in this case)
c) Bricks count shows Total 30, Up 30 (should be Up 0, Down 30)
Created attachment 1477115 [details]
Grafana Volume dashboard: Health N/A, all Bricks Up

On the Grafana Volume dashboard:
d) Health is reported as N/A
e) Bricks count reports Total 18, Up 18 (should be Up 0, Down 18)
f) all bricks are green (Started)
Created attachment 1477116 [details]
Grafana Brick dashboard: Status Started

On the Grafana Brick dashboard:
g) brick Status is reported as Started (should be Stopped/Down)
To evaluate the impact of this BZ, QE was asked to test:

* what happens when the nodes are up again
* what happens when only one node is up again
(In reply to Martin Bukatovic from comment #7)
> To evaluate impact of this BZ, qe was asked to test:
>
> * what happens when the nodes are up again

When (nearly) all gluster nodes are properly started again, all the required data are populated to Grafana and Tendrl, and the overview is in a consistent and healthy state.

> * what happens when one node only is up again

Starting just one storage node leads to some data being populated to Grafana and some alerts being raised in Tendrl, but the overall overview is not fully consistent. I would consider this an expected state, because a Gluster cluster with only one running node (out of 6 in my case) is not really in a consistent and usable state either.
Daniel/Martin, this is expected behavior as per RHGS-WA. In that scenario, should we go ahead and close this as NOTABUG? Please add your comments.