Created attachment 1147043 [details]
screenshot of nagios UI

Description of problem:
Hit the above issue on build 3.7.9-1. Imported an existing cluster into RHSC and exported it back. All the other services' and hosts' statuses returned to normal, except 'Cluster - Quorum', which continues to be in RED.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64

How reproducible:
1:1

Steps to Reproduce:
1. Have a 4-node cluster, with the Nagios server on one of the existing RHGS nodes.
2. Import it into RHSC. Verify that the Nagios server UI no longer works.
3. Export the cluster back as a standalone cluster and restart nagios/nrpe.
4. Verify that the Nagios server UI is shown correctly, with the services'/hosts' statuses returning to normal (or to whatever they should be, based on the state of the cluster in the back end).

Actual results:
In step 4, all the services'/hosts' statuses eventually return to GREEN, except 'Cluster - Quorum', which continues to be in RED. Rebooting the cluster, running auto-config again, setting 'cluster.server-quorum-type' of the volumes back to 'server' - none of these is effective in returning 'Cluster - Quorum' to a healthy state.

Expected results:
'Cluster - Quorum' status should be shown as GREEN if no quorum-related problem exists.
Additional info:

[root@dhcp47-188 ~]# rpm -qa | grep gluster-nagios
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# rpm -qa | grep gluster
glusterfs-api-3.7.9-1.el7rhgs.x86_64
glusterfs-libs-3.7.9-1.el7rhgs.x86_64
glusterfs-api-devel-3.7.9-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.9-1.el7rhgs.x86_64
glusterfs-cli-3.7.9-1.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.7.9-1.el7rhgs.x86_64
glusterfs-server-3.7.9-1.el7rhgs.x86_64
glusterfs-rdma-3.7.9-1.el7rhgs.x86_64
glusterfs-devel-3.7.9-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.46.193
Uuid: f8e7ae42-da4f-4691-85b6-96a03aebd511
State: Peer in Cluster (Connected)

Hostname: 10.70.46.187
Uuid: 3e437522-f4f8-4bb5-9261-6a104cb60a45
State: Peer in Cluster (Connected)

Hostname: 10.70.46.215
Uuid: 763002c8-ecf8-4f13-9107-2e3410e10f0c
State: Peer in Cluster (Connected)
[root@dhcp47-188 ~]# gluster v info

Volume Name: nash
Type: Distributed-Replicate
Volume ID: 86241d2a-68a9-4547-a105-99282922aea2
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick3/nash
Brick2: 10.70.46.193:/rhs/brick3/nash
Brick3: 10.70.46.187:/rhs/brick3/nash
Brick4: 10.70.47.188:/rhs/brick4/nash
Brick5: 10.70.46.193:/rhs/brick4/nash
Brick6: 10.70.46.187:/rhs/brick4/nash
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
user.smb: enable
performance.readdir-ahead: on
cluster.server-quorum-type: server

Volume Name: ozone
Type: Disperse
Volume ID: 6ed877ea-f06d-49ba-813b-43d8e5092aa3
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick1/ozone
Brick2: 10.70.46.215:/rhs/brick1/ozone
Brick3: 10.70.46.193:/rhs/brick1/ozone
Brick4: 10.70.46.187:/rhs/brick1/ozone
Brick5: 10.70.47.188:/rhs/brick2/ozone
Brick6: 10.70.46.215:/rhs/brick2/ozone
Options Reconfigured:
features.inode-quota: off
features.quota: off
performance.readdir-ahead: on
cluster.server-quorum-type: server
[root@dhcp47-188 ~]#
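For reference, the recovery attempts described in the report (re-enabling server-side quorum on each volume and restarting the monitoring daemons) boil down to a few commands. The sketch below is a dry run that only prints the commands; the volume names nash and ozone come from the 'gluster v info' output above, and the service unit names are assumptions that may differ on your system.

```shell
# Dry-run sketch: print the recovery commands instead of executing them.
# Pipe the output to sh (or drop the 'echo's) on an actual cluster node.
for vol in nash ozone; do
  # re-enable server-side quorum enforcement per volume
  echo "gluster volume set $vol cluster.server-quorum-type server"
done
# restart the monitoring daemons so checks are re-scheduled
echo "systemctl restart nagios nrpe"
```

As the report notes, none of these steps actually cleared the RED 'Cluster - Quorum' state in this case.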
Was the NSCA port (5667) on the node running the Nagios server open? Please note that when you add the nodes to RHSC, the firewall is reconfigured by RHSC - hence the NSCA port may not have been open.

In step 3 - "export the cluster back" - what exactly was done? Was it removal of the cluster from RHSC?
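One way to check this on the Nagios server node is to compare the NSCA port against the firewall's open-port list. A hedged sketch, assuming firewalld is in use; the helper only scans a "port/proto" list, so here it is fed a sample list rather than live firewall-cmd output:

```shell
# Sketch: check whether 5667/tcp survived the firewall reconfiguration
# done by RHSC. On a real node, set PORTS from: firewall-cmd --list-ports
NSCA_PORT=5667

nsca_port_listed() {
  # $1 = port number, $2 = space-separated list like "22/tcp 5667/tcp"
  case " $2 " in
    *" $1/tcp "*) return 0 ;;
    *)            return 1 ;;
  esac
}

PORTS="22/tcp 5666/tcp 5667/tcp"   # sample list for the sketch
if nsca_port_listed "$NSCA_PORT" "$PORTS"; then
  echo "nsca port open"
else
  echo "nsca port blocked - reopen with: firewall-cmd --add-port=5667/tcp --permanent"
fi
```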
This is a corner case: the Nagios server was shut down while quorum status was critical, so it missed the messages when quorum returned to normal (as Nagios was not running). A change will be required to the active freshness check to handle the critical case too. Moving this out of 3.1.3.
The user continues to get multiple service alert mails saying 'cluster quorum state is critical'. Is there a workaround once a user ends up in this state? True, it would be rare for a user to experience this, but it would be good to have a documented procedure to get the Nagios UI to show the healthy state of the cluster.
Nagios has a way to acknowledge alerts - you can do this from the services page in the Nagios UI. Once acknowledged, alerts are generated only on further state changes.

To reset the plugin status, an administrator can override the status using the "Submit passive check result for this service" feature on the service page.
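The same override can also be done from the command line with Nagios's standard PROCESS_SERVICE_CHECK_RESULT external command. A hedged sketch: the host name and command-file path below are assumptions (check command_file in nagios.cfg), and the service description must match exactly what is configured:

```shell
# Sketch: submit a passive check result to force the service back to OK.
CMDFILE="/var/spool/nagios/cmd/nagios.cmd"   # assumption: default command pipe
HOST="test-cluster"                          # assumption: cluster's host name in Nagios
SERVICE="Cluster - Quorum"                   # service description shown in the UI
NOW=$(date +%s)

# Format: [time] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<output>
# return_code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
CMD="[$NOW] PROCESS_SERVICE_CHECK_RESULT;$HOST;$SERVICE;0;OK - quorum regained"
echo "$CMD"

# Only write if the command pipe actually exists (i.e., on the Nagios server):
if [ -p "$CMDFILE" ]; then
  printf '%s\n' "$CMD" > "$CMDFILE"
fi
```

Note that the service's next active or freshness check may overwrite this passive result, which is exactly the behavior this bug is about.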
Doc text looks good to me.
Thank you for your report. However, this bug is being closed as it is logged against gluster-nagios monitoring, for which no further new development is being undertaken.