Bug 1327017

Summary: Cluster Quorum Status is flagged 'critical' and continues to remain so, in spite of a healthy state of the cluster
Product: Red Hat Gluster Storage [Red Hat Storage]
Reporter: Sweta Anandpara <sanandpa>
Component: gluster-nagios-addons
Assignee: Sahina Bose <sabose>
Status: CLOSED CANTFIX
QA Contact: Sweta Anandpara <sanandpa>
Severity: high
Priority: medium
Version: rhgs-3.1
CC: bmohanra, mzywusko, olim, rhinduja, sabose, sanandpa
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Doc Type: Known Issue
Doc Text:
Log messages related to quorum being regained are missed by the Nagios server when it is either shut down or has communication issues with the nodes. As a result, if the Cluster Quorum status was critical before the connection issues, it continues to be reported as critical. Workaround: The administrator can check the alert from the Nagios UI and, once quorum is regained, manually change the plugin result using the "Submit passive check result for this service" option on the service page.
Last Closed: 2018-01-30 11:11:53 UTC
Type: Bug
Bug Blocks: 1311843    
Attachments:
screenshot of nagios UI

Description Sweta Anandpara 2016-04-14 05:57:29 UTC
Created attachment 1147043 [details]
screenshot of nagios UI

Description of problem:
Hit the above issue on build 3.7.9-1. Imported an existing cluster into RHSC and then exported it back out. All the other services' and hosts' statuses returned to normal, except for the 'Cluster - Quorum' status, which continues to show RED.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64

How reproducible: 1:1


Steps to Reproduce:
1. Have a 4-node cluster, with the Nagios server on one of the existing RHGS nodes.
2. Import it into RHSC. Verify that the Nagios server UI no longer works.
3. Export the cluster back out as a standalone cluster and restart nagios/nrpe (example commands follow this list).
4. Verify that the Nagios server UI is shown correctly, with the services'/hosts' statuses returning to normal (or to whatever reflects the actual state of the cluster in the back end).
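
A minimal sketch of the restart in step 3, assuming the systemd unit names are 'nagios' on the Nagios server node and 'nrpe' on the storage nodes (adjust if the unit names differ in your installation):

# on the node running the Nagios server
systemctl restart nagios
# on each RHGS node monitored via NRPE
systemctl restart nrpe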


Actual results:
In step 4, all the services'/hosts' statuses eventually return to GREEN, except for the 'Cluster - Quorum' status, which continues to be RED. Rebooting the cluster, running auto-config again, and setting 'cluster.server-quorum-type' back to 'server' on the volumes (a sketch of the commands follows) - nothing seems to be effective in getting 'Cluster - Quorum' back to a healthy state.
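
For reference, a sketch of the quorum-type reset mentioned above, using the standard gluster CLI and the volume names from this setup (nash and ozone):

gluster volume set nash cluster.server-quorum-type server
gluster volume set ozone cluster.server-quorum-type server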

Expected results:
The 'Cluster - Quorum' status should be shown as GREEN if no quorum-related problem exists.

Additional info:

[root@dhcp47-188 ~]# 
[root@dhcp47-188 ~]# rpm -qa | grep gluster-nagios
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# rpm -qa | grep gluster
glusterfs-api-3.7.9-1.el7rhgs.x86_64
glusterfs-libs-3.7.9-1.el7rhgs.x86_64
glusterfs-api-devel-3.7.9-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.9-1.el7rhgs.x86_64
glusterfs-cli-3.7.9-1.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.7.9-1.el7rhgs.x86_64
glusterfs-server-3.7.9-1.el7rhgs.x86_64
glusterfs-rdma-3.7.9-1.el7rhgs.x86_64
glusterfs-devel-3.7.9-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# 
[root@dhcp47-188 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.46.193
Uuid: f8e7ae42-da4f-4691-85b6-96a03aebd511
State: Peer in Cluster (Connected)

Hostname: 10.70.46.187
Uuid: 3e437522-f4f8-4bb5-9261-6a104cb60a45
State: Peer in Cluster (Connected)

Hostname: 10.70.46.215
Uuid: 763002c8-ecf8-4f13-9107-2e3410e10f0c
State: Peer in Cluster (Connected)
[root@dhcp47-188 ~]# 
[root@dhcp47-188 ~]# 
[root@dhcp47-188 ~]# gluster v info
 
Volume Name: nash
Type: Distributed-Replicate
Volume ID: 86241d2a-68a9-4547-a105-99282922aea2
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick3/nash
Brick2: 10.70.46.193:/rhs/brick3/nash
Brick3: 10.70.46.187:/rhs/brick3/nash
Brick4: 10.70.47.188:/rhs/brick4/nash
Brick5: 10.70.46.193:/rhs/brick4/nash
Brick6: 10.70.46.187:/rhs/brick4/nash
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
user.smb: enable
performance.readdir-ahead: on
cluster.server-quorum-type: server
 
Volume Name: ozone
Type: Disperse
Volume ID: 6ed877ea-f06d-49ba-813b-43d8e5092aa3
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick1/ozone
Brick2: 10.70.46.215:/rhs/brick1/ozone
Brick3: 10.70.46.193:/rhs/brick1/ozone
Brick4: 10.70.46.187:/rhs/brick1/ozone
Brick5: 10.70.47.188:/rhs/brick2/ozone
Brick6: 10.70.46.215:/rhs/brick2/ozone
Options Reconfigured:
features.inode-quota: off
features.quota: off
performance.readdir-ahead: on
cluster.server-quorum-type: server
[root@dhcp47-188 ~]#

Comment 2 Sahina Bose 2016-04-19 07:31:15 UTC
Was the nsca port (5667) on the node running the Nagios server open?

Please note that when you add the nodes to RHSC, the firewall is reconfigured by RHSC - hence the nsca port may not have been open.
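
If it helps, a quick sketch for checking and re-opening the NSCA port, assuming firewalld is in use on the node running the Nagios server:

# check whether 5667/tcp is currently allowed
firewall-cmd --list-ports
# open it persistently and reload firewalld if it is missing
firewall-cmd --permanent --add-port=5667/tcp
firewall-cmd --reload
# confirm the NSCA daemon is listening
ss -tlnp | grep 5667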

In step 3 - "export the cluster back" - what exactly was done? Was it removal of the cluster from RHSC?

Comment 4 Sahina Bose 2016-04-20 06:15:41 UTC
This is a corner case: the Nagios server was shut down while the quorum status was critical, and it missed the messages when quorum returned to normal (as Nagios was not running).

A change will be required to the active freshness check to handle the critical case too.
Moving this out of 3.1.3.

Comment 5 Sweta Anandpara 2016-04-20 06:55:16 UTC
The user continues to receive multiple service alert mails saying 'cluster quorum state is critical'. Is there a workaround once a user ends up in this state?

True, it would be rare for a user to experience this, but it would be good to have some procedure/steps to get the Nagios UI to show the healthy state of the cluster.

Comment 6 Sahina Bose 2016-04-22 11:17:05 UTC
Nagios has a way to acknowledge alerts - you can do this from the services page in the Nagios UI.
Once acknowledged, alerts are generated only on further state changes.
To reset the plugin status, an administrator can override the status using the "Submit passive check result for this service" feature on the service page.
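
The same reset can also be scripted through the Nagios external command file; a minimal sketch, assuming the external command file is at /var/spool/nagios/cmd/nagios.cmd (check the command_file setting in nagios.cfg) and the service shows up as 'Cluster - Quorum' on a cluster host named 'cluster1' (adjust the path, host and service names to match the actual configuration):

now=$(date +%s)
printf '[%s] PROCESS_SERVICE_CHECK_RESULT;cluster1;Cluster - Quorum;0;OK - quorum regained\n' "$now" \
    >> /var/spool/nagios/cmd/nagios.cmd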

Comment 7 Sweta Anandpara 2016-05-02 08:16:51 UTC
Doc text looks good to me.

Comment 8 Sahina Bose 2018-01-30 11:11:53 UTC
Thank you for your report. However, this bug is being closed as it is logged against the gluster-nagios monitoring components, for which no further development is being undertaken.