Created attachment 1147043 [details]
screenshot of nagios UI

Description of problem:
Hit the above issue on build 3.7.9-1. Imported an existing cluster into RHSC and exported it back. All the other services' and hosts' statuses returned to normal, except 'Cluster - Quorum', which continues to be in RED.

Version-Release number of selected component (if applicable):
glusterfs-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64

How reproducible:
1:1

Steps to Reproduce:
1. Have a 4-node cluster, with the Nagios server on one of the existing RHGS nodes.
2. Import it into RHSC. Verify that the Nagios server UI no longer works.
3. Export the cluster back as a standalone cluster and restart nagios/nrpe.
4. Verify that the Nagios server UI is shown correctly, with the services'/hosts' statuses returning to normal (or to whatever they should be, based on the state of the cluster in the back end).

Actual results:
In step 4, all the services'/hosts' statuses eventually return to GREEN, except 'Cluster - Quorum', which continues to be in RED. Rebooting the cluster, running auto-config again, setting 'cluster.server-quorum-type' of the volumes back to 'server' - none of these is effective in returning 'Cluster - Quorum' to a healthy state.

Expected results:
'Cluster - Quorum' status should be shown as GREEN if no quorum-related problem exists.
Additional info:

[root@dhcp47-188 ~]# rpm -qa | grep gluster-nagios
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# rpm -qa | grep gluster
glusterfs-api-3.7.9-1.el7rhgs.x86_64
glusterfs-libs-3.7.9-1.el7rhgs.x86_64
glusterfs-api-devel-3.7.9-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.9-1.el7rhgs.x86_64
glusterfs-cli-3.7.9-1.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.7.9-1.el7rhgs.x86_64
glusterfs-server-3.7.9-1.el7rhgs.x86_64
glusterfs-rdma-3.7.9-1.el7rhgs.x86_64
glusterfs-devel-3.7.9-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# gluster peer status
Number of Peers: 3

Hostname: 10.70.46.193
Uuid: f8e7ae42-da4f-4691-85b6-96a03aebd511
State: Peer in Cluster (Connected)

Hostname: 10.70.46.187
Uuid: 3e437522-f4f8-4bb5-9261-6a104cb60a45
State: Peer in Cluster (Connected)

Hostname: 10.70.46.215
Uuid: 763002c8-ecf8-4f13-9107-2e3410e10f0c
State: Peer in Cluster (Connected)
[root@dhcp47-188 ~]# gluster v info

Volume Name: nash
Type: Distributed-Replicate
Volume ID: 86241d2a-68a9-4547-a105-99282922aea2
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick3/nash
Brick2: 10.70.46.193:/rhs/brick3/nash
Brick3: 10.70.46.187:/rhs/brick3/nash
Brick4: 10.70.47.188:/rhs/brick4/nash
Brick5: 10.70.46.193:/rhs/brick4/nash
Brick6: 10.70.46.187:/rhs/brick4/nash
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
user.smb: enable
performance.readdir-ahead: on
cluster.server-quorum-type: server

Volume Name: ozone
Type: Disperse
Volume ID: 6ed877ea-f06d-49ba-813b-43d8e5092aa3
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick1/ozone
Brick2: 10.70.46.215:/rhs/brick1/ozone
Brick3: 10.70.46.193:/rhs/brick1/ozone
Brick4: 10.70.46.187:/rhs/brick1/ozone
Brick5: 10.70.47.188:/rhs/brick2/ozone
Brick6: 10.70.46.215:/rhs/brick2/ozone
Options Reconfigured:
features.inode-quota: off
features.quota: off
performance.readdir-ahead: on
cluster.server-quorum-type: server
[root@dhcp47-188 ~]#
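For reference, the recovery attempts described in the report (re-enabling server-side quorum on each volume and restarting the monitoring daemons) boil down to a few commands. The sketch below is a dry run that only prints the commands; the volume names nash and ozone come from the 'gluster v info' output above, and the service unit names are assumptions that may differ on your system.

```shell
# Dry-run sketch: print the recovery commands instead of executing them.
# Pipe the output to sh (or drop the 'echo's) on an actual cluster node.
for vol in nash ozone; do
  # re-enable server-side quorum enforcement per volume
  echo "gluster volume set $vol cluster.server-quorum-type server"
done
# restart the monitoring daemons so checks are re-scheduled
echo "systemctl restart nagios nrpe"
```

As the report notes, none of these steps actually cleared the RED 'Cluster - Quorum' state in this case.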
Was the NSCA port (5667) on the node running the Nagios server open? Please note that when you add the nodes to RHSC, the firewall is reconfigured by RHSC - hence the NSCA port may not have been open.

In step 3 - "export the cluster back" - what exactly was done? Was it removal of the cluster from RHSC?
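One way to check this on the Nagios server node is to compare the NSCA port against the firewall's open-port list. A hedged sketch, assuming firewalld is in use; the helper only scans a "port/proto" list, so here it is fed a sample list rather than live firewall-cmd output:

```shell
# Sketch: check whether 5667/tcp survived the firewall reconfiguration
# done by RHSC. On a real node, set PORTS from: firewall-cmd --list-ports
NSCA_PORT=5667

nsca_port_listed() {
  # $1 = port number, $2 = space-separated list like "22/tcp 5667/tcp"
  case " $2 " in
    *" $1/tcp "*) return 0 ;;
    *)            return 1 ;;
  esac
}

PORTS="22/tcp 5666/tcp 5667/tcp"   # sample list for the sketch
if nsca_port_listed "$NSCA_PORT" "$PORTS"; then
  echo "nsca port open"
else
  echo "nsca port blocked - reopen with: firewall-cmd --add-port=5667/tcp --permanent"
fi
```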
This is a corner case: the Nagios server was shut down while quorum status was critical, so it missed the messages when quorum returned to normal (as Nagios was not running). A change will be required to the active freshness check to handle the critical case too. Moving this out of 3.1.3.
The user continues to get multiple service alert mails saying 'cluster quorum state is critical'. Is there a workaround once a user ends up in this state? True, it would be rare for a user to experience this, but it would be good to have a documented procedure to get the Nagios UI to show the healthy state of the cluster.
Nagios has a way to acknowledge alerts - you can do this from the services page in the Nagios UI. Once acknowledged, alerts are generated only on further state changes.

To reset the plugin status, an administrator can override the status using the "Submit passive check result for this service" feature on the service page.
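The same override can also be done from the command line with Nagios's standard PROCESS_SERVICE_CHECK_RESULT external command. A hedged sketch: the host name and command-file path below are assumptions (check command_file in nagios.cfg), and the service description must match exactly what is configured:

```shell
# Sketch: submit a passive check result to force the service back to OK.
CMDFILE="/var/spool/nagios/cmd/nagios.cmd"   # assumption: default command pipe
HOST="test-cluster"                          # assumption: cluster's host name in Nagios
SERVICE="Cluster - Quorum"                   # service description shown in the UI
NOW=$(date +%s)

# Format: [time] PROCESS_SERVICE_CHECK_RESULT;<host>;<service>;<return_code>;<output>
# return_code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
CMD="[$NOW] PROCESS_SERVICE_CHECK_RESULT;$HOST;$SERVICE;0;OK - quorum regained"
echo "$CMD"

# Only write if the command pipe actually exists (i.e., on the Nagios server):
if [ -p "$CMDFILE" ]; then
  printf '%s\n' "$CMD" > "$CMDFILE"
fi
```

Note that the service's next active or freshness check may overwrite this passive result, which is exactly the behavior this bug is about.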
Doc text looks good to me.
Thank you for your report. However, this bug is being closed as it is logged against gluster-nagios monitoring, for which no further new development is being undertaken.