Log messages about quorum being regained can be missed by the Nagios server if it is shut down or has communication issues with the nodes at the time. As a result, if the Cluster - Quorum status was CRITICAL before the connection issue, it remains CRITICAL even after quorum is restored.
Workaround: The administrator can check the alert in the Nagios UI and, once quorum is regained, manually override the plugin result using the "Submit passive check result for this service" option on the service page.
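The same override can be scripted by writing a PROCESS_SERVICE_CHECK_RESULT external command to the Nagios command file. This is a minimal sketch; the host name, service description, and command-file path below are assumptions and must match your own configuration:

```shell
#!/bin/sh
# Sketch: push a passive OK result for the stale 'Cluster - Quorum' service.
# CMDFILE must match the "command_file" setting in nagios.cfg; HOST is a
# hypothetical cluster host name used by the Nagios configuration.
CMDFILE=/var/spool/nagios/cmd/nagios.cmd
HOST="cluster1"
SVC="Cluster - Quorum"
NOW=$(date +%s)
CMD="[$NOW] PROCESS_SERVICE_CHECK_RESULT;$HOST;$SVC;0;OK : Quorum regained"
echo "$CMD"                          # inspect the command line first
# printf '%s\n' "$CMD" > "$CMDFILE"  # uncomment on the Nagios server itself
```

Submitting the result through the UI and writing it to the command file are equivalent; both feed the same external-command interface.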
Description: Sweta Anandpara, 2016-04-14 05:57:29 UTC
Created attachment 1147043 [details]
screenshot of Nagios UI
Description of problem:
Hit the above issue on build 3.7.9-1. Imported an existing cluster into RHSC and exported it back. All the other services' and hosts' statuses returned to normal, except the Cluster - Quorum status, which continues to be RED.
Version-Release number of selected component (if applicable):
glusterfs-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.3-1.el7rhgs.noarch
gluster-nagios-addons-0.2.5-1.el7rhgs.x86_64
How reproducible: 1:1
Steps to Reproduce:
1. Have a 4-node cluster, with the Nagios server on one of the existing RHGS nodes.
2. Import it into RHSC. Verify that the Nagios server UI no longer works.
3. Export the cluster back as a standalone cluster and restart nagios/nrpe.
4. Verify that the Nagios server UI is shown correctly, with the services'/hosts' status returning to normal (or to whatever state reflects the back end of the cluster).
Actual results:
In step 4, all service/host statuses eventually return to GREEN except 'Cluster - Quorum status', which continues to be RED. Rebooting the cluster, running auto-config again, and setting 'cluster.server-quorum-type' of the volumes back to 'server' are all ineffective in returning 'Cluster - Quorum' to a healthy state.
Expected results:
Cluster - Quorum status should be shown as GREEN when no quorum-related problem exists.
Additional info:
[root@dhcp47-188 ~]# rpm -qa | grep gluster-nagios
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
[root@dhcp47-188 ~]# rpm -qa | grep gluster
glusterfs-api-3.7.9-1.el7rhgs.x86_64
glusterfs-libs-3.7.9-1.el7rhgs.x86_64
glusterfs-api-devel-3.7.9-1.el7rhgs.x86_64
vdsm-gluster-4.16.30-1.3.el7rhgs.noarch
glusterfs-3.7.9-1.el7rhgs.x86_64
glusterfs-cli-3.7.9-1.el7rhgs.x86_64
glusterfs-geo-replication-3.7.9-1.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-client-xlators-3.7.9-1.el7rhgs.x86_64
glusterfs-server-3.7.9-1.el7rhgs.x86_64
glusterfs-rdma-3.7.9-1.el7rhgs.x86_64
glusterfs-devel-3.7.9-1.el7rhgs.x86_64
gluster-nagios-addons-0.2.6-1.el7rhgs.x86_64
glusterfs-fuse-3.7.9-1.el7rhgs.x86_64
[root@dhcp47-188 ~]#
[root@dhcp47-188 ~]# gluster peer status
Number of Peers: 3
Hostname: 10.70.46.193
Uuid: f8e7ae42-da4f-4691-85b6-96a03aebd511
State: Peer in Cluster (Connected)
Hostname: 10.70.46.187
Uuid: 3e437522-f4f8-4bb5-9261-6a104cb60a45
State: Peer in Cluster (Connected)
Hostname: 10.70.46.215
Uuid: 763002c8-ecf8-4f13-9107-2e3410e10f0c
State: Peer in Cluster (Connected)
[root@dhcp47-188 ~]#
[root@dhcp47-188 ~]# gluster v info
Volume Name: nash
Type: Distributed-Replicate
Volume ID: 86241d2a-68a9-4547-a105-99282922aea2
Status: Started
Number of Bricks: 2 x 3 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick3/nash
Brick2: 10.70.46.193:/rhs/brick3/nash
Brick3: 10.70.46.187:/rhs/brick3/nash
Brick4: 10.70.47.188:/rhs/brick4/nash
Brick5: 10.70.46.193:/rhs/brick4/nash
Brick6: 10.70.46.187:/rhs/brick4/nash
Options Reconfigured:
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
user.smb: enable
performance.readdir-ahead: on
cluster.server-quorum-type: server
Volume Name: ozone
Type: Disperse
Volume ID: 6ed877ea-f06d-49ba-813b-43d8e5092aa3
Status: Started
Number of Bricks: 1 x (4 + 2) = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.47.188:/rhs/brick1/ozone
Brick2: 10.70.46.215:/rhs/brick1/ozone
Brick3: 10.70.46.193:/rhs/brick1/ozone
Brick4: 10.70.46.187:/rhs/brick1/ozone
Brick5: 10.70.47.188:/rhs/brick2/ozone
Brick6: 10.70.46.215:/rhs/brick2/ozone
Options Reconfigured:
features.inode-quota: off
features.quota: off
performance.readdir-ahead: on
cluster.server-quorum-type: server
[root@dhcp47-188 ~]#
Was the NSCA port (5667) on the node running the Nagios server open?
Please note that when you add the nodes to RHSC, the firewall is reconfigured by RHSC, so the NSCA port may no longer be open.
In step 3, what exactly was done to "export the cluster back"? Was the cluster removed from RHSC?
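If the firewall reconfiguration is indeed the cause, the port can be checked and re-opened on the Nagios server. This sketch assumes firewalld; it only prints the commands so they can be reviewed before running them on the server:

```shell
#!/bin/sh
# NSCA listens on 5667/tcp; RHSC's firewall reconfiguration may have closed it.
NSCA_PORT=5667
# Commands to run on the Nagios server (printed here, not executed):
echo "firewall-cmd --query-port=${NSCA_PORT}/tcp"
echo "firewall-cmd --permanent --add-port=${NSCA_PORT}/tcp && firewall-cmd --reload"
```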
This is a corner case: the Nagios server was shut down while the quorum status was CRITICAL, and it missed the messages when quorum returned to normal (as Nagios was not running).
A change will be required to the active freshness check to handle the CRITICAL case as well.
Moving this out of 3.1.3.
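For reference, the active freshness check mentioned above is driven by the service's freshness settings. This is an illustrative fragment only; the host name, threshold, and command name are assumptions, not the shipped gluster-nagios configuration:

```
define service {
    host_name               cluster1            ; hypothetical
    service_description     Cluster - Quorum
    active_checks_enabled   0                   ; normally passive-only
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     1800                ; seconds without a passive result
    check_command           check_cluster_quorum_status
    ; the freshness check_command runs when passive results go stale; it is
    ; this command that would need to recover from a stale CRITICAL as well
}
```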
The user continues to receive multiple service alert mails saying 'cluster quorum state is critical'. Is there a workaround once a user ends up in this state?
True, it would be rare for a user to experience this, but it would be good to have a documented procedure to get the Nagios UI to show the healthy state of the cluster.
Nagios has a way to acknowledge alerts; you can do this from the service page in the Nagios UI.
Once acknowledged, alerts are generated only on further state changes.
To reset the plugin status, an administrator can override it using the "Submit passive check result for this service" feature on the service page.
Thank you for your report. However, this bug is being closed as it is logged against gluster-nagios monitoring, for which no further development is being undertaken.