Bug 1452069 - [CTDB] Nagios shows ctdb service as CRITICAL even when all nodes are healthy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: ctdb
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Anoop C S
QA Contact: Sweta Anandpara
URL:
Whiteboard:
Depends On:
Blocks: 1417151
Reported: 2017-05-18 09:30 UTC by Sweta Anandpara
Modified: 2017-09-21 04:47 UTC (History)

Fixed In Version: samba-4.6.3-2.el7rhgs
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-21 04:47:10 UTC
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:2780 | 0 | normal | SHIPPED_LIVE | Red Hat Gluster Storage 3.3.0 samba bug fixes and enhancement update | 2017-09-21 08:17:05 UTC
Samba Project | 12802 | 0 | None | None | None | 2017-05-24 09:06:08 UTC

Description Sweta Anandpara 2017-05-18 09:30:30 UTC
Description of problem:
=======================
On a 4-node cluster with Nagios configured and ctdb enabled, the ctdb service shows as 'CRITICAL' with the message 'Node status: nodes:4'. The output format of the 'ctdb status' and 'ctdb nodestatus' commands appears to have changed, and Nagios cannot parse it correctly, so it reports 'CRITICAL' even when all nodes are healthy.
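
As an illustration of parsing that tolerates both output shapes shown in this report, here is a minimal sketch. It assumes a check only needs the per-node "pnn:" lines and can ignore any other line, such as the "Number of nodes:4" header. The names NODE_LINE and parse_nodestatus are hypothetical and are not taken from gluster-nagios-addons, whose actual parsing logic is not shown in this bug.

```python
import re

# Matches node lines such as "pnn:0 10.70.47.127     OK (THIS NODE)".
# The status field is taken as one whitespace-free token (assumption).
NODE_LINE = re.compile(r"^pnn:(\d+)\s+(\S+)\s+(\S+)(\s+\(THIS NODE\))?\s*$")


def parse_nodestatus(output):
    """Return a list of (pnn, address, status, is_this_node) tuples."""
    nodes = []
    for line in output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match is None:
            # Skip anything that is not a node line, e.g. "Number of nodes:4".
            continue
        pnn, address, status, this_node = match.groups()
        nodes.append((int(pnn), address, status, this_node is not None))
    return nodes
```

Fed the 'ctdb nodestatus' output with the extra header line, this returns one tuple per node and drops the header; fed the single-line output of the fixed build, it returns one tuple for the local node.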

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-25
nagios-server-addons-0.2.6-1
gluster-nagios-common-0.2.4-1
gluster-nagios-addons-0.2.8-1

How reproducible:
=================
1:1


Additional info:
=================

[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp46-181.lab.eng.blr.redhat.com
Uuid: 7dd60909-3f7f-4e64-a6eb-4fced5b9aa98
State: Peer in Cluster (Connected)

Hostname: dhcp46-47.lab.eng.blr.redhat.com
Uuid: 8ddacadf-24cc-4631-8742-318995b55f3b
State: Peer in Cluster (Connected)

Hostname: dhcp47-140.lab.eng.blr.redhat.com
Uuid: 2e262e06-d728-4ed6-9375-0c5ae72379af
State: Peer in Cluster (Connected)
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# gluster v list
ctdb
saturday-saturday
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# gluster v info
 
Volume Name: ctdb
Type: Replicate
Volume ID: 64980fe9-85ea-487e-8d0d-39b70c8626b0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.127:/bricks/brick8/ctdb
Brick2: 10.70.46.181:/bricks/brick8/ctdb
Brick3: 10.70.46.47:/bricks/brick8/ctdb
Brick4: 10.70.47.140:/bricks/brick8/ctdb
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
 
Volume Name: saturday-saturday
Type: Distributed-Replicate
Volume ID: 4a24c34c-1144-4f07-9763-6e232c037a67
Status: Started
Snapshot Count: 2
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick0
Brick2: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick1
Brick3: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick2
Brick4: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick3
Brick5: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick4
Brick6: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick5
Brick7: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick6
Brick8: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick7
Brick9: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick8
Brick10: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick9
Brick11: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick10
Brick12: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick11
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.nl-cache: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
transport.address-family: inet
nfs.disable: on
server.allow-insecure: on
performance.stat-prefetch: on
storage.batch-fsync-delay-usec: 0
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.cache-samba-metadata: on
performance.parallel-readdir: on
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# ctdb ip
Public IPs on node 0
10.70.44.154 0
10.70.44.155 2
10.70.44.156 1
10.70.44.157 3
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# ctdb status
Number of nodes:4
pnn:0 10.70.47.127     OK (THIS NODE)
pnn:1 10.70.46.181     OK
pnn:2 10.70.46.47      OK
pnn:3 10.70.47.140     OK
Generation:1636868653
Size:4
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
Recovery mode:NORMAL (0)
Recovery master:3
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# ctdb nodestatus
Number of nodes:4
pnn:0 10.70.47.127     OK (THIS NODE)
pnn:1 10.70.46.181     OK
pnn:2 10.70.46.47      OK
pnn:3 10.70.47.140     OK
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# rpm -qa | egrep nagios|gluster
unrecognized word: nagios-server-addons-0.2.6-1.el7rhgs.x86_64 (position 0)
[root@dhcp47-127 ~]# rpm -qa | grep gluster
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# rpm -qa | grep nagios
nagios-server-addons-0.2.6-1.el7rhgs.x86_64
nagios-plugins-1.4.16-12.el7rhgs.x86_64
nagios-plugins-procs-1.4.16-12.el7rhgs.x86_64
nagios-plugins-ping-1.4.16-12.el7rhgs.x86_64
nagios-3.5.1-9.el7.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
pnp4nagios-0.6.22-3.1.el7rhgs.x86_64
nagios-plugins-ide_smart-1.4.16-12.el7rhgs.x86_64
nagios-plugins-dummy-1.4.16-12.el7rhgs.x86_64
nagios-common-3.5.1-9.el7.x86_64
nagios-plugins-nrpe-2.15-4.2.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 


On nagios web UI:
===================
	
CTDB       CRITICAL	05-18-2017 09:28:22	0d 3h 19m 6s	3/3	Node status: nodes:4 


Current Status:	  CRITICAL   (for 0d 3h 19m 45s)
Status Information:	Node status: nodes:4
Performance Data:	
Current Attempt:	3/3  (HARD state)
Last Check Time:	05-18-2017 09:29:24
Check Type:	PASSIVE
Check Latency / Duration:	N/A / 0.000 seconds
Next Scheduled Check:  	05-19-2017 06:30:43
Last State Change:	05-18-2017 06:09:45
Last Notification:	05-18-2017 08:12:35 (notification 2)
Is This Service Flapping?	  NO   (0.00% state change)
In Scheduled Downtime?	  NO  
Last Update:	05-18-2017 09:29:26  ( 0d 0h 0m 4s ago)
Active Checks:	  ENABLED  
Passive Checks:	  ENABLED  
Obsessing:	  ENABLED  
Notifications:	  ENABLED  
Event Handler:	  ENABLED  
Flap Detection:	  ENABLED

Comment 2 Anoop C S 2017-05-24 09:06:08 UTC
Hi Sweta,

Good catch.

It's a bug in ctdb, for which I have raised an upstream bug. It crept in during the refactoring of the ctdb code in v4.5.

Judging from the nagios messages, I assume that it is trying to parse the displayed output of `ctdb nodestatus`. Alternatively, it could rely on the exit status of `ctdb nodestatus`, which reflects the health of the current node.
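
The exit-status approach could look roughly like the sketch below. It assumes that an exit status of 0 from `ctdb nodestatus` means the current node is healthy and that any non-zero value should be reported as CRITICAL; that mapping is an assumption for illustration. The function names are hypothetical and this is not the code of the existing nagios plugin.

```python
import subprocess

# Standard Nagios plugin return codes.
NAGIOS_OK, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 2, 3


def nagios_state_from_exit(returncode):
    """Map the exit status of `ctdb nodestatus` to a Nagios state (assumed mapping)."""
    return NAGIOS_OK if returncode == 0 else NAGIOS_CRITICAL


def check_ctdb(command=("ctdb", "nodestatus")):
    """Run the command and return (nagios_state, message)."""
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError as err:
        # The binary is missing or not executable: report UNKNOWN, not CRITICAL.
        return NAGIOS_UNKNOWN, "could not run %s: %s" % (command[0], err)
    state = nagios_state_from_exit(result.returncode)
    # The command output is used only as the human-readable message.
    message = result.stdout.strip() or result.stderr.strip()
    return state, message
```

Because the state comes from the exit status alone, a change in the printed format would affect only the message text and not the reported state.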

Comment 8 Anoop C S 2017-05-30 11:44:41 UTC
Following the discussion we had in our team, moving the component to ctdb, as this is a regression from the previous version.

Comment 11 Sweta Anandpara 2017-06-22 09:33:49 UTC
Tested and verified this on the builds samba-4.6.3-3.el7rhgs.x86_64, ctdb-4.6.3-3.el7rhgs.x86_64 and glusterfs-3.8.4-28.el7rhgs.x86_64.

On the CLI, the `ctdb nodestatus` command now prints the status of the local node only (as expected). With Nagios configured, the ctdb service is displayed as OK.

Moving this to verified in 3.3.

[root@dhcp47-121 ~]# ctdb status
Number of nodes:6
pnn:0 10.70.47.113     OK
pnn:1 10.70.47.114     OK
pnn:2 10.70.47.115     OK
pnn:3 10.70.47.116     OK
pnn:4 10.70.47.117     OK
pnn:5 10.70.47.121     OK (THIS NODE)
Generation:368498462
Size:6
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
hash:4 lmaster:4
hash:5 lmaster:5
Recovery mode:NORMAL (0)
Recovery master:4
[root@dhcp47-121 ~]# ctdb nodestatus
pnn:5 10.70.47.121     OK (THIS NODE)
[root@dhcp47-121 ~]#

Comment 13 errata-xmlrpc 2017-09-21 04:47:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2780

