1452069 – [CTDB] Nagios shows ctdb service as CRITICAL even when all nodes are healthy


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	ctdb
Sub Component:
Version:	rhgs-3.3
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Target Release:	RHGS 3.3.0
Assignee:	Anoop C S
QA Contact:	Sweta Anandpara
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1417151

Reported:	2017-05-18 09:30 UTC by Sweta Anandpara
Modified:	2017-09-21 04:47 UTC
CC List:	6 users
Fixed In Version:	samba-4.6.3-2.el7rhgs
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-09-21 04:47:10 UTC
Embargoed:
Dependent Products:


Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:2780	0	normal	SHIPPED_LIVE	Red Hat Gluster Storage 3.3.0 samba bug fixes and enhancement update	2017-09-21 08:17:05 UTC
Samba Project	12802	0	None	None	None	2017-05-24 09:06:08 UTC

Description Sweta Anandpara 2017-05-18 09:30:30 UTC

Description of problem:
=======================
On a 4-node cluster with Nagios configured and CTDB enabled, the ctdb service shows as 'CRITICAL' with the message 'Node status: nodes:4'. The output format of the 'ctdb status' and 'ctdb nodestatus' commands appears to have changed, and Nagios can no longer parse it correctly, so it reports 'CRITICAL' even when all nodes are healthy.
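To illustrate the parsing problem, here is a minimal Python sketch of a parser that tolerates the extra 'Number of nodes:4' header line. It is not the code of the actual Nagios plugin, which is not shown in this report; the function name `parse_node_lines` and the regular expression are hypothetical and only assume the line layout visible in the console output below.

```python
import re

# Assumed layout of a node line, taken from the console output in this
# report, e.g. "pnn:0 10.70.47.127     OK (THIS NODE)".
NODE_LINE = re.compile(r"^pnn:(\d+)\s+(\S+)\s+(\S+)(\s+\(THIS NODE\))?\s*$")


def parse_node_lines(output):
    """Return a list of (pnn, address, state, is_this_node) tuples.

    Lines that are not node lines (for example "Number of nodes:4")
    are skipped instead of being treated as a node status.
    """
    nodes = []
    for line in output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match:
            pnn, address, state, this_node = match.groups()
            nodes.append((int(pnn), address, state, this_node is not None))
    return nodes


# Sample text copied from the 'ctdb nodestatus' output in this report.
sample = """Number of nodes:4
pnn:0 10.70.47.127     OK (THIS NODE)
pnn:1 10.70.46.181     OK
pnn:2 10.70.46.47      OK
pnn:3 10.70.47.140     OK
"""
nodes = parse_node_lines(sample)
all_ok = all(state == "OK" for _, _, state, _ in nodes)
```

A parser that skips unknown lines keeps working whether or not the header line is present, at the cost of silently ignoring output it does not understand.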

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-25
nagios-server-addons-0.2.6-1
gluster-nagios-common-0.2.4-1
gluster-nagios-addons-0.2.8-1

How reproducible:
=================
1:1


Additional info:
=================

[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp46-181.lab.eng.blr.redhat.com
Uuid: 7dd60909-3f7f-4e64-a6eb-4fced5b9aa98
State: Peer in Cluster (Connected)

Hostname: dhcp46-47.lab.eng.blr.redhat.com
Uuid: 8ddacadf-24cc-4631-8742-318995b55f3b
State: Peer in Cluster (Connected)

Hostname: dhcp47-140.lab.eng.blr.redhat.com
Uuid: 2e262e06-d728-4ed6-9375-0c5ae72379af
State: Peer in Cluster (Connected)
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# gluster v list
ctdb
saturday-saturday
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# gluster v info
 
Volume Name: ctdb
Type: Replicate
Volume ID: 64980fe9-85ea-487e-8d0d-39b70c8626b0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.127:/bricks/brick8/ctdb
Brick2: 10.70.46.181:/bricks/brick8/ctdb
Brick3: 10.70.46.47:/bricks/brick8/ctdb
Brick4: 10.70.47.140:/bricks/brick8/ctdb
Options Reconfigured:
nfs.disable: on
transport.address-family: inet
 
Volume Name: saturday-saturday
Type: Distributed-Replicate
Volume ID: 4a24c34c-1144-4f07-9763-6e232c037a67
Status: Started
Snapshot Count: 2
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick0
Brick2: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick1
Brick3: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick2
Brick4: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick3
Brick5: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick4
Brick6: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick5
Brick7: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick6
Brick8: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick7
Brick9: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick8
Brick10: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick9
Brick11: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick10
Brick12: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick11
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.nl-cache: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
transport.address-family: inet
nfs.disable: on
server.allow-insecure: on
performance.stat-prefetch: on
storage.batch-fsync-delay-usec: 0
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.cache-samba-metadata: on
performance.parallel-readdir: on
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# ctdb ip
Public IPs on node 0
10.70.44.154 0
10.70.44.155 2
10.70.44.156 1
10.70.44.157 3
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# ctdb status
Number of nodes:4
pnn:0 10.70.47.127     OK (THIS NODE)
pnn:1 10.70.46.181     OK
pnn:2 10.70.46.47      OK
pnn:3 10.70.47.140     OK
Generation:1636868653
Size:4
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
Recovery mode:NORMAL (0)
Recovery master:3
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# ctdb nodestatus
Number of nodes:4
pnn:0 10.70.47.127     OK (THIS NODE)
pnn:1 10.70.46.181     OK
pnn:2 10.70.46.47      OK
pnn:3 10.70.47.140     OK
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# rpm -qa | egrep nagios|gluster
unrecognized word: nagios-server-addons-0.2.6-1.el7rhgs.x86_64 (position 0)
[root@dhcp47-127 ~]# rpm -qa | grep gluster
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# rpm -qa | grep nagios
nagios-server-addons-0.2.6-1.el7rhgs.x86_64
nagios-plugins-1.4.16-12.el7rhgs.x86_64
nagios-plugins-procs-1.4.16-12.el7rhgs.x86_64
nagios-plugins-ping-1.4.16-12.el7rhgs.x86_64
nagios-3.5.1-9.el7.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
pnp4nagios-0.6.22-3.1.el7rhgs.x86_64
nagios-plugins-ide_smart-1.4.16-12.el7rhgs.x86_64
nagios-plugins-dummy-1.4.16-12.el7rhgs.x86_64
nagios-common-3.5.1-9.el7.x86_64
nagios-plugins-nrpe-2.15-4.2.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
[root@dhcp47-127 ~]# 
[root@dhcp47-127 ~]# 


On nagios web UI:
===================
	
CTDB       CRITICAL	05-18-2017 09:28:22	0d 3h 19m 6s	3/3	Node status: nodes:4 


Current Status:	  CRITICAL   (for 0d 3h 19m 45s)
Status Information:	Node status: nodes:4
Performance Data:	
Current Attempt:	3/3  (HARD state)
Last Check Time:	05-18-2017 09:29:24
Check Type:	PASSIVE
Check Latency / Duration:	N/A / 0.000 seconds
Next Scheduled Check:  	05-19-2017 06:30:43
Last State Change:	05-18-2017 06:09:45
Last Notification:	05-18-2017 08:12:35 (notification 2)
Is This Service Flapping?	  NO   (0.00% state change)
In Scheduled Downtime?	  NO  
Last Update:	05-18-2017 09:29:26  ( 0d 0h 0m 4s ago)
Active Checks:	  ENABLED  
Passive Checks:	  ENABLED  
Obsessing:	  ENABLED  
Notifications:	  ENABLED  
Event Handler:	  ENABLED  
Flap Detection:	  ENABLED

Comment 2 Anoop C S 2017-05-24 09:06:08 UTC

Hi Sweta,

Good catch.

It's a bug in ctdb, for which I have raised an upstream bug. It crept in while the ctdb code was being refactored in v4.5.

Judging from the Nagios message, I assume it is parsing the displayed output of `ctdb nodestatus`. Alternatively, it could rely on the exit status of `ctdb nodestatus`, which reflects the health of the current node.
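A check based on the exit status could look roughly like the Python sketch below. It is not the existing plugin code: the names `nagios_result` and `check_ctdb` are hypothetical, and it assumes that an exit status of 0 from `ctdb nodestatus` means the current node is healthy and that any non-zero status means it is not. The return values 0, 2 and 3 are the standard Nagios plugin codes for OK, CRITICAL and UNKNOWN.

```python
import subprocess

# Standard Nagios plugin return codes.
NAGIOS_OK, NAGIOS_CRITICAL, NAGIOS_UNKNOWN = 0, 2, 3


def nagios_result(returncode, output):
    """Map an exit status and its output to (nagios_code, message).

    Assumption: exit status 0 means the current node is healthy and any
    non-zero exit status means it is not.
    """
    text = output.strip()
    first_line = text.splitlines()[0] if text else "no output"
    if returncode == 0:
        return NAGIOS_OK, "CTDB OK - " + first_line
    return (NAGIOS_CRITICAL,
            "CTDB CRITICAL - exit status %d: %s" % (returncode, first_line))


def check_ctdb(command=("ctdb", "nodestatus")):
    """Run the command and report on its exit status, not its text."""
    try:
        proc = subprocess.run(command, stdout=subprocess.PIPE,
                              stderr=subprocess.STDOUT,
                              universal_newlines=True)
    except OSError as exc:
        # The binary is missing or cannot be executed.
        return NAGIOS_UNKNOWN, "CTDB UNKNOWN - cannot run command: %s" % exc
    return nagios_result(proc.returncode, proc.stdout)
```

Because the decision depends only on the exit status, a change in the printed format would alter the message shown in Nagios but not the reported state.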

Comment 4 Anoop C S 2017-05-26 10:26:48 UTC

This has been fixed upstream via the following commits:

[1] https://git.samba.org/?p=samba.git;a=commit;h=a600d467e2842ab05e429c5a67be5b222ddd1c12

[2] https://git.samba.org/?p=samba.git;a=commit;h=1d10c8e9e637619b754b4a273d3c714fbca7d503

[3] https://git.samba.org/?p=samba.git;a=commit;h=ade535371b86294c12ca3f7eb98d8ef7ecd29caa

Comment 8 Anoop C S 2017-05-30 11:44:41 UTC

Following a discussion within our team, I am moving the component to ctdb, as this is a regression from the previous version.

Comment 11 Sweta Anandpara 2017-06-22 09:33:49 UTC

Tested and verified this on the builds samba-4.6.3-3.el7rhgs.x86_64, ctdb-4.6.3-3.el7rhgs.x86_64 and glusterfs-3.8.4-28.el7rhgs.x86_64.

The `ctdb nodestatus` command (on the CLI) now prints the status of the local node only, as expected. Nagios, when configured, displays the ctdb service as OK.

Moving this to verified in 3.3.

[root@dhcp47-121 ~]# ctdb status
Number of nodes:6
pnn:0 10.70.47.113     OK
pnn:1 10.70.47.114     OK
pnn:2 10.70.47.115     OK
pnn:3 10.70.47.116     OK
pnn:4 10.70.47.117     OK
pnn:5 10.70.47.121     OK (THIS NODE)
Generation:368498462
Size:6
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
hash:4 lmaster:4
hash:5 lmaster:5
Recovery mode:NORMAL (0)
Recovery master:4
[root@dhcp47-121 ~]# ctdb nodestatus
pnn:5 10.70.47.121     OK (THIS NODE)
[root@dhcp47-121 ~]#

Comment 13 errata-xmlrpc 2017-09-21 04:47:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2780
