Description of problem:
=======================
On a 4-node cluster with nagios configured and ctdb enabled, the ctdb service shows as 'CRITICAL' with the message 'Node status: nodes:4'. There seems to be a change in the output format of the 'ctdb status/nodestatus' command, which nagios is not able to parse correctly, resulting in it showing as 'critical' even when all is fine.

Version-Release number of selected component (if applicable):
==============================================================
glusterfs-3.8.4-25
nagios-server-addons-0.2.6-1
gluster-nagios-common-0.2.4-1
gluster-nagios-addons-0.2.8-1

How reproducible:
=================
1:1

Additional info:
=================
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# gluster peer status
Number of Peers: 3

Hostname: dhcp46-181.lab.eng.blr.redhat.com
Uuid: 7dd60909-3f7f-4e64-a6eb-4fced5b9aa98
State: Peer in Cluster (Connected)

Hostname: dhcp46-47.lab.eng.blr.redhat.com
Uuid: 8ddacadf-24cc-4631-8742-318995b55f3b
State: Peer in Cluster (Connected)

Hostname: dhcp47-140.lab.eng.blr.redhat.com
Uuid: 2e262e06-d728-4ed6-9375-0c5ae72379af
State: Peer in Cluster (Connected)
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# gluster v list
ctdb
saturday-saturday
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# gluster v info

Volume Name: ctdb
Type: Replicate
Volume ID: 64980fe9-85ea-487e-8d0d-39b70c8626b0
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 4 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.47.127:/bricks/brick8/ctdb
Brick2: 10.70.46.181:/bricks/brick8/ctdb
Brick3: 10.70.46.47:/bricks/brick8/ctdb
Brick4: 10.70.47.140:/bricks/brick8/ctdb
Options Reconfigured:
nfs.disable: on
transport.address-family: inet

Volume Name: saturday-saturday
Type: Distributed-Replicate
Volume ID: 4a24c34c-1144-4f07-9763-6e232c037a67
Status: Started
Snapshot Count: 2
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick0
Brick2: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick1
Brick3: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick2
Brick4: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick0/saturday-saturday_brick3
Brick5: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick4
Brick6: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick5
Brick7: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick6
Brick8: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick1/saturday-saturday_brick7
Brick9: dhcp47-127.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick8
Brick10: dhcp46-181.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick9
Brick11: dhcp46-47.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick10
Brick12: dhcp47-140.lab.eng.blr.redhat.com:/bricks/brick2/saturday-saturday_brick11
Options Reconfigured:
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.nl-cache: on
features.barrier: disable
features.show-snapshot-directory: enable
features.uss: enable
transport.address-family: inet
nfs.disable: on
server.allow-insecure: on
performance.stat-prefetch: on
storage.batch-fsync-delay-usec: 0
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 50000
performance.cache-samba-metadata: on
performance.parallel-readdir: on
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# ctdb ip
Public IPs on node 0
10.70.44.154 0
10.70.44.155 2
10.70.44.156 1
10.70.44.157 3
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# ctdb status
Number of nodes:4
pnn:0 10.70.47.127 OK (THIS NODE)
pnn:1 10.70.46.181 OK
pnn:2 10.70.46.47 OK
pnn:3 10.70.47.140 OK
Generation:1636868653
Size:4
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
Recovery mode:NORMAL (0)
Recovery master:3
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# ctdb nodestatus
Number of nodes:4
pnn:0 10.70.47.127 OK (THIS NODE)
pnn:1 10.70.46.181 OK
pnn:2 10.70.46.47 OK
pnn:3 10.70.47.140 OK
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# rpm -qa | egrep nagios|gluster
unrecognized word: nagios-server-addons-0.2.6-1.el7rhgs.x86_64 (position 0)
[root@dhcp47-127 ~]# rpm -qa | grep gluster
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-libs-3.8.4-25.el7rhgs.x86_64
glusterfs-cli-3.8.4-25.el7rhgs.x86_64
samba-vfs-glusterfs-4.6.3-0.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-25.el7rhgs.x86_64
glusterfs-server-3.8.4-25.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
glusterfs-api-3.8.4-25.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-25.el7rhgs.x86_64
python-gluster-3.8.4-25.el7rhgs.noarch
glusterfs-fuse-3.8.4-25.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-3.8.4-25.el7rhgs.x86_64
glusterfs-rdma-3.8.4-25.el7rhgs.x86_64
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]# rpm -qa | grep nagios
nagios-server-addons-0.2.6-1.el7rhgs.x86_64
nagios-plugins-1.4.16-12.el7rhgs.x86_64
nagios-plugins-procs-1.4.16-12.el7rhgs.x86_64
nagios-plugins-ping-1.4.16-12.el7rhgs.x86_64
nagios-3.5.1-9.el7.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
pnp4nagios-0.6.22-3.1.el7rhgs.x86_64
nagios-plugins-ide_smart-1.4.16-12.el7rhgs.x86_64
nagios-plugins-dummy-1.4.16-12.el7rhgs.x86_64
nagios-common-3.5.1-9.el7.x86_64
nagios-plugins-nrpe-2.15-4.2.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
[root@dhcp47-127 ~]#
[root@dhcp47-127 ~]#

On nagios web UI:
===================
CTDB CRITICAL 05-18-2017 09:28:22 0d 3h 19m 6s 3/3 Node status: nodes:4

Current Status: CRITICAL (for 0d 3h 19m 45s)
Status Information: Node status: nodes:4
Performance Data:
Current Attempt: 3/3 (HARD state)
Last Check Time: 05-18-2017 09:29:24
Check Type: PASSIVE
Check Latency / Duration: N/A / 0.000 seconds
Next Scheduled Check: 05-19-2017 06:30:43
Last State Change: 05-18-2017 06:09:45
Last Notification: 05-18-2017 08:12:35 (notification 2)
Is This Service Flapping? NO (0.00% state change)
In Scheduled Downtime? NO
Last Update: 05-18-2017 09:29:26 ( 0d 0h 0m 4s ago)

Active Checks: ENABLED
Passive Checks: ENABLED
Obsessing: ENABLED
Notifications: ENABLED
Event Handler: ENABLED
Flap Detection: ENABLED
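
The status text 'nodes:4' is the third whitespace-separated field of the new 'Number of nodes:4' header line, which is where 'OK' sits on a 'pnn:' line. One plausible reading, not confirmed anywhere in this report, is that the plugin takes a fixed field from the first line of output. The Python sketch below shows a parser that matches only 'pnn:' lines and so gives the same answer for either output format. The function names and the regular expression are hypothetical and are not taken from gluster-nagios-addons.

```python
import re

# Hypothetical pattern for one node line, e.g. "pnn:0 10.70.47.127 OK (THIS NODE)".
NODE_LINE = re.compile(r"^pnn:(\d+)\s+(\S+)\s+(\S+)(\s+\(THIS NODE\))?\s*$")


def parse_node_lines(output):
    """Return one dict per 'pnn:' line; any other line is skipped."""
    nodes = []
    for line in output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match is None:
            # Header lines such as "Number of nodes:4" end up here.
            continue
        nodes.append({
            "pnn": int(match.group(1)),
            "address": match.group(2),
            "status": match.group(3),
            "this_node": match.group(4) is not None,
        })
    return nodes


def local_node_status(output):
    """Return the status of the node marked (THIS NODE), or None."""
    nodes = parse_node_lines(output)
    for node in nodes:
        if node["this_node"]:
            return node["status"]
    # Assumption: a single unmarked line still describes the local node.
    if len(nodes) == 1:
        return nodes[0]["status"]
    return None
```

Fed the 'ctdb nodestatus' output quoted in this report, local_node_status() skips the 'Number of nodes:4' header and reads the status from the line marked (THIS NODE).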
Hi Sweta, Good catch. It's a bug in ctdb, for which I have raised an upstream bug. It sneaked in while the ctdb code was being refactored in v4.5. Judging from the nagios messages, I assume that it is trying to parse the displayed output of `ctdb nodestatus`. Instead, it could rely on the exit status of `ctdb nodestatus`, which reflects the health status of the current node.
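
A minimal Python sketch of the exit-status approach suggested above, assuming an exit status of 0 means the local node has no unhealthy flags set and any other value means it does, or that ctdb could not be queried. The function names and message wording are hypothetical and do not come from the existing nagios plugin. The numeric codes follow the usual Nagios plugin convention (0 OK, 2 CRITICAL, 3 UNKNOWN).

```python
import subprocess

# Conventional Nagios plugin return codes.
NAGIOS_OK = 0
NAGIOS_CRITICAL = 2
NAGIOS_UNKNOWN = 3


def nagios_state_from_exit(returncode):
    """Map the exit status of 'ctdb nodestatus' to a Nagios state.

    Assumption: 0 means healthy, anything else is treated as CRITICAL.
    """
    if returncode == 0:
        return NAGIOS_OK, "OK"
    return NAGIOS_CRITICAL, "CRITICAL"


def check_ctdb(run=subprocess.run):
    """Run 'ctdb nodestatus' and return (nagios_code, message).

    The 'run' parameter exists only so the command can be stubbed in tests.
    """
    try:
        proc = run(
            ["ctdb", "nodestatus"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
    except OSError as exc:
        # The ctdb binary is missing or cannot be executed.
        return NAGIOS_UNKNOWN, "CTDB UNKNOWN - could not run ctdb: %s" % exc
    code, label = nagios_state_from_exit(proc.returncode)
    message = "CTDB %s - ctdb nodestatus exited with %d" % (label, proc.returncode)
    return code, message
```

A wrapper script would print the returned message and exit with the returned code. Because the command's output is never parsed, a change in its display format would not affect the result.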
This has been fixed upstream via the following commits:
[1] https://git.samba.org/?p=samba.git;a=commit;h=a600d467e2842ab05e429c5a67be5b222ddd1c12
[2] https://git.samba.org/?p=samba.git;a=commit;h=1d10c8e9e637619b754b4a273d3c714fbca7d503
[3] https://git.samba.org/?p=samba.git;a=commit;h=ade535371b86294c12ca3f7eb98d8ef7ecd29caa
Following the discussion we had in our team, moving the component to ctdb, as this is a regression from the previous version.
Tested and verified this on the build samba-4.6.3-3.el7rhgs.x86_64, ctdb-4.6.3-3.el7rhgs.x86_64 and glusterfs-3.8.4-28.el7rhgs.x86_64

Ctdb nodestatus command (on CLI) gives the output of ctdb status only of the localhost (as expected). Nagios when configured displays the ctdb service as OK.

Moving this to verified in 3.3.

[root@dhcp47-121 ~]# ctdb status
Number of nodes:6
pnn:0 10.70.47.113 OK
pnn:1 10.70.47.114 OK
pnn:2 10.70.47.115 OK
pnn:3 10.70.47.116 OK
pnn:4 10.70.47.117 OK
pnn:5 10.70.47.121 OK (THIS NODE)
Generation:368498462
Size:6
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
hash:3 lmaster:3
hash:4 lmaster:4
hash:5 lmaster:5
Recovery mode:NORMAL (0)
Recovery master:4
[root@dhcp47-121 ~]# ctdb nodestatus
pnn:5 10.70.47.121 OK (THIS NODE)
[root@dhcp47-121 ~]#
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2780