Bug 866670

Summary: ctdb status reports more nodes than available
Product: Red Hat Enterprise Linux 6
Reporter: Mark Heslin 🎸 <mheslin>
Component: ctdb
Assignee: Sumit Bose <sbose>
Status: CLOSED ERRATA
QA Contact: Cluster QE <mspqa-list>
Severity: unspecified
Priority: unspecified
Version: 6.4
CC: jpayne, mheslin
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: ctdb-1.0.114.5-3.el6
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-02-21 08:44:18 UTC
Bug Blocks: 881827

Description Mark Heslin 🎸 2012-10-15 20:20:19 UTC
Description of problem:

  After removing a ctdb node, 'ctdb status' reports the same number of nodes
  as before the node was removed. The Size field, however, is reported correctly.

Version-Release number of selected component (if applicable):

  # ctdb version
  CTDB version: 1.0.114.3-4.el6

How reproducible:

   Remove a node as per the steps outlined below.

Steps to Reproduce:

 1. Verify that the cluster status is healthy and all nodes are up (clustat, ctdb status)

 2. On all nodes, edit the /etc/ctdb/nodes file and comment out the node
    to be removed. Do not delete the line for that node; just comment it out
    by adding a '#' at the beginning of the line.

 3. On one of the nodes not being removed, run 'ctdb reloadnodes'
    to force all nodes to reload the nodes file.

    Note - this will automatically stop ctdb on the node being removed.

 4. Use 'ctdb status' on all remaining nodes and verify that the deleted node
    no longer shows up in the list (see the shell sketch after these steps).
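
   The steps above can be combined into the following shell sketch. The address
   10.0.0.103 and the nodes file path are taken from the example configuration
   in the Additional info section; the sed line assumes the nodes file lists one
   bare IP address per line, as shown there.

   # On every node: comment out the node being removed (10.0.0.103 here)
   sed -i 's/^10\.0\.0\.103$/#10.0.0.103/' /etc/ctdb/nodes

   # On one of the remaining nodes: make all nodes re-read /etc/ctdb/nodes.
   # This also stops ctdb on the node that was commented out.
   ctdb reloadnodes

   # On each remaining node: confirm the removed node is no longer listed
   ctdb status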
  
Actual results:

  # ctdb status
  Number of nodes:3
  pnn:0 10.0.0.101       OK (THIS NODE)
  pnn:1 10.0.0.102       OK
  Generation:935874625
  Size:2
  hash:0 lmaster:0
  hash:1 lmaster:1
  Recovery mode:NORMAL (0)
  Recovery master:0

Expected results:

  # ctdb status
  Number of nodes:2
  pnn:0 10.0.0.101       OK (THIS NODE)
  pnn:1 10.0.0.102       OK
  Generation:935874625
  Size:2
  hash:0 lmaster:0
  hash:1 lmaster:1
  Recovery mode:NORMAL (0)
  Recovery master:0

Additional info:

I spoke to Sumit Bose and he suggested filing this bug so the change can be made upstream.

Configuration details:

  - 3-node cluster, all nodes running RHEL 6.3
  - High Availability (HA) and Resilient Storage (RS) Add-Ons
  - 2 CLVM volumes - 1 lock, 1 data; all Fibre Channel (HP MSA)

# cat /etc/ctdb/nodes
10.0.0.101
10.0.0.102
#10.0.0.103

# cat /etc/ctdb/public_addresses
10.16.142.111/21 bond0
10.16.142.112/21 bond0
10.16.142.113/21 bond0

# cat /etc/sysconfig/ctdb
CTDB_DEBUGLEVEL=ERR
CTDB_NODES=/etc/ctdb/nodes
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_RECOVERY_LOCK=/share/ctdb/.ctdb.lock
CTDB_MANAGES_SAMBA=yes
CTDB_MANAGES_WINBIND=yes

Comment 3 Justin Payne 2013-01-24 22:01:49 UTC
Verified in ctdb-1.0.114.5-3. 'ctdb status' now reports "(including 1 deleted nodes)" in the node count.

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i rpm -q ctdb; done
ctdb-1.0.114.5-3.el6.x86_64
ctdb-1.0.114.5-3.el6.x86_64
ctdb-1.0.114.5-3.el6.x86_64

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i clustat; done
Cluster Status for dash @ Thu Jan 24 15:45:07 2013
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 dash-01                                     1 Online, Local
 dash-02                                     2 Online
 dash-03                                     3 Online

Cluster Status for dash @ Thu Jan 24 15:45:05 2013
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 dash-01                                     1 Online
 dash-02                                     2 Online, Local
 dash-03                                     3 Online

Cluster Status for dash @ Thu Jan 24 15:45:09 2013
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 dash-01                                     1 Online
 dash-02                                     2 Online
 dash-03                                     3 Online, Local

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i ctdb status; done
Number of nodes:3
pnn:0 10.15.89.168     OK (THIS NODE)
pnn:1 10.15.89.169     OK
pnn:2 10.15.89.170     OK
Generation:1377632015
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:1
Number of nodes:3
pnn:0 10.15.89.168     OK
pnn:1 10.15.89.169     OK (THIS NODE)
pnn:2 10.15.89.170     OK
Generation:1377632015
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:1
Number of nodes:3
pnn:0 10.15.89.168     OK
pnn:1 10.15.89.169     OK
pnn:2 10.15.89.170     OK (THIS NODE)
Generation:1377632015
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:1

================ Comment out node dash-03 in nodes file ========================

[root@dash-01 ~]# ctdb reloadnodes
2013/01/24 15:48:50.552267 [14067]: Reloading nodes file on node 1
2013/01/24 15:48:50.552954 [14067]: Reloading nodes file on node 2
2013/01/24 15:48:50.553580 [14067]: Reloading nodes file on node 0

-bash-4.1$ for i in `seq 1 2`; do qarsh root@dash-0$i ctdb status; done
Number of nodes:3 (including 1 deleted nodes)
pnn:0 10.15.89.168     OK (THIS NODE)
pnn:1 10.15.89.169     OK
Generation:1884797164
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1
Number of nodes:3 (including 1 deleted nodes)
pnn:0 10.15.89.168     OK
pnn:1 10.15.89.169     OK (THIS NODE)
Generation:1884797164
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:1
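
As a quicker spot check, the same qarsh loop can be trimmed to show only the
node count on each remaining node (head -1 keeps just the first line of
'ctdb status', which should now include the "(including 1 deleted nodes)" text):

-bash-4.1$ for i in `seq 1 2`; do qarsh root@dash-0$i ctdb status | head -1; done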

Comment 5 errata-xmlrpc 2013-02-21 08:44:18 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0337.html