Bug 821715

Summary: ctdbd on a node crashes when another node in the cluster is brought down.

Product: Red Hat Enterprise Linux 6
Component: ctdb
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Rachana Patel <racpatel>
Assignee: Sumit Bose <sbose>
QA Contact: amainkar
CC: adas, jpayne, kparthas
Target Milestone: rc
Type: Bug
Doc Type: Bug Fix
Fixed In Version: ctdb-1.0.114.5-2.el6
Clones: 825180 (view as bug list)
Bug Blocks: 825180
Last Closed: 2013-02-21 08:44:10 UTC
Attachments:
- contains ctdb logs and gdb "bt full" output (where applicable) (flags: none)

Description Rachana Patel 2012-05-15 12:17:38 UTC
Created attachment 584648 [details]
contains ctdb logs and gdb "bt full" output (where applicable)

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, when one of the nodes is brought down, the ctdbd process on one of the other nodes crashes.
Glusterfs* is being used as the shared filesystem hosting the lockfile for ctdb.
The "nodes" file (/etc/ctdb/nodes) is placed in the shared filesystem as well. Each node has its own /etc/ctdb/public_addresses.

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
always

Steps to Reproduce:
1. Build a CTDB cluster of 4 nodes.
2. Reboot one of the nodes in the cluster.
3. On one of the remaining nodes, observe that ctdbd has crashed.
  
Actual results:
ctdbd crashes with signal 6.
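
A quick way to confirm the crash on the affected node (a minimal sketch; the log path assumes the default CTDB_LOGFILE location and may need adjusting):

  service ctdb status                                # typically reports "ctdb dead but pid file exists"
  grep -iE 'abort|signal|fatal' /var/log/log.ctdb    # look for ctdbd messages around the time of the crash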

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.

Comment 2 krishnan parthasarathi 2012-05-15 12:42:02 UTC
CTDB config information (omitted from the description):

[shared information present on a Glusterfs mount]

[root@QA-42 ~]# cat /gluster/lock/nodes
172.17.251.81
172.17.251.82
172.17.251.83
172.17.251.84

[root@QA-42 ~]# cat /gluster/lock/ctdb 
CTDB_RECOVERY_LOCK=/gluster/lock/lockfile
#CIFS only
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
#CIFS only
CTDB_NODES=/etc/ctdb/nodes

---------------------------------
[Node specific config info]
[root@QA-42 ~]# cat /etc/ctdb/public_addresses 
172.17.251.241/24 eth4
172.17.251.242/24 eth4
172.17.251.243/24 eth4

Comment 3 Rachana Patel 2012-05-17 10:33:54 UTC
We haven't encountered the same issue with the build suggested by Sumit (https://brewweb.devel.redhat.com/taskinfo?taskID=4408685)

Comment 4 Rachana Patel 2012-05-17 11:14:20 UTC
From the above link, we installed the following RPMs:
ctdb-debuginfo-1.2.39-1.el6.x86_64
ctdb-devel-1.2.39-1.el6.x86_64
ctdb-1.2.39-1.el6.x86_64

Comment 5 RHEL Program Management 2012-05-21 06:49:30 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 Abhijith Das 2012-05-21 17:41:09 UTC
FWIW, I tried with my 5-node GFS2 cluster but could not reproduce the failure. This is the ctdb version on my nodes: ctdb-1.0.114.3-4.el6.x86_64

Note: I don't have samba running, just cman, clvmd, gfs2 (for the ctdb lockfile) and ctdb. Killing one or two nodes doesn't kill ctdbd on the other nodes. Membership changes are as expected when the nodes go down.

Rachana/Krishnan, is there anything different you're doing?

Comment 7 krishnan parthasarathi 2012-05-22 09:49:16 UTC
Abhijith,

The differences are:
- Glusterfs is the shared filesystem.
- We 'share' the ctdb lockfile, the 'nodes' file, and the sysconfig/ctdb file (one possible wiring is sketched below).
- We have ctdb starting samba.
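
The exact mechanism for mapping the shared copies on the GlusterFS mount to the paths ctdb reads is not spelled out here; a minimal sketch, assuming symlinks from the standard locations into the mount shown in comment 2, would be:

  ln -sf /gluster/lock/nodes /etc/ctdb/nodes       # shared nodes list
  ln -sf /gluster/lock/ctdb  /etc/sysconfig/ctdb   # shared ctdb sysconfig
  # CTDB_RECOVERY_LOCK in that file already points at the shared /gluster/lock/lockfile

Each node keeps its own /etc/ctdb/public_addresses.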

Comment 8 Sumit Bose 2012-05-22 14:27:53 UTC
Please test the packages from https://brewweb.devel.redhat.com/taskinfo?taskID=4437472 and see if they work for you. Thank you.

Comment 12 Justin Payne 2013-01-30 16:20:52 UTC
Verified in ctdb-1.0.114.5-3.

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000367 sec  (1 clients)
response from 1 time=0.000525 sec  (1 clients)
response from 2 time=0.000104 sec  (2 clients)

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-01 ~]# ctdb ping -n all
response from 0 time=0.000086 sec  (2 clients)
response from 1 time=0.000510 sec  (1 clients)
response from 2 time=0.000371 sec  (1 clients)


[root@dash-03 ~]# reboot && exit

Broadcast message from root@dash-03
        (/dev/pts/0) at 9:59 ...

The system is going down for reboot NOW!
logout
Connection to dash-03.lab.msp.redhat.com closed.

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i service ctdb status; done
ctdb dead but pid file exists
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.15.89.169     HEALTHY (THIS NODE)
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:488801060
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Checking for ctdbd service:   ctdbd not running. ctdb is stopped


========== After update ===============

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:449208127
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000389 sec  (1 clients)
response from 1 time=0.000474 sec  (3 clients)
response from 2 time=0.000113 sec  (2 clients)

[root@dash-03 ~]# reboot && exit

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1754606152
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0

Comment 14 errata-xmlrpc 2013-02-21 08:44:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0337.html