Bug 821715 - ctdbd on a node crashes when another node in the cluster is brought down.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ctdb
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Sumit Bose
QA Contact: amainkar
Blocks: 825180
Reported: 2012-05-15 08:17 EDT by Rachana Patel
Modified: 2015-04-20 07:58 EDT
CC: 3 users

Fixed In Version: ctdb-1.0.114.5-2.el6
Doc Type: Bug Fix
Clones: 825180
Last Closed: 2013-02-21 03:44:10 EST
Type: Bug
Attachments
contains ctdb logs and gdb "bt full" output (where applicable) (933.84 KB, application/x-gzip)
2012-05-15 08:17 EDT, Rachana Patel

Description Rachana Patel 2012-05-15 08:17:38 EDT
Created attachment 584648 [details]
contains ctdb logs and gdb "bt full" output (where applicable)

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, when one of the nodes is brought down, the ctdbd process on one of the other nodes crashes.
Glusterfs* is being used as the shared filesystem hosting the lockfile for ctdb.
The "nodes" file (/etc/ctdb/nodes) is placed on the shared filesystem as well. Each node has its own /etc/ctdb/public_addresses.
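For reference, the layout described above can be sketched as commands run on each node. The /gluster/lock paths match those shown later in comment 2; the mount source and the use of symlinks to wire the shared files to the paths ctdb expects are assumptions, not details from the report:

```shell
# Assumed: the Glusterfs volume holding the shared ctdb files is
# mounted at /gluster/lock (the mount source name is hypothetical).
mount -t glusterfs gluster-server:/lock-vol /gluster/lock

# The shared files live on the Glusterfs mount; symlinking them into
# place is one plausible way to share them (an assumption).
ln -sf /gluster/lock/nodes /etc/ctdb/nodes
ln -sf /gluster/lock/ctdb  /etc/sysconfig/ctdb

# /etc/ctdb/public_addresses stays local to each node.
```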

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
always

Steps to Reproduce:
1. Build a CTDB cluster of 4 nodes.
2. Reboot one of the nodes in the cluster.
3. Observe that ctdbd crashes on one of the remaining nodes.
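The steps above can be sketched as a command transcript; the node hostnames are illustrative, and the log path is an assumption based on ctdb 1.x defaults:

```shell
# Confirm all 4 nodes are HEALTHY before the test (run on any node):
ctdb status

# Bring one node down (node4 is a hypothetical hostname):
ssh root@node4 reboot

# On the surviving nodes, check whether ctdbd crashed:
service ctdb status              # a crashed node reports "ctdb dead but pid file exists"
grep -i abort /var/log/log.ctdb  # assumed default log location for ctdb 1.x
```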
  
Actual results:
ctdbd crashes with signal 6 (SIGABRT).

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.
Comment 2 krishnan parthasarathi 2012-05-15 08:42:02 EDT
CTDB config information (omitted from the description):

[shared information present on a Glusterfs mount]

[root@QA-42 ~]# cat /gluster/lock/nodes
172.17.251.81
172.17.251.82
172.17.251.83
172.17.251.84

[root@QA-42 ~]# cat /gluster/lock/ctdb 
CTDB_RECOVERY_LOCK=/gluster/lock/lockfile
#CIFS only
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
#CIFS only
CTDB_NODES=/etc/ctdb/nodes

---------------------------------
[Node specific config info]
[root@QA-42 ~]# cat /etc/ctdb/public_addresses 
172.17.251.241/24 eth4
172.17.251.242/24 eth4
172.17.251.243/24 eth4
Comment 3 Rachana Patel 2012-05-17 06:33:54 EDT
We haven't encountered the same issue with the build suggested by Sumit (https://brewweb.devel.redhat.com/taskinfo?taskID=4408685)
Comment 4 Rachana Patel 2012-05-17 07:14:20 EDT
From the above link we installed the following RPMs:
ctdb-debuginfo-1.2.39-1.el6.x86_64
ctdb-devel-1.2.39-1.el6.x86_64
ctdb-1.2.39-1.el6.x86_64
Comment 5 RHEL Product and Program Management 2012-05-21 02:49:30 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 6 Abhijith Das 2012-05-21 13:41:09 EDT
FWIW, I tried with my 5 node GFS2 cluster but could not reproduce the failure. This is the ctdb version on my nodes: ctdb-1.0.114.3-4.el6.x86_64

Note: I don't have samba running, just cman, clvmd, gfs2 (for ctdb lockfile) and ctdb. Killing one or two nodes doesn't kill ctdbd on other nodes. Membership changes are as expected when the nodes go down.

Rachana/Krishnan, is there anything different you're doing?
Comment 7 krishnan parthasarathi 2012-05-22 05:49:16 EDT
Abhijith,

The differences are:
- Glusterfs is the shared filesystem.
- We share the ctdb lockfile, the 'nodes' file, and the sysconfig/ctdb file on the Glusterfs mount.
- We have ctdb start Samba.
Comment 8 Sumit Bose 2012-05-22 10:27:53 EDT
Please test the packages from https://brewweb.devel.redhat.com/taskinfo?taskID=4437472 and see whether they work for you. Thank you.
Comment 12 Justin Payne 2013-01-30 11:20:52 EST
Verified in ctdb-1.0.114.5-3.

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000367 sec  (1 clients)
response from 1 time=0.000525 sec  (1 clients)
response from 2 time=0.000104 sec  (2 clients)

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-01 ~]# ctdb ping -n all
response from 0 time=0.000086 sec  (2 clients)
response from 1 time=0.000510 sec  (1 clients)
response from 2 time=0.000371 sec  (1 clients)


[root@dash-03 ~]# reboot && exit

Broadcast message from root@dash-03
        (/dev/pts/0) at 9:59 ...

The system is going down for reboot NOW!
logout
Connection to dash-03.lab.msp.redhat.com closed.

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i service ctdb status; done
ctdb dead but pid file exists
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.15.89.169     HEALTHY (THIS NODE)
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:488801060
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Checking for ctdbd service:   ctdbd not running. ctdb is stopped


========== After update ===============

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:449208127
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000389 sec  (1 clients)
response from 1 time=0.000474 sec  (3 clients)
response from 2 time=0.000113 sec  (2 clients)

[root@dash-03 ~]# reboot && exit

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1754606152
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Comment 14 errata-xmlrpc 2013-02-21 03:44:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0337.html
