Bug 821715

Summary: ctdbd on a node crashes when another node in the cluster is brought down.

Product: Red Hat Enterprise Linux 6
Component: ctdb
Version: 6.2
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Rachana Patel <racpatel>
Assignee: Sumit Bose <sbose>
QA Contact: amainkar
CC: adas, jpayne, kparthas
Target Milestone: rc
Type: Bug
Doc Type: Bug Fix
Fixed In Version: ctdb-1.0.114.5-2.el6
Clones: 825180 (view as bug list)
Bug Blocks: 825180
Last Closed: 2013-02-21 08:44:10 UTC
Attachments:
- contains ctdb logs and gdb "bt full" output (where applicable) (flags: none)

Description Rachana Patel 2012-05-15 12:17:38 UTC
Created attachment 584648 [details]
contains ctdb logs and gdb "bt full" output (where applicable)

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, when one of the nodes is brought down, the ctdbd process on one of the other nodes crashes.
Glusterfs* is being used as the shared filesystem hosting the lockfile for ctdb.
The "nodes" file (/etc/ctdb/nodes) is placed in the shared filesystem as well. Each node has its own /etc/ctdb/public_addresses.

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
always

Steps to Reproduce:
1. Build a CTDB cluster of 4 nodes.
2. Reboot one of the nodes in the cluster.
3. On one of the remaining nodes, observe that ctdbd has crashed.
  
Actual results:
ctdbd crashes with signal 6.
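
A quick way to confirm the crash on the affected node (a minimal sketch; the log path assumes the default CTDB_LOGFILE location and may need adjusting):

  service ctdb status                                # typically reports "ctdb dead but pid file exists"
  grep -iE 'abort|signal|fatal' /var/log/log.ctdb    # look for ctdbd messages around the time of the crash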

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.

Comment 2 krishnan parthasarathi 2012-05-15 12:42:02 UTC
CTDB config information (omitted from the description):

[shared information present on a Glusterfs mount]

[root@QA-42 ~]# cat /gluster/lock/nodes
172.17.251.81
172.17.251.82
172.17.251.83
172.17.251.84

[root@QA-42 ~]# cat /gluster/lock/ctdb 
CTDB_RECOVERY_LOCK=/gluster/lock/lockfile
#CIFS only
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
#CIFS only
CTDB_NODES=/etc/ctdb/nodes

---------------------------------
[Node specific config info]
[root@QA-42 ~]# cat /etc/ctdb/public_addresses 
172.17.251.241/24 eth4
172.17.251.242/24 eth4
172.17.251.243/24 eth4

Comment 3 Rachana Patel 2012-05-17 10:33:54 UTC
We haven't encountered the same issue with the build suggested by Sumit (https://brewweb.devel.redhat.com/taskinfo?taskID=4408685)

Comment 4 Rachana Patel 2012-05-17 11:14:20 UTC
From the above link, we installed the following RPMs:
ctdb-debuginfo-1.2.39-1.el6.x86_64
ctdb-devel-1.2.39-1.el6.x86_64
ctdb-1.2.39-1.el6.x86_64

Comment 5 RHEL Program Management 2012-05-21 06:49:30 UTC
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 6 Abhijith Das 2012-05-21 17:41:09 UTC
FWIW, I tried with my 5-node GFS2 cluster but could not reproduce the failure. This is the ctdb version on my nodes: ctdb-1.0.114.3-4.el6.x86_64

Note: I don't have samba running, just cman, clvmd, gfs2 (for the ctdb lockfile) and ctdb. Killing one or two nodes doesn't kill ctdbd on the other nodes. Membership changes are as expected when the nodes go down.

Rachana/Krishnan, is there anything different you're doing?

Comment 7 krishnan parthasarathi 2012-05-22 09:49:16 UTC
Abhijith,

The differences are:
- Glusterfs is the shared filesystem.
- We 'share' the ctdb lockfile, the 'nodes' file, and the sysconfig/ctdb file (one possible wiring is sketched below).
- We have ctdb starting samba.
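
The exact mechanism for mapping the shared copies on the GlusterFS mount to the paths ctdb reads is not spelled out here; a minimal sketch, assuming symlinks from the standard locations into the mount shown in comment 2, would be:

  ln -sf /gluster/lock/nodes /etc/ctdb/nodes       # shared nodes list
  ln -sf /gluster/lock/ctdb  /etc/sysconfig/ctdb   # shared ctdb sysconfig
  # CTDB_RECOVERY_LOCK in that file already points at the shared /gluster/lock/lockfile

Each node keeps its own /etc/ctdb/public_addresses.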

Comment 8 Sumit Bose 2012-05-22 14:27:53 UTC
Please test the packages from https://brewweb.devel.redhat.com/taskinfo?taskID=4437472 and see if they work for you. Thank you.

Comment 12 Justin Payne 2013-01-30 16:20:52 UTC
Verified in ctdb-1.0.114.5-3.

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000367 sec  (1 clients)
response from 1 time=0.000525 sec  (1 clients)
response from 2 time=0.000104 sec  (2 clients)

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-01 ~]# ctdb ping -n all
response from 0 time=0.000086 sec  (2 clients)
response from 1 time=0.000510 sec  (1 clients)
response from 2 time=0.000371 sec  (1 clients)


[root@dash-03 ~]# reboot && exit

Broadcast message from root@dash-03
        (/dev/pts/0) at 9:59 ...

The system is going down for reboot NOW!
logout
Connection to dash-03.lab.msp.redhat.com closed.

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i service ctdb status; done
ctdb dead but pid file exists
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.15.89.169     HEALTHY (THIS NODE)
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:488801060
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Checking for ctdbd service:   ctdbd not running. ctdb is stopped


========== After update ===============

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:449208127
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000389 sec  (1 clients)
response from 1 time=0.000474 sec  (3 clients)
response from 2 time=0.000113 sec  (2 clients)

[root@dash-03 ~]# reboot && exit

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1754606152
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0

Comment 14 errata-xmlrpc 2013-02-21 08:44:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0337.html