Bug 821715 - ctdbd on a node crashes when another node in the cluster is brought down.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: ctdb
Version: 6.2
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Assigned To: Sumit Bose
QA Contact: amainkar
Blocks: 825180
Reported: 2012-05-15 08:17 EDT by Rachana Patel
Modified: 2015-04-20 07:58 EDT
CC: 3 users

Fixed In Version: ctdb-1.0.114.5-2.el6
Doc Type: Bug Fix
Clones: 825180
Last Closed: 2013-02-21 03:44:10 EST
Type: Bug
Attachments
contains ctdb logs and gdb "bt full" output (where applicable) (933.84 KB, application/x-gzip)
2012-05-15 08:17 EDT, Rachana Patel

Description Rachana Patel 2012-05-15 08:17:38 EDT
Created attachment 584648 [details]
contains ctdb logs and gdb "bt full" output (where applicable)

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, when one of the nodes is brought down, the ctdbd process on one of the other nodes crashes.
Glusterfs* is being used as the shared filesystem hosting the lockfile for ctdb.
The "nodes" file (/etc/ctdb/nodes) is placed on the shared filesystem as well. Each node has its own /etc/ctdb/public_addresses.
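For reference, the layout described above can be sketched as commands run on each node. The /gluster/lock paths match those shown later in comment 2; the mount source and the use of symlinks to wire the shared files to the paths ctdb expects are assumptions, not details from the report:

```shell
# Assumed: the Glusterfs volume holding the shared ctdb files is
# mounted at /gluster/lock (the mount source name is hypothetical).
mount -t glusterfs gluster-server:/lock-vol /gluster/lock

# The shared files live on the Glusterfs mount; symlinking them into
# place is one plausible way to share them (an assumption).
ln -sf /gluster/lock/nodes /etc/ctdb/nodes
ln -sf /gluster/lock/ctdb  /etc/sysconfig/ctdb

# /etc/ctdb/public_addresses stays local to each node.
```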

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
always

Steps to Reproduce:
1. Build a CTDB cluster of 4 nodes.
2. Reboot one of the nodes in the cluster.
3. Observe that ctdbd crashes on one of the remaining nodes.
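The steps above can be sketched as a command transcript; the node hostnames are illustrative, and the log path is an assumption based on ctdb 1.x defaults:

```shell
# Confirm all 4 nodes are HEALTHY before the test (run on any node):
ctdb status

# Bring one node down (node4 is a hypothetical hostname):
ssh root@node4 reboot

# On the surviving nodes, check whether ctdbd crashed:
service ctdb status              # a crashed node reports "ctdb dead but pid file exists"
grep -i abort /var/log/log.ctdb  # assumed default log location for ctdb 1.x
```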
  
Actual results:
ctdbd crashes with signal 6 (SIGABRT).

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.
Comment 2 krishnan parthasarathi 2012-05-15 08:42:02 EDT
CTDB config information (omitted from the description):

[shared information present on a Glusterfs mount]

[root@QA-42 ~]# cat /gluster/lock/nodes
172.17.251.81
172.17.251.82
172.17.251.83
172.17.251.84

[root@QA-42 ~]# cat /gluster/lock/ctdb 
CTDB_RECOVERY_LOCK=/gluster/lock/lockfile
#CIFS only
CTDB_PUBLIC_ADDRESSES=/etc/ctdb/public_addresses
CTDB_MANAGES_SAMBA=yes
#CIFS only
CTDB_NODES=/etc/ctdb/nodes

---------------------------------
[Node specific config info]
[root@QA-42 ~]# cat /etc/ctdb/public_addresses 
172.17.251.241/24 eth4
172.17.251.242/24 eth4
172.17.251.243/24 eth4
Comment 3 Rachana Patel 2012-05-17 06:33:54 EDT
We haven't encountered the same issue with the build suggested by Sumit (https://brewweb.devel.redhat.com/taskinfo?taskID=4408685)
Comment 4 Rachana Patel 2012-05-17 07:14:20 EDT
From the above link we installed the following RPMs:
ctdb-debuginfo-1.2.39-1.el6.x86_64
ctdb-devel-1.2.39-1.el6.x86_64
ctdb-1.2.39-1.el6.x86_64
Comment 5 RHEL Product and Program Management 2012-05-21 02:49:30 EDT
This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.
Comment 6 Abhijith Das 2012-05-21 13:41:09 EDT
FWIW, I tried with my 5 node GFS2 cluster but could not reproduce the failure. This is the ctdb version on my nodes: ctdb-1.0.114.3-4.el6.x86_64

Note: I don't have samba running, just cman, clvmd, gfs2 (for ctdb lockfile) and ctdb. Killing one or two nodes doesn't kill ctdbd on other nodes. Membership changes are as expected when the nodes go down.

Rachana/Krishnan, is there anything different you're doing?
Comment 7 krishnan parthasarathi 2012-05-22 05:49:16 EDT
Abhijith,

The differences are:
- Glusterfs is the shared filesystem.
- We share the ctdb lockfile, the 'nodes' file, and the sysconfig/ctdb file on the Glusterfs mount.
- We have ctdb start Samba.
Comment 8 Sumit Bose 2012-05-22 10:27:53 EDT
Please test the packages from https://brewweb.devel.redhat.com/taskinfo?taskID=4437472 and see whether they work for you. Thank you.
Comment 12 Justin Payne 2013-01-30 11:20:52 EST
Verified in ctdb-1.0.114.5-3.

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000367 sec  (1 clients)
response from 1 time=0.000525 sec  (1 clients)
response from 2 time=0.000104 sec  (2 clients)

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.3-3.el6.x86_64
[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY
Generation:185497490
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-01 ~]# ctdb ping -n all
response from 0 time=0.000086 sec  (2 clients)
response from 1 time=0.000510 sec  (1 clients)
response from 2 time=0.000371 sec  (1 clients)


[root@dash-03 ~]# reboot && exit

Broadcast message from root@dash-03
        (/dev/pts/0) at 9:59 ...

The system is going down for reboot NOW!
logout
Connection to dash-03.lab.msp.redhat.com closed.

-bash-4.1$ for i in `seq 1 3`; do qarsh root@dash-0$i service ctdb status; done
ctdb dead but pid file exists
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     DISCONNECTED|UNHEALTHY|INACTIVE
pnn:1 10.15.89.169     HEALTHY (THIS NODE)
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:488801060
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Checking for ctdbd service:   ctdbd not running. ctdb is stopped


========== After update ===============

[root@dash-03 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-03 ~]# service ctdb status
Checking for ctdbd service: 
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     HEALTHY (THIS NODE)
Generation:449208127
Size:3
hash:0 lmaster:0
hash:1 lmaster:1
hash:2 lmaster:2
Recovery mode:NORMAL (0)
Recovery master:0
[root@dash-03 ~]# ctdb ping -n all
response from 0 time=0.000389 sec  (1 clients)
response from 1 time=0.000474 sec  (3 clients)
response from 2 time=0.000113 sec  (2 clients)

[root@dash-03 ~]# reboot && exit

[root@dash-01 ~]# rpm -q ctdb
ctdb-1.0.114.5-3.el6.x86_64

[root@dash-01 ~]# service ctdb status
Checking for ctdbd service:
Number of nodes:3
pnn:0 10.15.89.168     HEALTHY (THIS NODE)
pnn:1 10.15.89.169     HEALTHY
pnn:2 10.15.89.170     DISCONNECTED|UNHEALTHY|INACTIVE
Generation:1754606152
Size:2
hash:0 lmaster:0
hash:1 lmaster:1
Recovery mode:NORMAL (0)
Recovery master:0
Comment 14 errata-xmlrpc 2013-02-21 03:44:10 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0337.html
