825180 – ctdbd on a node crashes when another node in the cluster is brought down.


Summary: ctdbd on a node crashes when another node in the cluster is brought down.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Gluster Storage
Classification:	Red Hat Storage
Component:	samba
Sub Component:
Version:	2.0
Hardware:	x86_64
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Christopher R. Hertel
QA Contact:	Sudhir D
Docs Contact:
URL:
Whiteboard:
Depends On:	821715
Blocks:	956495

Reported:	2012-05-25 09:57 UTC by Sumit Bose
Modified:	2014-09-29 00:21 UTC
CC List:	9 users
Fixed In Version:	ctdb-1.0.114.5-1.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:	821715
Environment:
Last Closed:	2013-09-23 22:32:16 UTC
Embargoed:
Dependent Products:


Description Sumit Bose 2012-05-25 09:57:05 UTC

+++ This bug was initially created as a clone of Bug #821715 +++

Description of problem:
In a CTDB cluster of 4 nodes serving 3 public addresses, one of the nodes was brought down. The ctdbd process(es) on one of the other nodes then crashed.
Glusterfs* is being used as the shared filesystem hosting the lock file for CTDB.
The "nodes" file (/etc/ctdb/nodes) is placed in the shared filesystem as well. Each node has its own /etc/ctdb/public_addresses.
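For illustration only, here is a minimal sketch of the two files mentioned above and a quick sanity check of their contents. It assumes the usual layout: one private address per line in the nodes file, and "address/prefix interface" per line in public_addresses. All addresses, the interface name, and the helper names are hypothetical and are not taken from the affected cluster.

```python
import ipaddress

def parse_nodes(text):
    """Return the private addresses listed one per line in a nodes file."""
    return [ipaddress.ip_address(line.strip())
            for line in text.splitlines() if line.strip()]

def parse_public_addresses(text):
    """Return (interface address, interface name) pairs from public_addresses."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        address = ipaddress.ip_interface(fields[0])   # e.g. 198.51.100.10/24
        ifname = fields[1] if len(fields) > 1 else None
        entries.append((address, ifname))
    return entries

# Hypothetical contents for a 4-node cluster serving 3 public addresses.
NODES = """\
192.0.2.1
192.0.2.2
192.0.2.3
192.0.2.4
"""

PUBLIC_ADDRESSES = """\
198.51.100.10/24 eth0
198.51.100.11/24 eth0
198.51.100.12/24 eth0
"""

nodes = parse_nodes(NODES)
public = parse_public_addresses(PUBLIC_ADDRESSES)
print(len(nodes), "nodes,", len(public), "public addresses")
```

A malformed address in either file raises ValueError, which makes this a cheap check to run before starting the cluster.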

Version-Release number of selected component (if applicable):
CTDB version: 1.0.114.3-3.el6

How reproducible:
always

Steps to Reproduce:
1. Build a ctdb cluster of 4 nodes.
2. Reboot one of the nodes in the cluster.
3. Check the remaining nodes: on one of them, ctdbd has crashed.
  
Actual results:
ctdbd crashes with signal 6 (SIGABRT).

Expected results:
ctdbd should not crash.

Additional info:
*Glusterfs is a network filesystem.

Comment 2 Ujjwala 2012-12-26 10:10:49 UTC

verified on the build glusterfs 3.4.0qa5 built on Dec 17 2012 and works fine.

Comment 3 Ujjwala 2012-12-28 10:45:21 UTC

The behavior is inconsistent. The following behaviors were observed:

In a 4-node cluster, when one of the nodes is powered off:
> Sometimes it works fine.
> One of the nodes goes into a banned state and stays banned indefinitely.
The only way to recover is to restart the ctdb service (happens 5 out of 10 times).
> On one of the nodes, ctdb crashes (happened only once).

The sosreports for the node that goes into the banned state and the node where ctdb crashed are available at:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/825180/

Tested on the build - glusterfs 3.4.0qa5 built on Dec 17

(In reply to comment #2)
> verified on the build glusterfs 3.4.0qa5 built on Dec 17 2012 and works fine.

Comment 4 Christopher R. Hertel 2013-07-18 07:06:28 UTC

Please re-test.

Gluster byte-range locking (used by CTDB when negotiating recovery) has been patched, and we have a new version of CTDB with several bugfixes applied.

Note, however, that if the cluster is not running in replicated mode, and if the one node that is shut down is also the one that "owns" the recovery lock file, then CTDB will fail to recover because the lock file is missing. CTDB needs to have the lock file available in order to perform recovery.
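
To illustrate the dependency on the lock file, the following is a rough sketch of a non-blocking byte-range lock attempt. It is not CTDB's code: it assumes the recovery lock behaves like an exclusive fcntl lock on the first byte of the file, and the function name try_recovery_lock is made up for this example. If the file cannot be opened (for instance because the only brick holding it is down), there is nothing to lock and the attempt fails.

```python
import fcntl
import os

def try_recovery_lock(path):
    """Try to take a non-blocking exclusive lock on byte 0 of `path`.

    Returns the open file descriptor on success, or None if the file
    cannot be opened or the lock is already held by another process.
    """
    try:
        # Simplification: no O_CREAT, so a missing or unreachable file
        # models the "lock file is not available" case.
        fd = os.open(path, os.O_RDWR)
    except OSError:
        return None
    try:
        # Exclusive, non-blocking lock on 1 byte starting at offset 0.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, 0, os.SEEK_SET)
    except OSError:
        os.close(fd)
        return None
    return fd
```

The caller keeps the returned descriptor open for as long as it needs the lock; closing it releases the lock.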

Also, this bug was created as a clone of bug 821715, which has been closed.

Comment 6 surabhi 2013-07-25 12:32:45 UTC

Verified with the latest version:
glusterfs 3.4.0.12rhs.beta6

[root@dhcp159-0 ~]# rpm -qa | grep samba
samba-3.6.9-155.5.el6rhs.x86_64
samba-domainjoin-gui-3.6.9-155.5.el6rhs.x86_64
samba-winbind-clients-3.6.9-155.5.el6rhs.x86_64
samba-common-3.6.9-155.5.el6rhs.x86_64
samba-doc-3.6.9-155.5.el6rhs.x86_64
samba-glusterfs-3.6.9-155.5.el6rhs.x86_64
samba-winbind-3.6.9-155.5.el6rhs.x86_64
samba-winbind-devel-3.6.9-155.5.el6rhs.x86_64
samba-debuginfo-3.6.9-155.5.el6rhs.x86_64
samba-swat-3.6.9-155.5.el6rhs.x86_64
samba-winbind-krb5-locator-3.6.9-155.5.el6rhs.x86_64

Comment 8 Scott Haines 2013-09-23 22:32:16 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html
